In Part 1, we built the foundation: page-based storage and a buffer pool. But thereโs a problem. Finding a row requires a full table scan:
// Without an index: O(n)
for page in table_pages {
for row in page.rows {
if row.id == 42 {
return row; // Found it! (maybe on the last page)
}
}
}
For a table with 1 million rows? Thatโs 1 million comparisons. On disk? Unacceptable. Real databases use indexes to find rows in O(log n) time. PostgreSQLโs default: B+Tree. Today: implementing a thread-safe B+Tree index in Rust, integrated with our buffer pool. And yes, concurrent access is as hard as it sounds.
1 Why B+Tree?
The Alternatives
| Index Type | Lookup | Range Scan | Insert | Use Case |
|---|---|---|---|---|
| Hash Table | O(1) | โ Not supported | O(1) | Exact match only |
| B+Tree | O(log n) | โ Excellent | O(log n) | General purpose |
| LSM Tree | O(log n) | โ ๏ธ Compaction needed | O(1) amortized | Write-heavy (RocksDB) |
| Skip List | O(log n) | โ Good | O(log n) | In-memory (Redis) |
| PostgreSQL chose B+Tree because: | ||||
| Reason | Why It Matters | |||
| -------- | ---------------- | |||
| Balanced | All leaf nodes at same depthโpredictable performance | |||
| Range queries | Leaf nodes linkedโefficient WHERE id BETWEEN 10 AND 100 | |||
| Disk-friendly | High fanout (100s of keys per node)โshallow trees | |||
| Self-balancing | No manual reorganization needed |
B+Tree Structure
โโโโโโโโโโโโโโโโโโโ
โ Root Node โ โ Page 0
โ [10] โ [50] โ
โโโโโโโโโโฌโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโ
โ โ โ
โผ โผ โผ
โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโ
โ Node [10] โ โ Node [50] โ โ Node [โ] โ โ Page 1, 2, 3
โ [1,5,7]โ[10] โ โ [20,30]โ[50] โ โ [60,80,90] โ
โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโ
โ โ โ
โผ โผ โผ
โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโ
โ Leaf [1,5,7] โ โLeaf [10,20,30โ โLeaf [50,60, โ โ Page 4, 5, 6
โ โ next: 5 โ โ โ next: 6 โ โ 80,90] โ
โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโ
Key properties:
| Property | Value in Vaultgres |
|---|---|
| Order (fanout) | ~500 keys per internal node (8KB page) |
| Height | 3-4 levels for 1 billion keys |
| Leaf nodes | Contain actual data pointers (TIDs) |
| Internal nodes | Only keys + child pointers (routing) |
| Linked leaves | Each leaf points to nextโrange scans without tree traversal |
2 Node Layout: Fitting B+Tree into Pages
Internal Node Structure
Each node fits in one 8KB page:
// src/index/btree_node.rs
use crate::storage::page::{Page, PAGE_SIZE, PAGE_HEADER_SIZE};
pub const BTREE_NODE_HEADER_SIZE: usize = 32;
pub const MAX_KEYS_PER_NODE: usize = (PAGE_SIZE - BTREE_NODE_HEADER_SIZE) / 12; // ~680 keys
#[repr(C)]
pub struct BTreeNodeHeader {
pub is_leaf: bool, // 1 byte
pub key_count: u16, // 2 bytes
pub parent_page: u64, // 8 bytes
pub right_sibling: u64, // 8 bytes (for leaf nodes)
pub level: u16, // 2 bytes (distance from leaf)
_padding: [u8; 10], // Pad to 32 bytes
}
pub struct BTreeNode {
page: Page,
}
impl BTreeNode {
pub fn new_leaf() -> Self {
let mut page = Page::new();
let header = BTreeNodeHeader {
is_leaf: true,
key_count: 0,
parent_page: 0,
right_sibling: 0,
level: 0,
_padding: [0; 10],
};
// Write header to page...
Self { page }
}
pub fn new_internal(level: u16) -> Self {
let mut page = Page::new();
let header = BTreeNodeHeader {
is_leaf: false,
key_count: 0,
parent_page: 0,
right_sibling: 0,
level,
_padding: [0; 10],
};
Self { page }
}
}
Key layout inside a page:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ PageHeader (24 bytes) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ BTreeNodeHeader (32 bytes) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Keys (variable size, sorted) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Child Pointers (8 bytes each) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Free space โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Key and Value Format
For an index on users.id (integer):
pub struct BTreeKey {
data: Vec<u8>, // Serialized key (e.g., 4 bytes for i32)
}
impl BTreeKey {
pub fn from_i32(value: i32) -> Self {
Self {
data: value.to_le_bytes().to_vec(),
}
}
pub fn to_i32(&self) -> i32 {
i32::from_le_bytes(self.data.try_into().unwrap())
}
}
// Leaf node value: points to actual row (TID = Table ID + Page + Offset)
pub struct BTreeValue {
pub table_id: u64,
pub page_num: u32,
pub offset: u16,
}
3 Basic B+Tree Operations
Search: Finding a Key
impl BTreeIndex {
pub fn search(&self, key: &BTreeKey) -> Option<BTreeValue> {
let mut current_page = self.root_page;
loop {
let node = self.get_node(current_page)?;
if node.is_leaf() {
// Found leafโsearch for key
return node.search_leaf(key);
} else {
// Internal nodeโfind child to descend
let child_idx = node.find_child_index(key);
current_page = node.get_child_pointer(child_idx);
}
}
}
}
impl BTreeNode {
pub fn search_leaf(&self, key: &BTreeKey) -> Option<BTreeValue> {
// Binary search within leaf
let idx = self.binary_search(key)?;
self.get_value(idx)
}
fn binary_search(&self, key: &BTreeKey) -> Option<usize> {
let mut low = 0;
let mut high = self.key_count();
while low < high {
let mid = (low + high) / 2;
match self.get_key(mid).cmp(key) {
std::cmp::Ordering::Equal => return Some(mid),
std::cmp::Ordering::Less => low = mid + 1,
std::cmp::Ordering::Greater => high = mid,
}
}
None
}
}
Complexity: O(log_f n) where f = fanout (~500). For 1 billion keys: ~4 page accesses.
Insert: The Hard Part
Insertion is where B+Trees get complicated:
Step-by-step example:
Initial state (order = 3 for simplicity):
โโโโโโโโโโโโโโโโโโโ
โ Root: [50] โ
โโโโโโโโโโฌโโโโโโโโโ
โ
โโโโโโดโโโโโ
โ โ
โผ โผ
โโโโโโโโโ โโโโโโโโโ
โ[10,30]โ โ[60,80]โ โ Leaf full!
โโโโโโโโโ โโโโโโโโโ
Insert 70:
1. Find leaf: [60,80]
2. Leaf is fullโsplit!
3. New leaves: [60] and [70,80]
4. Promote 70 to parent
Result:
โโโโโโโโโโโโโโโโโโโโโ
โ Root: [50]โ[70] โ
โโโโโโโโโโโฌโโโโโโโโโโ
โ
โโโโโโโผโโโโโโ
โ โ โ
โผ โผ โผ
โโโโโโ โโโโโโ โโโโโโโโ
โ[10]โ โ[30]โ โ[60,80โ โ Wait, that's wrong...
โโโโโโ โโโโโโ โโโโโโโโ
Correct split:
impl BTreeNode {
pub fn split(&mut self) -> (BTreeNode, BTreeKey) {
let mid = self.key_count() / 2;
let separator = self.get_key(mid);
// Create new sibling node
let mut new_node = BTreeNode::new_leaf();
// Move half the keys to new node
for i in mid..self.key_count() {
new_node.insert(self.get_key(i), self.get_value(i));
}
// Truncate original node
self.truncate(mid);
// Set sibling pointers
new_node.set_right_sibling(self.right_sibling());
self.set_right_sibling(new_node.page_id());
(new_node, separator)
}
}
4 Concurrent Access: The Real Challenge
The Problem: Locking a Tree
Naive approach: lock the entire tree
// โ Terrible performance
pub fn insert(&self, key: BTreeKey, value: BTreeValue) {
let _guard = self.global_lock.lock().unwrap();
// ... do insert ...
}
Result: One operation at a time. Defeats the purpose of a database.
Better: Lock Coupling (Crabbing)
Lock coupling: Hold locks on at most two adjacent nodes during traversal.
Rust implementation:
// src/index/btree_concurrent.rs
use std::sync::Arc;
use parking_lot::RwLock; // Better than std::sync::RwLock
pub struct BTreeIndex {
root_page: u64,
buffer_pool: Arc<BufferPool>,
// Each node is protected by its own lock
node_locks: Arc<DashMap<u64, Arc<RwLock<()>>>>, // page_id โ lock
}
impl BTreeIndex {
pub fn insert(&self, key: BTreeKey, value: BTreeValue) -> Result<(), BTreeError> {
let mut current_page = self.root_page;
let mut parent_lock: Option<Arc<RwLock<()>>> = None;
let mut current_lock = self.get_node_lock(current_page);
loop {
// Acquire read lock on current node
let read_guard = current_lock.read();
let node = self.get_node(current_page)?;
if node.is_leaf() {
// Upgrade to write lock
drop(read_guard);
let write_guard = current_lock.write();
// Check if split needed
if node.is_full() {
// Need parent lock for split
if let Some(parent_lock) = parent_lock {
let _parent_guard = parent_lock.write();
self.split_leaf(current_page, parent_lock)?;
} else {
// Splitting root
self.split_root(current_page)?;
}
}
// Insert into leaf
self.insert_into_leaf(current_page, key, value)?;
return Ok(());
} else {
// Internal nodeโdescend
let child_idx = node.find_child_index(&key);
let child_page = node.get_child_pointer(child_idx);
// Lock child before releasing current (lock coupling)
let child_lock = self.get_node_lock(child_page);
let _child_read = child_lock.read();
// Release current lock
drop(read_guard);
drop(current_lock);
// Move down
parent_lock = Some(child_lock.clone());
current_lock = child_lock;
current_page = child_page;
}
}
}
}
Problem: Upgrading from read lock to write lock while holding other locks can deadlock.
Solution in Vaultgres: Use parking_lot::RwLock with upgradable_read() or release and re-acquire (with optimistic retry).
The Split Nightmare
Scenario: Two threads split the same node
Thread A: Thread B:
1. Lock parent P
2. Lock leaf L
3. Split L โ L1, L2
4. Insert separator in P
5. Lock parent P (waits!)
5. Lock leaf L (already split!)
6. ??? CORRUPTION ???
Solution: Check conditions after acquiring locks
pub fn split_leaf(&self, leaf_page: u64, parent_lock: Arc<RwLock<()>>) -> Result<(), BTreeError> {
let parent_guard = parent_lock.write();
// Re-check: is this leaf still full?
let leaf = self.get_node(leaf_page)?;
if !leaf.is_full() {
// Another thread already splitโnothing to do!
return Ok(());
}
// Proceed with split...
}
5 Range Scans: Leveraging Linked Leaves
The Problem with Tree Traversal
For SELECT * FROM users WHERE id BETWEEN 10 AND 100:
Without leaf links: Must traverse tree for each key. O(n log n).
With leaf links: Find start key once, then scan leaves. O(log n + k).
Implementation
pub struct BTreeScan {
current_leaf: u64,
current_idx: usize,
end_key: BTreeKey,
buffer_pool: Arc<BufferPool>,
}
impl Iterator for BTreeScan {
type Item = BTreeValue;
fn next(&mut self) -> Option<Self::Item> {
loop {
let leaf = self.get_leaf(self.current_leaf)?;
if self.current_idx < leaf.key_count() {
let key = leaf.get_key(self.current_idx);
// Check if we've passed the end
if key > &self.end_key {
return None;
}
let value = leaf.get_value(self.current_idx);
self.current_idx += 1;
return Some(value);
} else {
// Move to next leaf
self.current_leaf = leaf.right_sibling();
self.current_idx = 0;
if self.current_leaf == 0 {
return None; // End of tree
}
}
}
}
}
Usage:
let scan = index.scan_range(BTreeKey::from_i32(10), BTreeKey::from_i32(100));
for value in scan {
let row = buffer_pool.get_page(value.page_num);
// Process row...
}
6 Integration with Buffer Pool
Page Types
The buffer pool now handles multiple page types:
// src/storage/page_type.rs
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PageType {
Heap, // Table data (from Part 1)
BTreeLeaf, // B+Tree leaf node
BTreeInternal, // B+Tree internal node
BTreeRoot, // B+Tree root
}
impl Page {
pub fn page_type(&self) -> PageType {
// Read from page header...
}
}
Memory Pressure
Problem: Index scans can evict hot data pages.
1. Index scan touches 1000 leaf pages
2. LRU evicts hot table pages to make room
3. Query needs table pagesโdisk I/O!
Solution: Clock-sweep with usage hints
// src/storage/buffer_pool.rs
pub struct BufferFrame {
// ... existing fields ...
pub usage_hint: UsageHint, // New!
}
#[derive(Debug, Clone, Copy)]
pub enum UsageHint {
Normal, // Standard LRU
IndexScan, // May be evicted sooner
Pinned, // Keep in memory (hot table)
}
impl BufferPool {
pub fn get_page_with_hint(&self, page_id: u64, hint: UsageHint) -> Option<Arc<Mutex<Page>>> {
// ... set hint on frame ...
}
}
7 Challenges Building in Rust
Challenge 1: Self-Referential Structures
Problem: Nodes need to reference their parent/children, but Rustโs borrow checker hates this.
// โ Doesn't compile
pub struct BTreeNode {
parent: Option<&BTreeNode>, // Reference to parent
children: Vec<&BTreeNode>, // References to children
}
Solution: Page IDs as indirect references
// โ
Works
pub struct BTreeNode {
parent_page: Option<u64>, // Page ID, not reference
children: Vec<u64>, // Page IDs
}
// Resolve page ID to node when needed
let parent = buffer_pool.get_page(self.parent_page?);
Challenge 2: Lock Ordering
Problem: Deadlock if threads acquire locks in different order.
Thread A: Lock page 5, then page 10
Thread B: Lock page 10, then page 5 โ Deadlock!
Solution: Always lock in consistent order (parent before child)
// Lock coupling enforces this naturally
pub fn descend(&self, parent_page: u64, child_page: u64) {
let parent_lock = self.get_lock(parent_page);
let child_lock = self.get_lock(child_page);
// Always acquire parent first
let _parent = parent_lock.read();
let _child = child_lock.read(); // Safeโconsistent order
}
Challenge 3: Splitting with WAL
Problem: A split touches multiple pages. How to make it atomic?
// โ Not atomic
self.write_node(left);
self.write_node(right); // โ Crash here = corruption!
self.update_parent();
Solution: WAL before modifying pages
pub fn split_node(&self, node_page: u64) -> Result<(), BTreeError> {
// 1. Write WAL record describing the split
let lsn = self.wal.log_split(node_page, left_data, right_data)?;
self.wal.flush()?; // Durable before proceeding
// 2. Now safe to modify pages
self.write_node(left);
self.write_node(right);
self.update_parent();
// 3. Log completion
self.wal.log_split_complete(lsn)?;
Ok(())
}
8 How AI Accelerated This
What AI Got Right
| Task | AI Contribution |
|---|---|
| Lock coupling algorithm | Explained the pattern with pseudocode |
| Split logic | Generated correct key redistribution |
| Rust patterns | Suggested parking_lot over std::sync |
| Debugging help | โThis deadlock happens becauseโฆโ |
What AI Got Wrong
| Issue | What Happened |
|---|---|
| Initial lock upgrade | Suggested RwLock::upgrade() which doesnโt exist in std |
| Page layout | First draft had keys and values interleaved (cache-unfriendly) |
| Split edge cases | Missed the โroot splitโ special case |
| Pattern: AI provides 80% of the solution. The remaining 20% requires deep understanding. |
Example: Debugging a Race Condition
My question to AI:
โTwo threads can split the same leaf simultaneously, causing duplicate keys. How does PostgreSQL prevent this?โ What I learned:
- PostgreSQL uses latch-based synchronization on buffer pins
- Before splitting, check if the split has already happened
- Use short-term locks released after each level
Result: Added the re-check logic in
split_leaf():
// Re-check after acquiring all locks
if !leaf.is_full() {
return Ok(()); // Another thread handled it
}
Summary: B+Tree in One Diagram
Key Takeaways:
| Concept | Why It Matters |
|---|---|
| B+Tree | O(log n) lookups, efficient range scans |
| Lock coupling | Fine-grained locking without deadlocks |
| Leaf links | Range scans without tree traversal |
| WAL integration | Atomic splits, crash recovery |
| Rust challenges | Borrow checker, lock ordering, self-referential structs |
Further Reading:
- PostgreSQL Source:
src/backend/access/nbtree/ - โThe Art of Computer Programming, Vol. 3โ by Knuth (B-Trees)
- โDatabase Management Systemsโ by Ramakrishnan (Ch. 10: Tree-Structured Indexing)
- โEfficient and Safe B+Tree Implementationโ by PostgreSQL contributors
- Vaultgres Repository: github.com/neoalienson/Vaultgres
Comments
Please accept the "Functionality" cookie category to view and post comments.
Comments failed to load. You can try again or view the discussion directly on GitHub.
View on GitHub