In Part 2, we built a concurrent B+Tree index. But thereโs a fundamental problem with our approach.
Readers block writers. Writers block readers.
// Current implementation
let lock = page.read(); // Reader acquires lock
// Writer waits... and waits... and waits...
// Reader still holding lock (maybe computing something expensive)
// Writer: ๐ญ
This is unacceptable for a real database. PostgreSQL handles thousands of concurrent transactions with readers never blocking writers. How?
MVCC: Multi-Version Concurrency Control.
Today: implementing MVCC in Rust with snapshot isolation, transaction management, and confronting the nightmare of transaction ID wraparound.
1 The MVCC Insight
The Problem: Locking Is Too Restrictive
Traditional locking (2PL):
Transaction A: SELECT * FROM users WHERE id = 1 -- Reads row X
Transaction B: UPDATE users SET balance = 100 WHERE id = 1 -- Blocked!
Transaction A: (still reading, maybe for 10 seconds)
Transaction B: ๐ก Still blocked!
MVCC approach:
Transaction A: SELECT * FROM users WHERE id = 1
-- Sees version 1 of row X (old but consistent)
Transaction B: UPDATE users SET balance = 100 WHERE id = 1
-- Creates version 2 of row X
-- Not blocked!
Both transactions proceed without blocking each other.
How MVCC Works
Every row has multiple versions:
Row: users.id = 1
Version 1: {id: 1, balance: 50, xmin: 100, xmax: NULL}
โ โ โ
Data Created Still visible
by txn 100 (not deleted)
Version 2: {id: 1, balance: 100, xmin: 200, xmax: NULL}
โ โ
New data Created by txn 200
Version 3: {id: 1, balance: 150, xmin: 300, xmax: 400}
โ โ โ
Old data Created Deleted by
by txn 300 txn 400
Transaction metadata:
| Field | Meaning |
|---|---|
xmin | Transaction ID that created this version |
xmax | Transaction ID that deleted this version (NULL = still alive) |
cmin | Command ID within transaction (for statement-level visibility) |
cmax | Command ID that deleted this version |
2 Transaction IDs and Snapshots
Transaction ID Allocation
// src/transaction/txn_id.rs
pub type TransactionId = u32;
pub const INVALID_XID: TransactionId = 0;
pub const FIRST_NORMAL_XID: TransactionId = 3; // 0, 1, 2 are bootstrap
pub struct TransactionIdGenerator {
next_xid: AtomicU32,
}
impl TransactionIdGenerator {
pub fn new() -> Self {
Self {
next_xid: AtomicU32::new(FIRST_NORMAL_XID),
}
}
pub fn next(&self) -> TransactionId {
self.next_xid.fetch_add(1, Ordering::SeqCst)
}
pub fn current(&self) -> TransactionId {
self.next_xid.load(Ordering::SeqCst)
}
}
Problem: u32 wraps at 4 billion. What happens then?
Answer: Catastrophe. Weโll handle this later.
Snapshots: The Heart of MVCC
A snapshot captures which transactions are visible:
// src/transaction/snapshot.rs
pub struct Snapshot {
pub xmin: TransactionId, // Oldest active transaction
pub xmax: TransactionId, // Next transaction ID (nothing >= xmax is visible)
pub active_transactions: Vec<TransactionId>, // Transactions in progress
}
impl Snapshot {
pub fn is_visible(&self, row_xmin: TransactionId, row_xmax: Option<TransactionId>) -> bool {
// Row created by transaction < xmin? Always visible
if row_xmin < self.xmin {
// Unless deleted by transaction >= xmin
return row_xmax.map_or(true, |xmax| xmax >= self.xmax);
}
// Row created by transaction >= xmax? Never visible
if row_xmin >= self.xmax {
return false;
}
// Row created by active transaction? Not visible (unless it's ours)
if self.active_transactions.contains(&row_xmin) {
return false;
}
// Row deleted by active transaction? Still visible
if let Some(xmax) = row_xmax {
if xmax < self.xmax && !self.active_transactions.contains(&xmax) {
return false; // Deleted by committed transaction
}
}
true
}
}
Visual example:
Time: 100 150 200 250 300 350
โ โ โ โ โ โ
Txn 100: [====== committed ======]
Txn 150: [========= active =========]
Txn 200: [== committed ==]
โ
Snapshot taken here
xmin=150, xmax=251, active=[150, 200]
Visibility rules:
- Row created by txn 100: VISIBLE (committed before snapshot)
- Row created by txn 150: NOT VISIBLE (still active)
- Row created by txn 200: NOT VISIBLE (committed after snapshot started)
- Row created by txn 250: NOT VISIBLE (>= xmax)
3 Row Layout with MVCC
Extended Page Header
// src/storage/mvcc_page.rs
use crate::transaction::{TransactionId, CommandId};
pub const MVCC_ROW_HEADER_SIZE: usize = 16;
#[repr(C)]
pub struct MvccRowHeader {
pub xmin: TransactionId, // 4 bytes
pub xmax: TransactionId, // 4 bytes (0 = not deleted)
pub cmin: CommandId, // 4 bytes
pub cmax: CommandId, // 4 bytes
}
pub struct MvccRow {
header: MvccRowHeader,
data: [u8], // Variable length
}
Page layout with MVCC:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ PageHeader (24 bytes) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ ItemId array (4 bytes each) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Free space โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Row 0: {xmin, xmax, cmin, cmax} + data โ
โ Row 1: {xmin, xmax, cmin, cmax} + data โ
โ Row 2: {xmin, xmax, cmin, cmax} + data โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Insert: Creating a New Version
// src/transaction/mvcc_operations.rs
impl MvccTable {
pub fn insert(&self, txn: &Transaction, data: &[u8]) -> Result<(), TableError> {
// Get a page with space
let mut page = self.get_page_for_insert()?;
// Create row with transaction metadata
let header = MvccRowHeader {
xmin: txn.xid,
xmax: 0, // Not deleted
cmin: txn.current_command_id(),
cmax: 0,
};
// Insert into page
let row_id = page.insert_mvcc_row(header, data)?;
// Mark page as dirty (needs WAL)
self.buffer_pool.mark_dirty(page.id());
Ok(())
}
}
WAL record for insert:
#[derive(Debug, Clone)]
pub struct WalRecordInsert {
pub page_id: u64,
pub offset: u16,
pub xmin: TransactionId,
pub data: Vec<u8>,
}
Update: Creating a New Version and Marking Old
Update is actually delete + insert:
impl MvccTable {
pub fn update<F>(&self, txn: &Transaction, row_id: RowId, modify: F) -> Result<(), TableError>
where
F: FnOnce(&[u8]) -> Vec<u8>,
{
// 1. Find old version
let old_row = self.get_row(row_id)?;
// 2. Check visibility (can only update visible rows)
if !txn.snapshot.is_visible(old_row.xmin, old_row.xmax) {
return Err(TableError::RowNotVisible);
}
// 3. Mark old version as deleted
let mut old_page = self.get_page(row_id.page_id)?;
old_page.set_xmax(row_id.offset, txn.xid);
old_page.set_cmax(row_id.offset, txn.current_command_id());
// 4. Create new version with updated data
let new_data = modify(&old_row.data);
let new_row_id = self.insert(txn, &new_data)?;
// 5. Log to WAL
self.wal.log_update(old_row_id, new_row_id, txn.xid)?;
Ok(())
}
}
Select: Visibility Check
impl MvccTable {
pub fn scan(&self, txn: &Transaction) -> impl Iterator<Item = Row> + '_ {
self.pages.iter().flat_map(move |page| {
page.rows().filter_map(move |row| {
if txn.snapshot.is_visible(row.xmin, row.xmax) {
Some(Row { data: row.data })
} else {
None // Invisible version
}
})
})
}
}
Example scenario:
Transaction A (xid=100): Transaction B (xid=200):
1. BEGIN;
2. INSERT INTO users VALUES (1, 50);
3. COMMIT;
4. BEGIN; (snapshot: xmin=201, xmax=201)
5. SELECT * FROM users;
โ Sees version from txn 100 โ
6. BEGIN;
7. UPDATE users SET balance = 100 WHERE id = 1;
(creates version 2, marks version 1 as deleted)
8. (not committed yet)
9. SELECT * FROM users;
โ Still sees version 1! (txn 200 not committed)
10. COMMIT;
11. SELECT * FROM users;
โ Now sees version 2 โ
4 Transaction States and Visibility
Transaction Lifecycle
Visibility Matrix
| Row State | Transaction State | Visible? |
|---|---|---|
xmin < snapshot.xmin, xmax = 0 | Committed | โ Yes |
xmin < snapshot.xmin, xmax < snapshot.xmin | Committed | โ No (deleted) |
xmin < snapshot.xmin, xmin in active | In Progress | โ No |
xmin >= snapshot.xmax | Any | โ No (future) |
xmin = current_txn | Current | โ Yes (own changes) |
5 VACUUM: Cleaning Up Dead Versions
The Problem: Dead Tuples Accumulate
After many updates:
Page:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Row 0: {xmin: 100, xmax: 200} โ Dead (both committed) โ
โ Row 1: {xmin: 300, xmax: 0} โ Live โ
โ Row 2: {xmin: 150, xmax: 250} โ Dead โ
โ Row 3: {xmin: 400, xmax: 0} โ Live โ
โ ... โ
โ 50% dead space! โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Without VACUUM: Table grows forever. Performance degrades.
VACUUM Process
// src/transaction/vacuum.rs
pub struct VacuumWorker {
buffer_pool: Arc<BufferPool>,
transaction_manager: Arc<TransactionManager>,
}
impl VacuumWorker {
pub fn vacuum_table(&self, table_id: u64) -> Result<VacuumStats, VacuumError> {
let mut stats = VacuumStats::default();
for page_id in self.get_table_pages(table_id) {
let mut page = self.buffer_pool.get_page(page_id)?;
// Get global xmin (oldest active transaction)
let global_xmin = self.transaction_manager.get_global_xmin();
// Scan all rows
for row_id in page.rows() {
let row = page.get_row(row_id);
// Row is dead if xmax is committed
if row.xmax != 0 && row.xmax < global_xmin {
// Mark as reusable
page.mark_row_free(row_id);
stats.dead_tuples_removed += 1;
}
}
self.buffer_pool.mark_dirty(page_id);
}
Ok(stats)
}
}
VACUUM doesnโt lock:
| Operation | Lock Level |
|---|---|
| Regular VACUUM | ShareUpdateExclusiveLock (allows reads/writes) |
| VACUUM FULL | AccessExclusiveLock (blocks everything) |
VACUUM FULL vs. Regular VACUUM
Regular VACUUM:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Before: [Dead][Live][Dead][Live][Dead][Live] โ
โ After: [Free][Live][Free][Live][Free][Live] โ
โ (space reusable for new rows in same page) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
VACUUM FULL:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Before: [Dead][Live][Dead][Live][Dead][Live] โ
โ After: [Live][Live][Live][Free][Free][Free] โ
โ (compacted, dead tuples physically removed) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
pub fn vacuum_full(&self, table_id: u64) -> Result<(), VacuumError> {
// 1. Acquire exclusive lock
let _lock = self.lock_table(table_id, LockMode::AccessExclusive);
// 2. Create new table file
let new_table_id = self.create_temp_table();
// 3. Copy only live tuples
for row in self.scan_live_rows(table_id) {
self.insert_into_table(new_table_id, row);
}
// 4. Swap tables (atomic rename)
self.swap_tables(table_id, new_table_id);
// 5. Release lock
Ok(())
}
6 Transaction ID Wraparound: The 4 Billion Row Problem
The Math
TransactionId = u32 // 0 to 4,294,967,295
At 1000 transactions/second:
4,294,967,295 / 1000 = 4,294,967 seconds = ~50 days
After 50 days: OVERFLOW! ๐ฑ
What Happens on Wraparound
Before wraparound:
Transaction 4,294,967,294: INSERT INTO users VALUES (1, 100);
Transaction 4,294,967,295: INSERT INTO users VALUES (2, 200);
Transaction 0 (wrapped): SELECT * FROM users;
โ Sees xid 4B as "older" than 0!
โ Wrong visibility! CORRUPTION!
PostgreSQLโs solution: 2-phase transaction IDs
// Transaction ID comparison with wraparound handling
pub fn transaction_id_precedes(id1: TransactionId, id2: TransactionId) -> bool {
// Treat as signed 32-bit integers
// This makes the comparison wrap-aware
(id1 as i32 - id2 as i32) < 0
}
// Example:
// 4,294,967,294 as i32 = -2
// 0 as i32 = 0
// -2 < 0 โ true (4B precedes 0) โ
Freezing Old Transactions
Vacuum freeze: Mark very old transactions as โfrozenโ
pub const FROZEN_XID: TransactionId = 2; // Special value
pub fn vacuum_freeze(&self, table_id: u64, freeze_limit: TransactionId) -> Result<(), VacuumError> {
for page_id in self.get_table_pages(table_id) {
let mut page = self.get_page(page_id)?;
for row_id in page.rows() {
let row = page.get_row(row_id);
// If xmin is old enough, freeze it
if row.xmin < freeze_limit && row.xmin != FROZEN_XID {
page.set_xmin(row_id, FROZEN_XID);
}
// If xmax is old enough, freeze it
if row.xmax != 0 && row.xmax < freeze_limit && row.xmax != FROZEN_XID {
page.set_xmax(row_id, FROZEN_XID);
}
}
}
Ok(())
}
Frozen rows are always visible:
impl Snapshot {
pub fn is_visible(&self, row_xmin: TransactionId, row_xmax: Option<TransactionId>) -> bool {
// Frozen rows are always visible
if row_xmin == FROZEN_XID {
return row_xmax.map_or(true, |xmax| xmax == FROZEN_XID);
}
// ... normal visibility logic ...
}
}
Autovacuum: Preventing Wraparound Automatically
// src/transaction/autovacuum.rs
pub struct AutovacuumLauncher {
transaction_manager: Arc<TransactionManager>,
vacuum_worker: Arc<VacuumWorker>,
}
impl AutovacuumLauncher {
pub fn run(&self) {
loop {
// Check how old the oldest transaction is
let oldest_xmin = self.transaction_manager.get_oldest_xmin();
let current_xid = self.transaction_manager.current_xid();
// Distance to wraparound
let distance_to_wraparound = TransactionId::MAX - current_xid + oldest_xmin;
// If getting close, trigger vacuum
if distance_to_wraparound < WRAPAROUND_EMERGENCY_THRESHOLD {
self.vacuum_worker.vacuum_freeze_all_tables();
}
sleep(Duration::from_secs(60));
}
}
}
PostgreSQLโs default thresholds:
| Parameter | Default | Meaning |
|---|---|---|
autovacuum_vacuum_threshold | 50 | Min dead tuples before vacuum |
autovacuum_vacuum_scale_factor | 0.2 | +20% of table size |
autovacuum_freeze_max_age | 200M | Max transactions before forced freeze |
7 Isolation Levels
ANSI SQL Isolation Levels
| Isolation Level | Dirty Read | Non-Repeatable Read | Phantom Read |
|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible |
| Read Committed | โ Prevented | Possible | Possible |
| Repeatable Read | โ Prevented | โ Prevented | Possible |
| Serializable | โ Prevented | โ Prevented | โ Prevented |
PostgreSQLโs Implementation
PostgreSQL uses MVCC for all isolation levels:
| Isolation Level | Implementation |
|---|---|
| Read Uncommitted | Same as Read Committed |
| Read Committed | Fresh snapshot per statement |
| Repeatable Read | Single snapshot per transaction |
| Serializable | Single snapshot + predicate locking |
#[derive(Debug, Clone, Copy)]
pub enum IsolationLevel {
ReadCommitted,
RepeatableRead,
Serializable,
}
impl Transaction {
pub fn get_snapshot(&self) -> Snapshot {
match self.isolation_level {
IsolationLevel::ReadCommitted => {
// New snapshot for each statement
self.transaction_manager.create_snapshot()
}
IsolationLevel::RepeatableRead | IsolationLevel::Serializable => {
// Reuse same snapshot for entire transaction
self.cached_snapshot.clone()
}
}
}
}
Serializable Isolation: Predicate Locking
// Simplified predicate locking
pub struct SerializableTransaction {
txn: Transaction,
read_predicates: Vec<Predicate>, // Ranges/conditions read
write_set: Vec<RowId>, // Rows written
}
pub struct Predicate {
pub page_id: u64,
pub key_range: Option<(BTreeKey, BTreeKey)>, // None = full scan
}
impl TransactionManager {
pub fn check_serializable_conflict(&self, txn: &SerializableTransaction) -> Result<(), SerializationError> {
// Check if any committed write conflicts with our reads
for predicate in &txn.read_predicates {
for write in self.recent_writes() {
if predicate.matches(&write) && write.committed_after(txn.snapshot.xmax) {
return Err(SerializationError::ReadWriteConflict);
}
}
}
Ok(())
}
}
On conflict: Abort one transaction with serialization_failure error.
8 Challenges Building in Rust
Challenge 1: Snapshot Lifetime
Problem: Snapshots need to outlive the transaction that created them.
// โ Doesn't work
pub fn begin_transaction(&self) -> Transaction {
let snapshot = self.create_snapshot(); // Borrowed from self
Transaction { snapshot, ... } // Snapshot doesn't live long
}
Solution: Owned snapshots
// โ
Works
pub fn begin_transaction(&self) -> Transaction {
let snapshot = self.create_snapshot(); // Returns owned Snapshot
Transaction {
snapshot: Arc::new(snapshot), // Shareable across threads
...
}
}
Challenge 2: Atomic Transaction State
Problem: Multiple threads need to see consistent transaction state.
// โ Race condition
pub fn commit(&self, txn: &mut Transaction) {
txn.state = TransactionState::Committed; // Not atomic!
// Other threads might see partial state
}
Solution: Atomic state with proper ordering
// โ
Works
pub struct Transaction {
pub xid: TransactionId,
pub state: AtomicU8, // Use atomic for state
pub snapshot: Arc<Snapshot>,
}
impl Transaction {
pub fn commit(&self) {
// 1. Write WAL first (durable)
self.wal.log_commit(self.xid)?;
// 2. Then mark as committed (visible)
self.state.store(TransactionState::Committed as u8, Ordering::Release);
// 3. Notify waiting transactions
self.transaction_manager.notify_committed(self.xid);
}
}
Challenge 3: Vacuum Without Blocking
Problem: How to vacuum while transactions are reading?
// โ Blocks readers
pub fn vacuum(&self) {
let _lock = self.table_lock.write(); // Exclusive lock
self.remove_dead_tuples();
}
Solution: Two-phase vacuum
// โ
Non-blocking
pub fn vacuum(&self) {
// Phase 1: Mark tuples as prune-able (no lock needed)
let global_xmin = self.get_global_xmin();
self.mark_pruneable(global_xmin);
// Phase 2: Reclaim space (uses page-level locks, not table lock)
for page in self.pages.iter() {
let _page_lock = page.lock.write();
self.reclaim_space_on_page(page);
}
}
9 How AI Accelerated This
What AI Got Right
| Task | AI Contribution |
|---|---|
| Visibility rules | Generated correct xmin/xmax logic |
| Wraparound handling | Explained 2โs complement trick |
| Snapshot structure | Suggested xmin/xmax/active pattern |
| VACUUM design | Outlined two-phase approach |
What AI Got Wrong
| Issue | What Happened |
|---|---|
| Initial visibility | First draft didnโt handle own uncommitted writes |
| Freeze logic | Missed that frozen rows need special handling in visibility |
| Serializable isolation | Suggested full serializability without predicate locking (wrong!) |
Pattern: MVCC is subtle. AI gets the 80% case. Edge cases require deep understanding.
Example: Debugging a Visibility Bug
My question to AI:
โTransaction A inserts a row, then selects it. But the select doesnโt see the row. Why?โ
What I learned:
- Transactions must see their own uncommitted writes
- Need to track
current_transaction_idin snapshot - Visibility check needs special case for
row_xmin == my_xid
Result: Fixed is_visible():
pub fn is_visible(&self, row_xmin: TransactionId, row_xmax: Option<TransactionId>, my_xid: TransactionId) -> bool {
// Special case: see your own writes
if row_xmin == my_xid {
return row_xmax.map_or(true, |xmax| xmax != my_xid);
}
// ... rest of visibility logic ...
}
Summary: MVCC in One Diagram
Key Takeaways:
| Concept | Why It Matters |
|---|---|
| MVCC | Readers donโt block writers, writers donโt block readers |
| Snapshots | Consistent view of data at a point in time |
| Transaction IDs | Track version creation/deletion |
| VACUUM | Reclaim space from dead versions |
| Wraparound | 4 billion transaction limit requires freezing |
| Isolation levels | Trade-off between consistency and concurrency |
Further Reading:
- PostgreSQL Source:
src/backend/access/heap/heapam_visibility.c - PostgreSQL Source:
src/backend/access/transam/ - โA Critique of ANSI SQL Isolation Levelsโ by Berenson et al. (1995)
- โDatabase Management Systemsโ by Ramakrishnan (Ch. 16: Concurrency Control)
- โThe PostgreSQL Bookโ by Worsley & Morin (Ch. 13: MVCC)
- Vaultgres Repository: github.com/neoalienson/Vaultgres
Comments
Please accept the "Functionality" cookie category to view and post comments.
Comments failed to load. You can try again or view the discussion directly on GitHub.
View on GitHub