Database in Rust: MVCC and Transaction Management

In Part 2, we built a concurrent B+Tree index. But thereโ€™s a fundamental problem with our approach.

Readers block writers. Writers block readers.

// Current implementation
let lock = page.read();  // Reader acquires lock
// Writer waits... and waits... and waits...
// Reader still holding lock (maybe computing something expensive)
// Writer: ๐Ÿ˜ญ

This is unacceptable for a real database. PostgreSQL handles thousands of concurrent transactions with readers never blocking writers. How?

MVCC: Multi-Version Concurrency Control.

Today: implementing MVCC in Rust with snapshot isolation, transaction management, and confronting the nightmare of transaction ID wraparound.


1 The MVCC Insight

The Problem: Locking Is Too Restrictive

Traditional locking (2PL):

Transaction A: SELECT * FROM users WHERE id = 1  -- Reads row X
Transaction B: UPDATE users SET balance = 100 WHERE id = 1  -- Blocked!

Transaction A: (still reading, maybe for 10 seconds)
Transaction B: ๐Ÿ˜ก Still blocked!

MVCC approach:

Transaction A: SELECT * FROM users WHERE id = 1
               -- Sees version 1 of row X (old but consistent)

Transaction B: UPDATE users SET balance = 100 WHERE id = 1
               -- Creates version 2 of row X
               -- Not blocked!

Both transactions proceed without blocking each other.

How MVCC Works

Every row has multiple versions:

Row: users.id = 1

Version 1: {id: 1, balance: 50,  xmin: 100, xmax: NULL}
           โ†‘                    โ†‘       โ†‘
           Data                Created  Still visible
                             by txn 100  (not deleted)

Version 2: {id: 1, balance: 100, xmin: 200, xmax: NULL}
           โ†‘                     โ†‘
           New data          Created by txn 200

Version 3: {id: 1, balance: 150, xmin: 300, xmax: 400}
           โ†‘                     โ†‘        โ†‘
           Old data          Created   Deleted by
                             by txn 300  txn 400

Transaction metadata:

FieldMeaning
xminTransaction ID that created this version
xmaxTransaction ID that deleted this version (NULL = still alive)
cminCommand ID within transaction (for statement-level visibility)
cmaxCommand ID that deleted this version

2 Transaction IDs and Snapshots

Transaction ID Allocation

// src/transaction/txn_id.rs
pub type TransactionId = u32;

pub const INVALID_XID: TransactionId = 0;
pub const FIRST_NORMAL_XID: TransactionId = 3;  // 0, 1, 2 are bootstrap

pub struct TransactionIdGenerator {
    next_xid: AtomicU32,
}

impl TransactionIdGenerator {
    pub fn new() -> Self {
        Self {
            next_xid: AtomicU32::new(FIRST_NORMAL_XID),
        }
    }

    pub fn next(&self) -> TransactionId {
        self.next_xid.fetch_add(1, Ordering::SeqCst)
    }

    pub fn current(&self) -> TransactionId {
        self.next_xid.load(Ordering::SeqCst)
    }
}

Problem: u32 wraps at 4 billion. What happens then?

Answer: Catastrophe. Weโ€™ll handle this later.


Snapshots: The Heart of MVCC

A snapshot captures which transactions are visible:

// src/transaction/snapshot.rs
pub struct Snapshot {
    pub xmin: TransactionId,      // Oldest active transaction
    pub xmax: TransactionId,      // Next transaction ID (nothing >= xmax is visible)
    pub active_transactions: Vec<TransactionId>,  // Transactions in progress
}

impl Snapshot {
    pub fn is_visible(&self, row_xmin: TransactionId, row_xmax: Option<TransactionId>) -> bool {
        // Row created by transaction < xmin? Always visible
        if row_xmin < self.xmin {
            // Unless deleted by transaction >= xmin
            return row_xmax.map_or(true, |xmax| xmax >= self.xmax);
        }

        // Row created by transaction >= xmax? Never visible
        if row_xmin >= self.xmax {
            return false;
        }

        // Row created by active transaction? Not visible (unless it's ours)
        if self.active_transactions.contains(&row_xmin) {
            return false;
        }

        // Row deleted by active transaction? Still visible
        if let Some(xmax) = row_xmax {
            if xmax < self.xmax && !self.active_transactions.contains(&xmax) {
                return false;  // Deleted by committed transaction
            }
        }

        true
    }
}

Visual example:

Time:     100    150    200    250    300    350
          โ”‚      โ”‚      โ”‚      โ”‚      โ”‚      โ”‚
Txn 100:  [====== committed ======]
Txn 150:         [========= active =========]
Txn 200:                [== committed ==]
                          โ†‘
                    Snapshot taken here
                    xmin=150, xmax=251, active=[150, 200]

Visibility rules:
- Row created by txn 100: VISIBLE (committed before snapshot)
- Row created by txn 150: NOT VISIBLE (still active)
- Row created by txn 200: NOT VISIBLE (committed after snapshot started)
- Row created by txn 250: NOT VISIBLE (>= xmax)

3 Row Layout with MVCC

Extended Page Header

// src/storage/mvcc_page.rs
use crate::transaction::{TransactionId, CommandId};

pub const MVCC_ROW_HEADER_SIZE: usize = 16;

#[repr(C)]
pub struct MvccRowHeader {
    pub xmin: TransactionId,      // 4 bytes
    pub xmax: TransactionId,      // 4 bytes (0 = not deleted)
    pub cmin: CommandId,          // 4 bytes
    pub cmax: CommandId,          // 4 bytes
}

pub struct MvccRow {
    header: MvccRowHeader,
    data: [u8],  // Variable length
}

Page layout with MVCC:

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ PageHeader (24 bytes)                                       โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ ItemId array (4 bytes each)                                 โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ Free space                                                  โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ Row 0: {xmin, xmax, cmin, cmax} + data                      โ”‚
โ”‚ Row 1: {xmin, xmax, cmin, cmax} + data                      โ”‚
โ”‚ Row 2: {xmin, xmax, cmin, cmax} + data                      โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Insert: Creating a New Version

// src/transaction/mvcc_operations.rs
impl MvccTable {
    pub fn insert(&self, txn: &Transaction, data: &[u8]) -> Result<(), TableError> {
        // Get a page with space
        let mut page = self.get_page_for_insert()?;

        // Create row with transaction metadata
        let header = MvccRowHeader {
            xmin: txn.xid,
            xmax: 0,  // Not deleted
            cmin: txn.current_command_id(),
            cmax: 0,
        };

        // Insert into page
        let row_id = page.insert_mvcc_row(header, data)?;

        // Mark page as dirty (needs WAL)
        self.buffer_pool.mark_dirty(page.id());

        Ok(())
    }
}

WAL record for insert:

#[derive(Debug, Clone)]
pub struct WalRecordInsert {
    pub page_id: u64,
    pub offset: u16,
    pub xmin: TransactionId,
    pub data: Vec<u8>,
}

Update: Creating a New Version and Marking Old

Update is actually delete + insert:

flowchart TD A[UPDATE users SET balance = 100 WHERE id = 1] --> B[Find old version] B --> C[Mark old version as deleted] C --> D[Set xmax = current txn] D --> E[Set cmax = current command] E --> F[Insert new version] F --> G[New version has new xmin] G --> H[Commit: both versions visible until VACUUM]
impl MvccTable {
    pub fn update<F>(&self, txn: &Transaction, row_id: RowId, modify: F) -> Result<(), TableError>
    where
        F: FnOnce(&[u8]) -> Vec<u8>,
    {
        // 1. Find old version
        let old_row = self.get_row(row_id)?;

        // 2. Check visibility (can only update visible rows)
        if !txn.snapshot.is_visible(old_row.xmin, old_row.xmax) {
            return Err(TableError::RowNotVisible);
        }

        // 3. Mark old version as deleted
        let mut old_page = self.get_page(row_id.page_id)?;
        old_page.set_xmax(row_id.offset, txn.xid);
        old_page.set_cmax(row_id.offset, txn.current_command_id());

        // 4. Create new version with updated data
        let new_data = modify(&old_row.data);
        let new_row_id = self.insert(txn, &new_data)?;

        // 5. Log to WAL
        self.wal.log_update(old_row_id, new_row_id, txn.xid)?;

        Ok(())
    }
}

Select: Visibility Check

impl MvccTable {
    pub fn scan(&self, txn: &Transaction) -> impl Iterator<Item = Row> + '_ {
        self.pages.iter().flat_map(move |page| {
            page.rows().filter_map(move |row| {
                if txn.snapshot.is_visible(row.xmin, row.xmax) {
                    Some(Row { data: row.data })
                } else {
                    None  // Invisible version
                }
            })
        })
    }
}

Example scenario:

Transaction A (xid=100):          Transaction B (xid=200):
1. BEGIN;
2. INSERT INTO users VALUES (1, 50);
3. COMMIT;
                                   4. BEGIN; (snapshot: xmin=201, xmax=201)
                                   5. SELECT * FROM users;
                                      โ†’ Sees version from txn 100 โœ“
6. BEGIN;
7. UPDATE users SET balance = 100 WHERE id = 1;
   (creates version 2, marks version 1 as deleted)
8. (not committed yet)
                                   9. SELECT * FROM users;
                                      โ†’ Still sees version 1! (txn 200 not committed)
10. COMMIT;
                                   11. SELECT * FROM users;
                                       โ†’ Now sees version 2 โœ“

4 Transaction States and Visibility

Transaction Lifecycle

stateDiagram-v2 [*] --> InProgress: BEGIN InProgress --> Committed: COMMIT InProgress --> Aborted: ROLLBACK Committed --> [*] Aborted --> [*] note right of InProgress xmin/xmax set to transaction ID end note note right of Committed Versions become visible to new snapshots end note note right of Aborted All versions marked as never existed end note

Visibility Matrix

Row StateTransaction StateVisible?
xmin < snapshot.xmin, xmax = 0Committedโœ… Yes
xmin < snapshot.xmin, xmax < snapshot.xminCommittedโŒ No (deleted)
xmin < snapshot.xmin, xmin in activeIn ProgressโŒ No
xmin >= snapshot.xmaxAnyโŒ No (future)
xmin = current_txnCurrentโœ… Yes (own changes)

5 VACUUM: Cleaning Up Dead Versions

The Problem: Dead Tuples Accumulate

After many updates:

Page:
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Row 0: {xmin: 100, xmax: 200} โ† Dead (both committed)       โ”‚
โ”‚ Row 1: {xmin: 300, xmax: 0}   โ† Live                        โ”‚
โ”‚ Row 2: {xmin: 150, xmax: 250} โ† Dead                        โ”‚
โ”‚ Row 3: {xmin: 400, xmax: 0}   โ† Live                        โ”‚
โ”‚ ...                                                         โ”‚
โ”‚ 50% dead space!                                             โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Without VACUUM: Table grows forever. Performance degrades.


VACUUM Process

// src/transaction/vacuum.rs
pub struct VacuumWorker {
    buffer_pool: Arc<BufferPool>,
    transaction_manager: Arc<TransactionManager>,
}

impl VacuumWorker {
    pub fn vacuum_table(&self, table_id: u64) -> Result<VacuumStats, VacuumError> {
        let mut stats = VacuumStats::default();

        for page_id in self.get_table_pages(table_id) {
            let mut page = self.buffer_pool.get_page(page_id)?;

            // Get global xmin (oldest active transaction)
            let global_xmin = self.transaction_manager.get_global_xmin();

            // Scan all rows
            for row_id in page.rows() {
                let row = page.get_row(row_id);

                // Row is dead if xmax is committed
                if row.xmax != 0 && row.xmax < global_xmin {
                    // Mark as reusable
                    page.mark_row_free(row_id);
                    stats.dead_tuples_removed += 1;
                }
            }

            self.buffer_pool.mark_dirty(page_id);
        }

        Ok(stats)
    }
}

VACUUM doesnโ€™t lock:

OperationLock Level
Regular VACUUMShareUpdateExclusiveLock (allows reads/writes)
VACUUM FULLAccessExclusiveLock (blocks everything)

VACUUM FULL vs. Regular VACUUM

Regular VACUUM:
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Before: [Dead][Live][Dead][Live][Dead][Live]               โ”‚
โ”‚ After:  [Free][Live][Free][Live][Free][Live]               โ”‚
โ”‚         (space reusable for new rows in same page)          โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

VACUUM FULL:
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Before: [Dead][Live][Dead][Live][Dead][Live]               โ”‚
โ”‚ After:  [Live][Live][Live][Free][Free][Free]               โ”‚
โ”‚         (compacted, dead tuples physically removed)         โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
pub fn vacuum_full(&self, table_id: u64) -> Result<(), VacuumError> {
    // 1. Acquire exclusive lock
    let _lock = self.lock_table(table_id, LockMode::AccessExclusive);

    // 2. Create new table file
    let new_table_id = self.create_temp_table();

    // 3. Copy only live tuples
    for row in self.scan_live_rows(table_id) {
        self.insert_into_table(new_table_id, row);
    }

    // 4. Swap tables (atomic rename)
    self.swap_tables(table_id, new_table_id);

    // 5. Release lock
    Ok(())
}

6 Transaction ID Wraparound: The 4 Billion Row Problem

The Math

TransactionId = u32  // 0 to 4,294,967,295

At 1000 transactions/second:
4,294,967,295 / 1000 = 4,294,967 seconds = ~50 days

After 50 days: OVERFLOW! ๐Ÿ˜ฑ

What Happens on Wraparound

Before wraparound:
Transaction 4,294,967,294: INSERT INTO users VALUES (1, 100);
Transaction 4,294,967,295: INSERT INTO users VALUES (2, 200);
Transaction 0 (wrapped):   SELECT * FROM users;
                           โ†’ Sees xid 4B as "older" than 0!
                           โ†’ Wrong visibility! CORRUPTION!

PostgreSQLโ€™s solution: 2-phase transaction IDs

// Transaction ID comparison with wraparound handling
pub fn transaction_id_precedes(id1: TransactionId, id2: TransactionId) -> bool {
    // Treat as signed 32-bit integers
    // This makes the comparison wrap-aware
    (id1 as i32 - id2 as i32) < 0
}

// Example:
// 4,294,967,294 as i32 = -2
// 0 as i32 = 0
// -2 < 0 โ†’ true (4B precedes 0) โœ“

Freezing Old Transactions

Vacuum freeze: Mark very old transactions as โ€œfrozenโ€

pub const FROZEN_XID: TransactionId = 2;  // Special value

pub fn vacuum_freeze(&self, table_id: u64, freeze_limit: TransactionId) -> Result<(), VacuumError> {
    for page_id in self.get_table_pages(table_id) {
        let mut page = self.get_page(page_id)?;

        for row_id in page.rows() {
            let row = page.get_row(row_id);

            // If xmin is old enough, freeze it
            if row.xmin < freeze_limit && row.xmin != FROZEN_XID {
                page.set_xmin(row_id, FROZEN_XID);
            }

            // If xmax is old enough, freeze it
            if row.xmax != 0 && row.xmax < freeze_limit && row.xmax != FROZEN_XID {
                page.set_xmax(row_id, FROZEN_XID);
            }
        }
    }

    Ok(())
}

Frozen rows are always visible:

impl Snapshot {
    pub fn is_visible(&self, row_xmin: TransactionId, row_xmax: Option<TransactionId>) -> bool {
        // Frozen rows are always visible
        if row_xmin == FROZEN_XID {
            return row_xmax.map_or(true, |xmax| xmax == FROZEN_XID);
        }

        // ... normal visibility logic ...
    }
}

Autovacuum: Preventing Wraparound Automatically

// src/transaction/autovacuum.rs
pub struct AutovacuumLauncher {
    transaction_manager: Arc<TransactionManager>,
    vacuum_worker: Arc<VacuumWorker>,
}

impl AutovacuumLauncher {
    pub fn run(&self) {
        loop {
            // Check how old the oldest transaction is
            let oldest_xmin = self.transaction_manager.get_oldest_xmin();
            let current_xid = self.transaction_manager.current_xid();

            // Distance to wraparound
            let distance_to_wraparound = TransactionId::MAX - current_xid + oldest_xmin;

            // If getting close, trigger vacuum
            if distance_to_wraparound < WRAPAROUND_EMERGENCY_THRESHOLD {
                self.vacuum_worker.vacuum_freeze_all_tables();
            }

            sleep(Duration::from_secs(60));
        }
    }
}

PostgreSQLโ€™s default thresholds:

ParameterDefaultMeaning
autovacuum_vacuum_threshold50Min dead tuples before vacuum
autovacuum_vacuum_scale_factor0.2+20% of table size
autovacuum_freeze_max_age200MMax transactions before forced freeze

7 Isolation Levels

ANSI SQL Isolation Levels

Isolation LevelDirty ReadNon-Repeatable ReadPhantom Read
Read UncommittedPossiblePossiblePossible
Read CommittedโŒ PreventedPossiblePossible
Repeatable ReadโŒ PreventedโŒ PreventedPossible
SerializableโŒ PreventedโŒ PreventedโŒ Prevented

PostgreSQLโ€™s Implementation

PostgreSQL uses MVCC for all isolation levels:

Isolation LevelImplementation
Read UncommittedSame as Read Committed
Read CommittedFresh snapshot per statement
Repeatable ReadSingle snapshot per transaction
SerializableSingle snapshot + predicate locking
#[derive(Debug, Clone, Copy)]
pub enum IsolationLevel {
    ReadCommitted,
    RepeatableRead,
    Serializable,
}

impl Transaction {
    pub fn get_snapshot(&self) -> Snapshot {
        match self.isolation_level {
            IsolationLevel::ReadCommitted => {
                // New snapshot for each statement
                self.transaction_manager.create_snapshot()
            }
            IsolationLevel::RepeatableRead | IsolationLevel::Serializable => {
                // Reuse same snapshot for entire transaction
                self.cached_snapshot.clone()
            }
        }
    }
}

Serializable Isolation: Predicate Locking

// Simplified predicate locking
pub struct SerializableTransaction {
    txn: Transaction,
    read_predicates: Vec<Predicate>,  // Ranges/conditions read
    write_set: Vec<RowId>,            // Rows written
}

pub struct Predicate {
    pub page_id: u64,
    pub key_range: Option<(BTreeKey, BTreeKey)>,  // None = full scan
}

impl TransactionManager {
    pub fn check_serializable_conflict(&self, txn: &SerializableTransaction) -> Result<(), SerializationError> {
        // Check if any committed write conflicts with our reads
        for predicate in &txn.read_predicates {
            for write in self.recent_writes() {
                if predicate.matches(&write) && write.committed_after(txn.snapshot.xmax) {
                    return Err(SerializationError::ReadWriteConflict);
                }
            }
        }

        Ok(())
    }
}

On conflict: Abort one transaction with serialization_failure error.


8 Challenges Building in Rust

Challenge 1: Snapshot Lifetime

Problem: Snapshots need to outlive the transaction that created them.

// โŒ Doesn't work
pub fn begin_transaction(&self) -> Transaction {
    let snapshot = self.create_snapshot();  // Borrowed from self
    Transaction { snapshot, ... }  // Snapshot doesn't live long
}

Solution: Owned snapshots

// โœ… Works
pub fn begin_transaction(&self) -> Transaction {
    let snapshot = self.create_snapshot();  // Returns owned Snapshot
    Transaction {
        snapshot: Arc::new(snapshot),  // Shareable across threads
        ...
    }
}

Challenge 2: Atomic Transaction State

Problem: Multiple threads need to see consistent transaction state.

// โŒ Race condition
pub fn commit(&self, txn: &mut Transaction) {
    txn.state = TransactionState::Committed;  // Not atomic!
    // Other threads might see partial state
}

Solution: Atomic state with proper ordering

// โœ… Works
pub struct Transaction {
    pub xid: TransactionId,
    pub state: AtomicU8,  // Use atomic for state
    pub snapshot: Arc<Snapshot>,
}

impl Transaction {
    pub fn commit(&self) {
        // 1. Write WAL first (durable)
        self.wal.log_commit(self.xid)?;

        // 2. Then mark as committed (visible)
        self.state.store(TransactionState::Committed as u8, Ordering::Release);

        // 3. Notify waiting transactions
        self.transaction_manager.notify_committed(self.xid);
    }
}

Challenge 3: Vacuum Without Blocking

Problem: How to vacuum while transactions are reading?

// โŒ Blocks readers
pub fn vacuum(&self) {
    let _lock = self.table_lock.write();  // Exclusive lock
    self.remove_dead_tuples();
}

Solution: Two-phase vacuum

// โœ… Non-blocking
pub fn vacuum(&self) {
    // Phase 1: Mark tuples as prune-able (no lock needed)
    let global_xmin = self.get_global_xmin();
    self.mark_pruneable(global_xmin);

    // Phase 2: Reclaim space (uses page-level locks, not table lock)
    for page in self.pages.iter() {
        let _page_lock = page.lock.write();
        self.reclaim_space_on_page(page);
    }
}

9 How AI Accelerated This

What AI Got Right

TaskAI Contribution
Visibility rulesGenerated correct xmin/xmax logic
Wraparound handlingExplained 2โ€™s complement trick
Snapshot structureSuggested xmin/xmax/active pattern
VACUUM designOutlined two-phase approach

What AI Got Wrong

IssueWhat Happened
Initial visibilityFirst draft didnโ€™t handle own uncommitted writes
Freeze logicMissed that frozen rows need special handling in visibility
Serializable isolationSuggested full serializability without predicate locking (wrong!)

Pattern: MVCC is subtle. AI gets the 80% case. Edge cases require deep understanding.


Example: Debugging a Visibility Bug

My question to AI:

โ€œTransaction A inserts a row, then selects it. But the select doesnโ€™t see the row. Why?โ€

What I learned:

  1. Transactions must see their own uncommitted writes
  2. Need to track current_transaction_id in snapshot
  3. Visibility check needs special case for row_xmin == my_xid

Result: Fixed is_visible():

pub fn is_visible(&self, row_xmin: TransactionId, row_xmax: Option<TransactionId>, my_xid: TransactionId) -> bool {
    // Special case: see your own writes
    if row_xmin == my_xid {
        return row_xmax.map_or(true, |xmax| xmax != my_xid);
    }

    // ... rest of visibility logic ...
}

Summary: MVCC in One Diagram

flowchart BT subgraph "Transaction Lifecycle" A[BEGIN] --> B[Get Snapshot] B --> C[Read/Write with MVCC] C --> D{COMMIT or ROLLBACK?} D -->|COMMIT| E[Write WAL] D -->|ROLLBACK| F[Discard Changes] E --> G[Mark Committed] end subgraph "MVCC Row States" H[Live: xmin committed, xmax=0] I[Dead: xmin & xmax committed] J[Frozen: xmin=FROZEN_XID] end subgraph "VACUUM" K[Regular VACUUM] --> L[Mark space reusable] M[VACUUM FULL] --> N[Compact table] O[AutoVacuum] --> P[Prevent wraparound] end subgraph "Isolation Levels" Q[Read Committed] --> R[New snapshot per statement] S[Repeatable Read] --> T[Single snapshot] U[Serializable] --> V[Predicate locking] end style B fill:#e3f2fd,stroke:#1976d2 style H fill:#e8f5e9,stroke:#388e3c style I fill:#ffebee,stroke:#c62828 style O fill:#fff3e0,stroke:#f57c00

Key Takeaways:

ConceptWhy It Matters
MVCCReaders donโ€™t block writers, writers donโ€™t block readers
SnapshotsConsistent view of data at a point in time
Transaction IDsTrack version creation/deletion
VACUUMReclaim space from dead versions
Wraparound4 billion transaction limit requires freezing
Isolation levelsTrade-off between consistency and concurrency

Further Reading: