In Part 3, we built MVCC for concurrent transactions. But thereโs a terrifying question we havenโt answered.
What happens when the power goes out?
Transaction: UPDATE accounts SET balance = 1000 WHERE id = 1
1. Read page into buffer pool
2. Modify page in memory (balance = 1000)
3. Mark page as dirty
4. ACK to client: "Done!"
5. โก POWER FAILURE โก
6. Dirty page never written to disk
7. Client's money: GONE ๐ธ
This is why databases use WAL: Write-Ahead Logging.
Today: implementing WAL and the ARIES recovery algorithm in Rust. This is the code that ensures your data survives crashes, power failures, and kernel panics.
1 The WAL Principle
The Fundamental Rule
Write-Ahead Logging: Before modifying any data page, you MUST write the change to the WAL.
// โ WRONG - data modification before WAL
pub fn update(&self, row_id: RowId, new_data: &[u8]) -> Result<(), Error> {
let mut page = self.buffer_pool.get_page(row_id.page_id)?;
page.write(row_id.offset, new_data); // Modified in memory!
page.mark_dirty();
// WAL comes too late
self.wal.log_update(row_id, new_data)?; // Too late!
Ok(())
}
// โ
CORRECT - WAL first
pub fn update(&self, row_id: RowId, new_data: &[u8]) -> Result<(), Error> {
// 1. Generate LSN (Log Sequence Number)
let lsn = self.wal.log_update(row_id, new_data)?;
// 2. Flush WAL to disk (fsync)
self.wal.flush(lsn)?;
// 3. NOW safe to modify page
let mut page = self.buffer_pool.get_page(row_id.page_id)?;
page.write(row_id.offset, new_data);
page.set_lsn(lsn); // Track which LSN modified this page
page.mark_dirty();
Ok(())
}
Why this works:
Crash at different points:
After WAL write, before page modify:
โ Recovery replays WAL, data is restored โ
After page modify, before flush:
โ WAL on disk, recovery ensures durability โ
After flush to disk:
โ Data is durable โ
Log Sequence Numbers (LSN)
Every WAL record gets a unique, monotonically increasing identifier:
// src/wal/lsn.rs
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Lsn(u64);
impl Lsn {
pub const INVALID: Lsn = Lsn(0);
pub const fn new(value: u64) -> Self {
Self(value)
}
pub const fn invalid(&self) -> bool {
self.0 == 0
}
// LSN encoding: segment_id (high 32) + offset (low 32)
pub const fn from_segment_offset(segment: u32, offset: u32) -> Self {
Self(((segment as u64) << 32) | (offset as u64))
}
pub const fn segment(&self) -> u32 {
(self.0 >> 32) as u32
}
pub const fn offset(&self) -> u32 {
(self.0 & 0xFFFFFFFF) as u32
}
}
LSN ordering guarantees:
LSN 100: BEGIN txn 1
LSN 101: INSERT row A (txn 1)
LSN 102: INSERT row B (txn 1)
LSN 103: COMMIT txn 1
LSN 104: BEGIN txn 2
LSN 105: UPDATE row A (txn 2)
...
LSN 100 < LSN 101 < LSN 102 < ... (strictly increasing)
2 WAL Record Format
Record Structure
// src/wal/record.rs
use crate::storage::PageId;
use crate::transaction::TransactionId;
#[derive(Debug, Clone)]
pub struct WalRecord {
pub lsn: Lsn,
pub prev_lsn: Lsn, // Link to previous record from same transaction
pub transaction_id: TransactionId,
pub record_type: WalRecordType,
pub page_id: PageId,
pub offset: u16,
pub data: WalData,
pub checksum: u32,
}
#[derive(Debug, Clone)]
pub enum WalRecordType {
Begin,
Insert,
Update { before_image: Vec<u8>, after_image: Vec<u8> },
Delete { before_image: Vec<u8> },
Commit,
Abort,
CheckpointBegin,
CheckpointEnd,
Compensation { undo_next_lsn: Lsn }, // For aborts
}
#[derive(Debug, Clone)]
pub enum WalData {
PageImage(Vec<u8>), // Full page image (for checkpoints)
RowData(Vec<u8>), // Row-level change
IndexEntry { key: Vec<u8>, value: Vec<u8> },
}
Physical layout on disk:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ WAL Segment File (e.g., 000000010000000000000001) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ PageHeader (16 bytes) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Record 1: โ
โ โโ Length (4 bytes) โ
โ โโ LSN (8 bytes) โ
โ โโ Prev LSN (8 bytes) โ
โ โโ Transaction ID (4 bytes) โ
โ โโ Record Type (1 byte) โ
โ โโ Page ID (8 bytes) โ
โ โโ Offset (2 bytes) โ
โ โโ Data Length (4 bytes) โ
โ โโ Data (variable) โ
โ โโ Checksum (4 bytes) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Record 2: โ
โ ... โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Physical vs. Logical WAL
PostgreSQL uses physical WAL (page-level changes):
| Type | Whatโs Logged | Pros | Cons |
|---|---|---|---|
| Physical | Byte ranges on pages | Simple replay, exact changes | Larger logs, page-format dependent |
| Logical | SQL operations (INSERT/UPDATE) | Smaller logs, format-independent | Complex replay, must re-execute |
Vaultgres approach: Physical WAL for simplicity (matching PostgreSQL):
pub struct PhysicalWalRecord {
pub page_id: PageId,
pub offset: u16,
pub length: u16,
pub before_image: Option<Vec<u8>>, // For undo
pub after_image: Vec<u8>, // For redo
}
3 WAL Manager Implementation
Writing WAL Records
// src/wal/manager.rs
use std::fs::{File, OpenOptions};
use std::io::{Write, Seek, SeekFrom};
use std::path::PathBuf;
use parking_lot::Mutex;
pub struct WalManager {
wal_dir: PathBuf,
current_segment: u32,
current_file: Mutex<File>,
current_offset: Mutex<u32>,
flush_lsn: Mutex<Lsn>,
last_lsn: AtomicU64,
}
impl WalManager {
pub fn new(wal_dir: &str) -> Result<Self, WalError> {
std::fs::create_dir_all(wal_dir)?;
let mut manager = Self {
wal_dir: PathBuf::from(wal_dir),
current_segment: 0,
current_file: Mutex::new(File::create("")?), // Placeholder
current_offset: Mutex::new(0),
flush_lsn: Mutex::new(Lsn::INVALID),
last_lsn: AtomicU64::new(0),
};
manager.open_or_create_segment(0)?;
Ok(manager)
}
fn open_or_create_segment(&mut self, segment: u32) -> Result<(), WalError> {
let path = self.wal_dir.join(format!("{:024X}", segment));
let file = OpenOptions::new()
.read(true)
.write(true)
.create(true)
.open(&path)?;
*self.current_file.lock() = file;
self.current_segment = segment;
*self.current_offset.lock() = 0;
Ok(())
}
pub fn append(&self, record: WalRecord) -> Result<Lsn, WalError> {
// Assign new LSN
let lsn = Lsn::new(self.last_lsn.fetch_add(1, Ordering::SeqCst) + 1);
// Serialize record
let mut buffer = Vec::new();
self.serialize_record(&record, lsn, &mut buffer)?;
// Check if we need a new segment
let mut offset = self.current_offset.lock();
if *offset + buffer.len() as u32 > SEGMENT_SIZE {
drop(offset);
self.rotate_segment()?;
offset = self.current_offset.lock();
}
// Write to file
let mut file = self.current_file.lock();
file.seek(SeekFrom::Start(*offset as u64))?;
file.write_all(&buffer)?;
*offset += buffer.len() as u32;
Ok(lsn)
}
pub fn flush(&self, target_lsn: Lsn) -> Result<(), WalError> {
let mut flush_lsn = self.flush_lsn.lock();
// Already flushed?
if *flush_lsn >= target_lsn {
return Ok(());
}
// Sync to disk
self.current_file.lock().sync_all()?;
*flush_lsn = target_lsn;
Ok(())
}
}
Reading WAL Records
impl WalManager {
pub fn read_from(&self, start_lsn: Lsn) -> Result<WalIterator, WalError> {
let segment = start_lsn.segment();
let offset = start_lsn.offset();
let path = self.wal_dir.join(format!("{:024X}", segment));
let file = File::open(&path)?;
Ok(WalIterator {
current_file: file,
current_segment: segment,
current_offset: offset,
wal_dir: self.wal_dir.clone(),
})
}
}
pub struct WalIterator {
current_file: File,
current_segment: u32,
current_offset: u32,
wal_dir: PathBuf,
}
impl Iterator for WalIterator {
type Item = Result<WalRecord, WalError>;
fn next(&mut self) -> Option<Self::Item> {
// Try to read record at current position
match self.read_record() {
Ok(Some(record)) => Some(Ok(record)),
Ok(None) => {
// End of segment, try next segment
self.current_segment += 1;
self.current_offset = 0;
let path = self.wal_dir.join(format!("{:024X}", self.current_segment));
self.current_file = File::open(&path).ok()?;
self.read_record().map(|r| r.ok()).flatten()
}
Err(e) => Some(Err(e)),
}
}
}
4 Checkpoints: Limiting Recovery Time
The Problem: Unbounded Replay
Without checkpoints:
Day 1: Database created
Day 30: Crash!
Recovery: Replay 30 days of WAL records ๐ฑ
Solution: Periodic checkpoints.
Checkpoint Process
// src/wal/checkpoint.rs
pub struct Checkpoint {
pub checkpoint_lsn: Lsn,
pub active_transactions: Vec<TransactionId>,
pub dirty_pages: Vec<(PageId, Lsn)>, // page_id โ page_lsn
}
impl WalManager {
pub fn create_checkpoint(&self, buffer_pool: &BufferPool) -> Result<Checkpoint, WalError> {
// 1. Log checkpoint begin
let begin_lsn = self.append(WalRecord::checkpoint_begin())?;
// 2. Flush all dirty pages (this is the expensive part!)
buffer_pool.flush_all()?;
// 3. Get current state
let checkpoint = Checkpoint {
checkpoint_lsn: begin_lsn,
active_transactions: self.get_active_transactions(),
dirty_pages: buffer_pool.get_dirty_pages(),
};
// 4. Log checkpoint end
let end_lsn = self.append(WalRecord::checkpoint_end(&checkpoint))?;
// 5. Flush WAL
self.flush(end_lsn)?;
// 6. Save checkpoint to known location
self.save_checkpoint_record(&checkpoint)?;
Ok(checkpoint)
}
}
Fuzzy Checkpoints (PostgreSQL Style)
Sharp checkpoint: Blocks all writes during checkpoint. Simple but causes pauses.
Fuzzy checkpoint: Allows writes during checkpoint. Complex but no pauses.
// Fuzzy checkpoint approach
pub fn create_fuzzy_checkpoint(&self, buffer_pool: &BufferPool) -> Result<Checkpoint, WalError> {
// 1. Record checkpoint start LSN
let start_lsn = self.current_lsn();
// 2. Write CHECKPOINT_BEGIN (don't block)
self.append(WalRecord::checkpoint_begin())?;
// 3. Get list of dirty pages (snapshot)
let dirty_pages = buffer_pool.get_dirty_pages_snapshot();
// 4. Flush dirty pages in background (don't block new writes)
let checkpoint_lsn = self.current_lsn();
for (page_id, page_lsn) in dirty_pages {
if page_lsn < checkpoint_lsn {
buffer_pool.flush_page(page_id)?;
}
}
// 5. Write CHECKPOINT_END with final LSN
let end_lsn = self.current_lsn();
self.append(WalRecord::checkpoint_end_at(end_lsn))?;
self.flush(end_lsn)?;
Ok(Checkpoint {
checkpoint_lsn: end_lsn,
// ...
})
}
5 ARIES: The Recovery Algorithm
The Three Phases
ARIES = Algorithm for Recovery and Isolation Exploiting Semantics
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ ARIES Recovery โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Phase 1: ANALYSIS โ
โ - Scan WAL from last checkpoint โ
โ - Determine which transactions were active at crash โ
โ - Build dirty page table โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Phase 2: REDO โ
โ - Replay ALL logged changes from analysis end โ
โ - Bring database to exact crash state โ
โ - Skip pages already on disk (using page LSN) โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Phase 3: UNDO โ
โ - Roll back all uncommitted transactions โ
โ - Write Compensation Log Records (CLRs) โ
โ - Database is now consistent โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Phase 1: Analysis
// src/recovery/analysis.rs
pub struct AnalysisResult {
pub transactions_at_crash: HashMap<TransactionId, TxnStatus>,
pub dirty_page_table: HashMap<PageId, Lsn>, // page โ first LSN that dirtied it
pub redo_start_lsn: Lsn,
}
pub fn analyze(wal: &WalManager, checkpoint: &Checkpoint) -> Result<AnalysisResult, RecoveryError> {
let mut txn_status: HashMap<TransactionId, TxnStatus> = HashMap::new();
let mut dirty_page_table: HashMap<PageId, Lsn> = HashMap::new();
// Initialize from checkpoint
for txn in &checkpoint.active_transactions {
txn_status.insert(*txn, TxnStatus::Active);
}
for (page_id, page_lsn) in &checkpoint.dirty_pages {
dirty_page_table.insert(*page_id, *page_lsn);
}
// Scan WAL from checkpoint
let mut iterator = wal.read_from(checkpoint.checkpoint_lsn)?;
for record_result in iterator {
let record = record_result?;
match record.record_type {
WalRecordType::Begin => {
txn_status.insert(record.transaction_id, TxnStatus::Active);
}
WalRecordType::Commit => {
txn_status.insert(record.transaction_id, TxnStatus::Committed);
}
WalRecordType::Abort => {
txn_status.insert(record.transaction_id, TxnStatus::Aborted);
}
WalRecordType::Insert | WalRecordType::Update | WalRecordType::Delete => {
// Track first LSN that dirtied each page
dirty_page_table.entry(record.page_id)
.or_insert(record.lsn);
}
_ => {}
}
}
// Find minimum redo LSN
let redo_start_lsn = dirty_page_table.values()
.min()
.copied()
.unwrap_or(checkpoint.checkpoint_lsn);
Ok(AnalysisResult {
transactions_at_crash: txn_status,
dirty_page_table,
redo_start_lsn,
})
}
Phase 2: Redo
// src/recovery/redo.rs
pub fn redo(
wal: &WalManager,
buffer_pool: &BufferPool,
analysis: &AnalysisResult,
) -> Result<(), RecoveryError> {
let mut iterator = wal.read_from(analysis.redo_start_lsn)?;
for record_result in iterator {
let record = record_result?;
// Only redo committed transactions' changes
// (We redo ALL changes first, undo uncommitted later)
match &record.record_type {
WalRecordType::Insert | WalRecordType::Update | WalRecordType::Delete => {
// Check if page needs redo
let page = buffer_pool.get_page(record.page_id)?;
let page_lsn = page.get_lsn();
// Only apply if page is older than this LSN
if page_lsn < record.lsn {
apply_redo(&record, &page)?;
page.set_lsn(record.lsn);
}
// Else: page already has this change (written before crash)
}
_ => {}
}
}
Ok(())
}
fn apply_redo(record: &WalRecord, page: &Page) -> Result<(), RecoveryError> {
match &record.data {
WalData::PageImage(image) => {
// Full page image (from checkpoint)
page.copy_from_slice(image);
}
WalData::RowData(data) => {
// Partial page update
page.write(record.offset, data);
}
_ => {}
}
Ok(())
}
Key insight: Redo is idempotent. Running it multiple times produces the same result.
Phase 3: Undo
// src/recovery/undo.rs
pub fn undo(
wal: &mut WalManager,
buffer_pool: &BufferPool,
analysis: &AnalysisResult,
) -> Result<(), RecoveryError> {
// Find transactions to undo (active at crash, not committed)
let losers: Vec<TransactionId> = analysis.transactions_at_crash
.iter()
.filter(|(_, status)| **status == TxnStatus::Active)
.map(|(txn, _)| *txn)
.collect();
// Undo in reverse order (LIFO - Last Committed, First Undone)
for txn_id in losers.into_iter().rev() {
undo_transaction(wal, buffer_pool, txn_id)?;
}
Ok(())
}
fn undo_transaction(
wal: &mut WalManager,
buffer_pool: &BufferPool,
txn_id: TransactionId,
) -> Result<(), RecoveryError> {
// Find all records for this transaction (in reverse)
let records = wal.get_transaction_records(txn_id)?;
for record in records.iter().rev() {
// Write Compensation Log Record (CLR)
let clr = WalRecord {
lsn: wal.next_lsn(),
prev_lsn: record.lsn,
transaction_id: txn_id,
record_type: WalRecordType::Compensation {
undo_next_lsn: record.prev_lsn,
},
page_id: record.page_id,
offset: record.offset,
data: WalData::RowData(record.before_image.clone().unwrap_or_default()),
checksum: 0, // Calculate checksum
};
let clr_lsn = wal.append(clr)?;
// Apply undo (restore before_image)
if let Some(before_image) = &record.before_image {
let page = buffer_pool.get_page(record.page_id)?;
page.write(record.offset, before_image);
page.set_lsn(clr_lsn);
}
}
// Log transaction abort
wal.append(WalRecord::abort(txn_id))?;
wal.flush_all()?;
Ok(())
}
Complete Recovery Process
// src/recovery/mod.rs
pub fn recover(
wal: &mut WalManager,
buffer_pool: &BufferPool,
data_dir: &str,
) -> Result<RecoveryStats, RecoveryError> {
// 1. Find last checkpoint
let checkpoint = find_last_checkpoint(wal, data_dir)?;
println!("Starting recovery from checkpoint LSN {}", checkpoint.checkpoint_lsn);
// 2. Phase 1: Analysis
println!("Phase 1: Analysis...");
let analysis = analyze(wal, &checkpoint)?;
println!(" Found {} active transactions at crash",
analysis.transactions_at_crash.len());
println!(" Redo will start from LSN {}", analysis.redo_start_lsn);
// 3. Phase 2: Redo
println!("Phase 2: Redo...");
redo(wal, buffer_pool, &analysis)?;
println!(" Redo complete");
// 4. Phase 3: Undo
println!("Phase 3: Undo...");
undo(wal, buffer_pool, &analysis)?;
println!(" Undo complete");
// 5. Truncate old WAL (optional)
truncate_wal_before(wal, checkpoint.checkpoint_lsn)?;
Ok(RecoveryStats {
checkpoint_lsn: checkpoint.checkpoint_lsn,
redo_start_lsn: analysis.redo_start_lsn,
transactions_aborted: analysis.transactions_at_crash
.iter()
.filter(|(_, s)| **s == TxnStatus::Active)
.count(),
})
}
6 Recovery Example: Step by Step
Crash Scenario
Time LSN Transaction Action
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
10:00 100 CKPT Checkpoint created
10:01 101 T1 (xid=1) BEGIN
10:02 102 T1 INSERT row A (balance=100)
10:03 103 T2 (xid=2) BEGIN
10:04 104 T2 INSERT row B (balance=200)
10:05 105 T1 COMMIT
10:06 106 T2 UPDATE row B (balance=250)
10:07 โโโโโ โก CRASH โก
State at crash:
- T1: Committed (LSN 105)
- T2: Active (not committed)
- Dirty pages: A (LSN 102), B (LSN 106)
Recovery Execution
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Phase 1: ANALYSIS โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Start from checkpoint LSN 100 โ
โ Scan records 100-106 โ
โ โ
โ Result: โ
โ - T1: Committed (LSN 105) โ
โ - T2: Active (loser!) โ
โ - Dirty pages: Aโ102, Bโ106 โ
โ - Redo start: LSN 102 โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Phase 2: REDO โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Replay from LSN 102: โ
โ โ
โ LSN 102: INSERT row A โ
โ โ Check page A LSN โ
โ โ If page LSN < 102: apply insert โ
โ โ Else: skip (already on disk) โ
โ โ
โ LSN 104: INSERT row B โ
โ โ Apply if needed โ
โ โ
โ LSN 106: UPDATE row B โ
โ โ Apply if needed โ
โ โ
โ Result: Database at exact crash state โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Phase 3: UNDO โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Loser transactions: T2 โ
โ โ
โ Undo T2 (in reverse order): โ
โ 1. Undo LSN 106 (UPDATE B: 200โ250) โ
โ โ Write CLR: undo_next_lsn = 104 โ
โ โ Restore B to balance=200 โ
โ โ
โ 2. Undo LSN 104 (INSERT B) โ
โ โ Write CLR: undo_next_lsn = 103 โ
โ โ Delete row B โ
โ โ
โ 3. Log T2 ABORT โ
โ โ
โ Result: T2's changes rolled back, T1's changes preserved โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
7 WAL Archiving and Point-in-Time Recovery
WAL Archiving
Continuous archiving: Copy WAL segments to safe storage before reuse.
// src/wal/archiver.rs
pub struct WalArchiver {
wal_dir: PathBuf,
archive_dir: PathBuf,
archive_timeout: Duration,
last_archived_segment: u32,
}
impl WalArchiver {
pub fn archive_ready_segments(&mut self) -> Result<(), WalError> {
// Find completed segments (can't be overwritten)
for segment in self.get_ready_segments() {
let src = self.wal_dir.join(format!("{:024X}", segment));
let dst = self.archive_dir.join(format!("{:024X}.backup", segment));
// Copy to archive (could be remote storage like S3)
std::fs::copy(&src, &dst)?;
self.last_archived_segment = segment;
}
Ok(())
}
}
Point-in-Time Recovery (PITR)
Goal: Restore database to state at 2026-03-22 14:30:00
1. Restore base backup from 2026-03-22 00:00:00
2. Replay WAL segments from archive
3. Stop replay at target time (14:30:00)
4. Database restored to exact point in time
// src/recovery/pitr.rs
pub fn recover_to_point_in_time(
base_backup: &str,
archive_dir: &str,
target_time: chrono::DateTime<chrono::Utc>,
) -> Result<(), RecoveryError> {
// 1. Restore base backup
restore_base_backup(base_backup)?;
// 2. Find WAL segments to replay
let segments = find_wal_segments_for_time_range(archive_dir, target_time)?;
// 3. Replay WAL up to target time
for segment in segments {
let records = read_wal_segment(&segment)?;
for record in records {
// Check if we've passed target time
if record.timestamp > target_time {
println!("Reached target time, stopping recovery");
return Ok(());
}
apply_redo_record(&record)?;
}
}
Ok(())
}
8 Challenges Building in Rust
Challenge 1: fsync and Durability
Problem: Rustโs File::sync_all() is correct, but easy to forget.
// โ Missing fsync - data NOT durable!
pub fn commit(&self, txn_id: TransactionId) -> Result<(), Error> {
let record = WalRecord::commit(txn_id);
self.wal.append(record)?;
// Forgot to flush!
Ok(())
}
// โ
Correct
pub fn commit(&self, txn_id: TransactionId) -> Result<(), Error> {
let record = WalRecord::commit(txn_id);
let lsn = self.wal.append(record)?;
self.wal.flush(lsn)?; // fsync!
Ok(())
}
Lesson: Wrap WAL operations in safe abstractions that enforce flushing.
Challenge 2: LSN Ordering and Concurrency
Problem: Multiple threads appending WAL records must get monotonically increasing LSNs.
// โ Race condition - LSNs not ordered!
pub fn append(&self, record: WalRecord) -> Lsn {
let lsn = self.current_lsn + 1; // Not atomic!
self.current_lsn = lsn;
// ...
}
// โ
Correct - atomic LSN allocation
pub fn append(&self, record: WalRecord) -> Result<Lsn, WalError> {
let lsn = Lsn::new(self.last_lsn.fetch_add(1, Ordering::SeqCst) + 1);
// ...
}
Challenge 3: Partial Writes and Checksums
Problem: Crash during WAL write = partial record on disk.
// Solution: Checksums + length prefix
pub fn serialize_record(record: &WalRecord, buffer: &mut Vec<u8>) {
let data_start = buffer.len();
// Write placeholder for length (fill in later)
buffer.extend_from_slice(&[0u8; 4]);
// Write record fields
buffer.extend_from_slice(&record.lsn.to_le_bytes());
// ... more fields ...
buffer.extend_from_slice(&record.data);
// Calculate checksum over everything except checksum field
let checksum = crc32::calculate(&buffer[data_start + 4..]);
buffer.extend_from_slice(&checksum.to_le_bytes());
// Fill in length
let total_len = buffer.len() - data_start;
buffer[data_start..data_start + 4]
.copy_from_slice(&(total_len as u32).to_le_bytes());
}
pub fn deserialize_record(buffer: &[u8]) -> Result<WalRecord, WalError> {
// Read length
let length = u32::from_le_bytes(buffer[0..4].try_into()?) as usize;
// Verify we have enough data
if buffer.len() < length {
return Err(WalError::PartialWrite);
}
// Verify checksum
let stored_checksum = u32::from_le_bytes(
buffer[length - 4..length].try_into()?
);
let calculated = crc32::calculate(&buffer[4..length - 4]);
if stored_checksum != calculated {
return Err(WalError::ChecksumMismatch);
}
// ... parse record ...
}
9 How AI Accelerated This
What AI Got Right
| Task | AI Contribution |
|---|---|
| ARIES phases | Explained analysis/redo/undo clearly |
| LSN structure | Suggested segment/offset encoding |
| Checkpoint design | Outlined fuzzy vs. sharp trade-offs |
| CLR records | Explained compensation log record purpose |
What AI Got Wrong
| Issue | What Happened |
|---|---|
| Redo logic | First draft redid only committed txns (wrong! Redo ALL, then undo) |
| Undo order | Suggested forward order instead of reverse (LIFO) |
| Page LSN | Missed that page LSN is used to skip redundant redos |
Pattern: ARIES is subtle. The โredo all, undo someโ insight is counterintuitive.
Example: Understanding Redo Philosophy
My question to AI:
โWhy does ARIES redo uncommitted transactions? Shouldnโt we only redo committed ones?โ
What I learned:
- Redo phase: Bring database to exact crash state (including uncommitted changes)
- Undo phase: Roll back uncommitted transactions
- Why? Simpler than tracking dependencies during redo
- Key insight: Redo is idempotent, undo must be logged (CLRs)
Result: Correct redo implementation:
// Redo ALL records, not just committed
if page_lsn < record.lsn {
apply_redo(&record, &page)?; // Apply regardless of txn status
}
// Undo phase will handle uncommitted transactions
Summary: WAL and ARIES in One Diagram
Key Takeaways:
| Concept | Why It Matters |
|---|---|
| WAL | Durability without sacrificing performance |
| LSN | Total ordering of all changes |
| Checkpoints | Bound recovery time |
| ARIES Analysis | Determine what needs recovery |
| ARIES Redo | Replay to exact crash state |
| ARIES Undo | Roll back uncommitted work |
| CLRs | Idempotent undo, prevents re-undo |
Further Reading:
- โARIES: A Transaction Recovery Method Supporting Fine Granularity Lockingโ by Mohan et al. (1992)
- PostgreSQL Source:
src/backend/access/transam/xlog.c - PostgreSQL Source:
src/backend/access/transam/xlogfuncs.c - โDatabase Management Systemsโ by Ramakrishnan (Ch. 17: Recovery)
- โReadings in Database Systemsโ (Red Book) - ARIES chapter
- Vaultgres Repository: github.com/neoalienson/Vaultgres
Comments
Please accept the "Functionality" cookie category to view and post comments.
Comments failed to load. You can try again or view the discussion directly on GitHub.
View on GitHub