Database in Rust: Histogram Statistics for Accurate Selectivity Estimation

In Part 7, we built a cost-based optimizer. But there’s a problem.

Our selectivity estimates are guesses:

// From Part 7 - simplified (wrong!) estimates
fn estimate_selectivity(&self, expr: &Expression) -> f64 {
    match expr {
        BinaryOperator::Eq => 0.01,   // Always 1%? Wrong!
        BinaryOperator::Gt => 0.33,   // Always 33%? Wrong!
        _ => 0.5,                      // Always 50%? Very wrong!
    }
}

Real data isn’t uniform:

User balances in our database:

Balance Distribution:
$0-100:     ████████████████████████████████████████  80% of users
$100-1K:    ████████                                   15% of users
$1K-10K:    ██                                          4% of users
$10K-100K:  █                                           1% of users

Query: SELECT * FROM users WHERE balance > 100

Our estimate: 33% of rows (using fixed 0.33)
Actual: 20% of rows
→ Wrong plan chosen! Index scan would be better than seq scan.

Solution: Histograms that capture actual data distribution.

Today: implementing histogram statistics in Rust for accurate selectivity estimation.

1 Why Histograms Matter

The Uniform Distribution Fallacy

Without histograms, optimizers assume uniform distribution:

-- Table: users (1,000,000 rows)
-- Column: account_type ('free', 'premium', 'enterprise')

-- Reality:
'free':       950,000 rows (95%)
'premium':     45,000 rows (4.5%)
'enterprise':   5,000 rows (0.5%)

-- Query 1:
SELECT * FROM users WHERE account_type = 'enterprise'
-- Optimizer estimate (uniform): 1,000,000 / 3 = 333,333 rows
-- Actual: 5,000 rows
-- Error: 67x overestimate!

-- Query 2:
SELECT * FROM users WHERE account_type = 'free'
-- Optimizer estimate (uniform): 333,333 rows
-- Actual: 950,000 rows
-- Error: 3x underestimate!

With histograms: We know the actual distribution.

Impact on Join Order

SELECT u.name, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.account_type = 'enterprise'

Without histogram:

Optimizer thinks 'enterprise' = 333,333 rows
Plan: SeqScan(users) → Hash Join → SeqScan(orders)
Cost: 5000 (wrong!)

With histogram:

Histogram shows 'enterprise' = 5,000 rows
Plan: IndexScan(users) → Nested Loop → IndexScan(orders)
Cost: 100 (correct!)

Result: 50x faster query.

2 Histogram Types

Equi-Width Histograms

Equal bucket ranges, varying row counts:

Data: [1, 2, 3, 4, 5, 10, 11, 12, 13, 14]

Equi-Width (3 buckets, range 1-14):
┌─────────────────────────────────────────────────────────────┐
│ Bucket 1: [1-5]     → 5 rows  │████████████████████████████│
│ Bucket 2: [6-10]    → 1 row   │██████                      │
│ Bucket 3: [11-14]   → 4 rows  │██████████████████████      │
└─────────────────────────────────────────────────────────────┘

#[derive(Debug, Clone)]
pub struct EquiWidthHistogram {
    pub min_value: Value,
    pub max_value: Value,
    pub num_buckets: usize,
    pub buckets: Vec<WidthBucket>,
}

#[derive(Debug, Clone)]
pub struct WidthBucket {
    pub lower_bound: Value,
    pub upper_bound: Value,
    pub row_count: u64,
    pub distinct_count: u64,
    pub null_count: u64,
}

impl EquiWidthHistogram {
    pub fn build(values: &[Value], num_buckets: usize) -> Result<Self, HistogramError> {
        if values.is_empty() {
            return Err(HistogramError::EmptyData);
        }

        let (min, max) = Self::find_min_max(values)?;
        let bucket_width = max.subtract(&min).divide(num_buckets as f64);

        let mut buckets = vec![WidthBucket {
            lower_bound: min.clone(),
            upper_bound: min.clone(),
            row_count: 0,
            distinct_count: 0,
            null_count: 0,
        }; num_buckets];

        // Initialize bucket bounds
        for i in 0..num_buckets {
            buckets[i].lower_bound = min.add(bucket_width.multiply(i as f64));
            buckets[i].upper_bound = min.add(bucket_width.multiply((i + 1) as f64));
        }

        // Count rows in each bucket
        for value in values {
            if value.is_null() {
                buckets[0].null_count += 1;
                continue;
            }

            let bucket_idx = Self::find_bucket_index(value, &min, &bucket_width, num_buckets);
            if bucket_idx < num_buckets {
                buckets[bucket_idx].row_count += 1;
            }
        }

        Ok(Self {
            min_value: min,
            max_value: max,
            num_buckets,
            buckets,
        })
    }

    fn find_bucket_index(value: &Value, min: &Value, width: &Value, num_buckets: usize) -> usize {
        let offset = value.subtract(min);
        let bucket = offset.divide(width).to_f64().floor() as usize;
        bucket.min(num_buckets - 1)
    }
}

Pros: Simple, good for uniform data.

Cons: Poor for skewed data (most buckets empty, one bucket huge).

Equi-Depth Histograms (PostgreSQL Style)

Equal row counts per bucket, varying ranges:

Data: [1, 2, 3, 4, 5, 10, 11, 12, 13, 14]

Equi-Depth (3 buckets, ~3-4 rows each):
┌─────────────────────────────────────────────────────────────┐
│ Bucket 1: [1-4]     → 4 rows  │████████████████████████████│
│ Bucket 2: [5-11]    → 3 rows  │████████████████████████████│
│ Bucket 3: [12-14]   → 3 rows  │████████████████████████████│
└─────────────────────────────────────────────────────────────┘

#[derive(Debug, Clone)]
pub struct EquiDepthHistogram {
    pub num_buckets: usize,
    pub total_rows: u64,
    pub buckets: Vec<DepthBucket>,
}

#[derive(Debug, Clone)]
pub struct DepthBucket {
    pub lower_bound: Value,
    pub upper_bound: Value,
    pub row_count: u64,
    pub distinct_count: u64,
    pub cumulative_count: u64,  // Rows <= upper_bound
}

impl EquiDepthHistogram {
    pub fn build(values: &[Value], num_buckets: usize) -> Result<Self, HistogramError> {
        if values.is_empty() {
            return Err(HistogramError::EmptyData);
        }

        // Sort values
        let mut sorted_values = values.to_vec();
        sorted_values.sort();

        let total_rows = sorted_values.len() as u64;
        let target_bucket_size = (total_rows as f64 / num_buckets as f64).ceil() as usize;

        let mut buckets = Vec::new();
        let mut cumulative = 0u64;

        for i in 0..num_buckets {
            let start = i * target_bucket_size;
            let end = ((i + 1) * target_bucket_size).min(sorted_values.len());

            if start >= sorted_values.len() {
                break;
            }

            let bucket_values = &sorted_values[start..end];
            let row_count = bucket_values.len() as u64;
            let distinct_count = bucket_values.iter().collect::<std::collections::HashSet<_>>().len() as u64;

            cumulative += row_count;

            buckets.push(DepthBucket {
                lower_bound: bucket_values.first().unwrap().clone(),
                upper_bound: bucket_values.last().unwrap().clone(),
                row_count,
                distinct_count,
                cumulative_count: cumulative,
            });
        }

        Ok(Self {
            num_buckets,
            total_rows,
            buckets,
        })
    }

    /// Estimate selectivity for equality predicate
    pub fn estimate_eq(&self, value: &Value) -> f64 {
        for bucket in &self.buckets {
            if value >= &bucket.lower_bound && value <= &bucket.upper_bound {
                // Assume uniform distribution within bucket
                let bucket_selectivity = 1.0 / bucket.distinct_count.max(1) as f64;
                return bucket_selectivity * (bucket.row_count as f64 / self.total_rows as f64);
            }
        }
        0.0  // Value outside histogram range
    }

    /// Estimate selectivity for range predicate (value > X)
    pub fn estimate_gt(&self, value: &Value) -> f64 {
        let mut selectivity = 0.0;

        for bucket in &self.buckets {
            if value < &bucket.lower_bound {
                // Entire bucket matches
                selectivity += bucket.row_count as f64 / self.total_rows as f64;
            } else if value >= &bucket.lower_bound && value < &bucket.upper_bound {
                // Partial bucket - assume uniform within bucket
                let bucket_range = bucket.upper_bound.subtract(&bucket.lower_bound).to_f64();
                let matching_range = bucket.upper_bound.subtract(value).to_f64();
                let fraction = (matching_range / bucket_range).clamp(0.0, 1.0);
                selectivity += fraction * (bucket.row_count as f64 / self.total_rows as f64);
            }
            // else: value >= bucket.upper_bound, no match
        }

        selectivity
    }

    /// Estimate selectivity for range predicate (value < X)
    pub fn estimate_lt(&self, value: &Value) -> f64 {
        1.0 - self.estimate_gte(value)
    }

    /// Estimate selectivity for range predicate (value >= X)
    pub fn estimate_gte(&self, value: &Value) -> f64 {
        let mut selectivity = 0.0;

        for bucket in &self.buckets {
            if value <= &bucket.lower_bound {
                // Entire bucket matches
                selectivity += bucket.row_count as f64 / self.total_rows as f64;
            } else if value > &bucket.lower_bound && value <= &bucket.upper_bound {
                // Partial bucket
                let bucket_range = bucket.upper_bound.subtract(&bucket.lower_bound).to_f64();
                let matching_range = bucket.upper_bound.subtract(value).to_f64();
                let fraction = (matching_range / bucket_range).clamp(0.0, 1.0);
                selectivity += fraction * (bucket.row_count as f64 / self.total_rows as f64);
            }
        }

        selectivity
    }
}

Pros: Better for skewed data, each bucket has equal weight.

Cons: Bucket boundaries can be arbitrary for data with many duplicates.

Compressed Histograms (Handling Duplicates)

Separate handling for most common values (MCV):

Data: [1, 1, 1, 1, 1, 2, 3, 4, 5, 10, 11, 12, 13, 14]

Compressed Histogram:
┌─────────────────────────────────────────────────────────────┐
│ Most Common Values (MCV):                                   │
│   Value 1: frequency = 5/14 = 35.7%                        │
├─────────────────────────────────────────────────────────────┤
│ Histogram (for remaining values):                           │
│   Bucket 1: [2-5]   → 4 rows                                │
│   Bucket 2: [6-10]  → 1 row                                 │
│   Bucket 3: [11-14] → 4 rows                                │
└─────────────────────────────────────────────────────────────┘

#[derive(Debug, Clone)]
pub struct CompressedHistogram {
    pub mcv_list: MCVList,
    pub histogram: EquiDepthHistogram,
    pub null_fraction: f64,
}

#[derive(Debug, Clone)]
pub struct MCVList {
    pub values: Vec<(Value, f64)>,  // (value, frequency)
    pub total_mcv_frequency: f64,
}

impl CompressedHistogram {
    pub fn build(values: &[Value], num_buckets: usize, num_mcv: usize) -> Result<Self, HistogramError> {
        // Handle NULLs
        let null_count = values.iter().filter(|v| v.is_null()).count();
        let null_fraction = null_count as f64 / values.len() as f64;

        let non_null_values: Vec<_> = values.iter().filter(|v| !v.is_null()).collect();

        // Build MCV list
        let mcv_list = Self::build_mcv_list(&non_null_values, num_mcv);

        // Build histogram for non-MCV values
        let mcv_values: std::collections::HashSet<_> = 
            mcv_list.values.iter().map(|(v, _)| v).collect();
        let non_mcv_values: Vec<_> = non_null_values
            .iter()
            .filter(|v| !mcv_values.contains(v))
            .cloned()
            .collect();

        let histogram = if non_mcv_values.is_empty() {
            // All values are in MCV, create empty histogram
            EquiDepthHistogram {
                num_buckets: 0,
                total_rows: 0,
                buckets: Vec::new(),
            }
        } else {
            EquiDepthHistogram::build(&non_mcv_values, num_buckets)?
        };

        Ok(Self {
            mcv_list,
            histogram,
            null_fraction,
        })
    }

    fn build_mcv_list(values: &[Value], num_mcv: usize) -> MCVList {
        let mut value_counts: std::collections::HashMap<&Value, usize> = 
            std::collections::HashMap::new();

        for value in values {
            *value_counts.entry(value).or_insert(0) += 1;
        }

        let mut mcv: Vec<_> = value_counts.into_iter().collect();
        mcv.sort_by(|a, b| b.1.cmp(&a.1));  // Sort by count descending

        let total_mcv_frequency = mcv.iter().map(|(_, c)| *c as f64).sum::<f64>() / values.len() as f64;

        let mcv_values = mcv
            .into_iter()
            .take(num_mcv)
            .map(|(v, c)| (v.clone(), c as f64 / values.len() as f64))
            .collect();

        MCVList {
            values: mcv_values,
            total_mcv_frequency,
        }
    }

    /// Estimate selectivity with MCV awareness
    pub fn estimate_selectivity(&self, op: &BinaryOperator, value: &Value) -> f64 {
        // Check MCV first
        for (mcv_val, frequency) in &self.mcv_list.values {
            if mcv_val == value {
                return match op {
                    BinaryOperator::Eq => *frequency,
                    BinaryOperator::Neq => 1.0 - frequency,
                    _ => {
                        // For range ops, MCV exact match doesn't help much
                        // Fall through to histogram estimation
                        0.0
                    }
                };
            }
        }

        // Not in MCV, use histogram
        let histogram_selectivity = match op {
            BinaryOperator::Eq => self.histogram.estimate_eq(value),
            BinaryOperator::Gt => self.histogram.estimate_gt(value),
            BinaryOperator::Lt => self.histogram.estimate_lt(value),
            BinaryOperator::Gte => self.histogram.estimate_gte(value),
            BinaryOperator::Lte => 1.0 - self.histogram.estimate_gt(value),
            _ => 0.5,
        };

        // Adjust for non-MCV portion
        let non_mcv_fraction = 1.0 - self.mcv_list.total_mcv_frequency;
        histogram_selectivity * non_mcv_fraction
    }
}

Pros: Best of both worlds - accurate for common values, good distribution for rest.

Cons: More complex, requires more memory.

3 Sampling for Large Tables

The Problem: Full Table Scans Are Expensive

-- Table: users (100 million rows)
-- Building histogram requires sorting all values

-- Full scan approach:
SELECT balance FROM users ORDER BY balance;
-- Time: 30 minutes (!)
-- I/O: Read entire table

-- Not practical for routine ANALYZE

Solution: Statistical Sampling

// src/optimizer/sampling.rs
use rand::seq::SliceRandom;
use rand::thread_rng;

pub struct Sampler {
    sample_size: usize,
    confidence: f64,
}

impl Sampler {
    pub fn new(sample_size: usize, confidence: f64) -> Self {
        Self { sample_size, confidence }
    }

    /// Reservoir sampling for streaming data
    pub fn reservoir_sample<T: Clone>(
        &self,
        stream: impl Iterator<Item = T>,
    ) -> Vec<T> {
        let mut reservoir = Vec::with_capacity(self.sample_size);
        let mut count = 0;

        for item in stream {
            count += 1;

            if reservoir.len() < self.sample_size {
                reservoir.push(item);
            } else {
                // Replace with probability sample_size / count
                let j = thread_rng().gen_range(0..count);
                if j < self.sample_size {
                    reservoir[j] = item;
                }
            }
        }

        reservoir
    }

    /// Simple random sampling for in-memory data
    pub fn simple_random_sample<T: Clone>(&self, values: &[T]) -> Vec<T> {
        let mut rng = thread_rng();
        let mut sample: Vec<T> = values.to_vec();
        sample.shuffle(&mut rng);
        sample.truncate(self.sample_size);
        sample
    }

    /// Calculate confidence interval for estimate
    pub fn confidence_interval(&self, sample_proportion: f64, sample_size: usize) -> (f64, f64) {
        // z-score for confidence level (1.96 for 95%)
        let z = match self.confidence {
            0.99 => 2.576,
            0.95 => 1.96,
            0.90 => 1.645,
            _ => 1.96,
        };

        let standard_error = (sample_proportion * (1.0 - sample_proportion) / sample_size as f64).sqrt();
        let margin = z * standard_error;

        (sample_proportion - margin, sample_proportion + margin)
    }
}

// Usage in histogram building
impl EquiDepthHistogram {
    pub fn build_from_sample(
        values: &[Value],
        num_buckets: usize,
        sample_size: usize,
    ) -> Result<Self, HistogramError> {
        let sampler = Sampler::new(sample_size, 0.95);
        let sample = sampler.simple_random_sample(values);

        // Build histogram from sample
        let mut histogram = Self::build(&sample, num_buckets)?;

        // Scale row counts to match full table
        let scale_factor = values.len() as f64 / sample.len() as f64;
        for bucket in &mut histogram.buckets {
            bucket.row_count = (bucket.row_count as f64 * scale_factor).round() as u64;
            bucket.cumulative_count = (bucket.cumulative_count as f64 * scale_factor).round() as u64;
        }
        histogram.total_rows = values.len() as u64;

        Ok(histogram)
    }
}

Sample Size Guidelines

Table Size	Recommended Sample	Accuracy (95% CI)
< 10K rows	100% (full scan)	Exact
10K-1M rows	10%	±1%
1M-100M rows	1%	±0.1%
> 100M rows	0.1% or 100K rows	±0.03%

4 Multi-Column Statistics

The Problem: Column Correlation

-- Table: users (1,000,000 rows)
-- Columns: country, city

-- Individual column statistics:
-- country: 'US' = 50%, 'UK' = 30%, 'DE' = 20%
-- city: 'London' = 5%, 'Berlin' = 10%, 'Munich' = 5%

-- Query:
SELECT * FROM users WHERE country = 'UK' AND city = 'London'

-- Optimizer (assuming independence):
-- Selectivity = 0.30 × 0.05 = 0.015 = 1.5%
-- Estimated rows: 15,000

-- Reality (London only exists in UK):
-- Actual selectivity = 5% (all London users are in UK)
-- Actual rows: 50,000

-- Error: 3.3x underestimate!

Multi-Column Histograms

#[derive(Debug, Clone)]
pub struct MultiColumnHistogram {
    pub columns: Vec<String>,
    pub num_buckets: usize,
    pub total_rows: u64,
    pub buckets: Vec<MultiColumnBucket>,
}

#[derive(Debug, Clone)]
pub struct MultiColumnBucket {
    pub bounds: Vec<(Value, Value)>,  // (min, max) for each column
    pub row_count: u64,
    pub distinct_count: u64,
}

impl MultiColumnHistogram {
    pub fn build(
        data: &[Vec<Value>],  // Each row is [col1_value, col2_value, ...]
        columns: Vec<String>,
        num_buckets: usize,
    ) -> Result<Self, HistogramError> {
        if data.is_empty() {
            return Err(HistogramError::EmptyData);
        }

        // Use k-d tree style partitioning for multi-dimensional buckets
        let buckets = Self::build_kd_histogram(data, num_buckets)?;

        Ok(Self {
            columns,
            num_buckets,
            total_rows: data.len() as u64,
            buckets,
        })
    }

    fn build_kd_histogram(
        data: &[Vec<Value>],
        num_buckets: usize,
    ) -> Result<Vec<MultiColumnBucket>, HistogramError> {
        // Simplified: split on dimension with highest variance
        let num_dims = data[0].len();
        let target_bucket_size = data.len() / num_buckets;

        let mut buckets = Vec::new();
        Self::partition_data(data, 0, num_buckets, &mut buckets, target_bucket_size)?;

        Ok(buckets)
    }

    fn partition_data(
        data: &[Vec<Value>],
        depth: usize,
        remaining_buckets: usize,
        buckets: &mut Vec<MultiColumnBucket>,
        target_size: usize,
    ) -> Result<(), HistogramError> {
        if remaining_buckets == 1 || data.len() <= target_size {
            // Create leaf bucket
            let bucket = Self::create_bucket(data)?;
            buckets.push(bucket);
            return Ok(());
        }

        // Split on dimension with highest range
        let dim = depth % data[0].len();
        let mut sorted = data.to_vec();
        sorted.sort_by(|a, b| a[dim].cmp(&b[dim]));

        let mid = sorted.len() / 2;
        let (left, right) = sorted.split_at(mid);

        Self::partition_data(left, depth + 1, remaining_buckets / 2, buckets, target_size)?;
        Self::partition_data(right, depth + 1, remaining_buckets - remaining_buckets / 2, buckets, target_size)?;

        Ok(())
    }

    fn create_bucket(data: &[Vec<Value>]) -> Result<MultiColumnBucket, HistogramError> {
        let num_dims = data[0].len();
        let mut bounds = Vec::new();

        for dim in 0..num_dims {
            let mut values: Vec<_> = data.iter().map(|row| &row[dim]).collect();
            values.sort();
            bounds.push((values.first().unwrap().clone(), values.last().unwrap().clone()));
        }

        let distinct_count = data.iter().collect::<std::collections::HashSet<_>>().len() as u64;

        Ok(MultiColumnBucket {
            bounds,
            row_count: data.len() as u64,
            distinct_count,
        })
    }

    pub fn estimate_selectivity(&self, predicates: &[(usize, BinaryOperator, Value)]) -> f64 {
        let mut selectivity = 0.0;

        for bucket in &self.buckets {
            let bucket_selectivity = self.estimate_bucket_selectivity(bucket, predicates);
            selectivity += bucket_selectivity * (bucket.row_count as f64 / self.total_rows as f64);
        }

        selectivity
    }

    fn estimate_bucket_selectivity(
        &self,
        bucket: &MultiColumnBucket,
        predicates: &[(usize, BinaryOperator, Value)],
    ) -> f64 {
        let mut bucket_selectivity = 1.0;

        for (col_idx, op, value) in predicates {
            if *col_idx >= bucket.bounds.len() {
                continue;
            }

            let (min, max) = &bucket.bounds[*col_idx];

            let col_selectivity = match op {
                BinaryOperator::Eq => {
                    if value >= min && value <= max {
                        1.0 / bucket.distinct_count.max(1) as f64
                    } else {
                        0.0
                    }
                }
                BinaryOperator::Gt => {
                    if value < min {
                        1.0
                    } else if value >= max {
                        0.0
                    } else {
                        let range = max.subtract(min).to_f64();
                        let matching = max.subtract(value).to_f64();
                        (matching / range).clamp(0.0, 1.0)
                    }
                }
                // ... handle other operators
                _ => 0.5,
            };

            bucket_selectivity *= col_selectivity;
        }

        bucket_selectivity
    }
}

When to use multi-column stats:

Scenario	Recommendation
Columns are independent	Single-column histograms
Strong correlation (city → country)	Multi-column histogram
Frequently used together in WHERE	Multi-column histogram
High cardinality combination	May not be worth it

5 Using Histograms in the Optimizer

Integrating with Cost Model

// src/optimizer/histogram_selectivity.rs
pub struct HistogramSelectivityEstimator {
    statistics: Arc<StatisticsCatalog>,
}

impl HistogramSelectivityEstimator {
    pub fn new(statistics: Arc<StatisticsCatalog>) -> Self {
        Self { statistics }
    }

    pub fn estimate(&self, table: &str, column: &str, expr: &Expression) -> f64 {
        let table_stats = match self.statistics.get_table(table) {
            Some(stats) => stats,
            None => return 0.5,  // No stats, use default
        };

        let col_stats = match table_stats.columns.get(column) {
            Some(stats) => stats,
            None => return 0.5,
        };

        match &col_stats.histogram {
            Some(Histogram::Compressed(comp)) => {
                match expr {
                    Expression::BinaryOp { op, right, .. } => {
                        comp.estimate_selectivity(op, right)
                    }
                    Expression::InList { list, .. } => {
                        // Sum selectivity for each value
                        list.iter()
                            .map(|v| comp.estimate_selectivity(&BinaryOperator::Eq, v))
                            .sum()
                    }
                    Expression::Between { low, high, .. } => {
                        let low_sel = comp.estimate_selectivity(&BinaryOperator::Gte, low);
                        let high_sel = comp.estimate_selectivity(&BinaryOperator::Lte, high);
                        (low_sel + high_sel - 1.0).max(0.0)
                    }
                    _ => 0.5,
                }
            }
            Some(Histogram::EquiDepth(equi)) => {
                match expr {
                    Expression::BinaryOp { op, right, .. } => {
                        match op {
                            BinaryOperator::Eq => equi.estimate_eq(right),
                            BinaryOperator::Gt => equi.estimate_gt(right),
                            BinaryOperator::Lt => equi.estimate_lt(right),
                            BinaryOperator::Gte => equi.estimate_gte(right),
                            BinaryOperator::Lte => 1.0 - equi.estimate_gt(right),
                            _ => 0.5,
                        }
                    }
                    _ => 0.5,
                }
            }
            None => {
                // No histogram, fall back to MCV or default
                self.estimate_from_mcv(col_stats, expr)
            }
        }
    }

    fn estimate_from_mcv(&self, col_stats: &ColumnStatistics, expr: &Expression) -> f64 {
        match expr {
            Expression::BinaryOp { op: BinaryOperator::Eq, right, .. } => {
                // Check if value is in MCV list
                for (mcv_val, frequency) in &col_stats.most_common_values {
                    if mcv_val == right {
                        return *frequency;
                    }
                }
                // Not in MCV, estimate based on distinct count
                1.0 / col_stats.distinct_count.max(1) as f64
            }
            _ => 0.5,  // Default for other operators
        }
    }
}

// Updated cost model using histograms
impl CostModel {
    pub fn estimate_selectivity_with_histograms(
        &self,
        estimator: &HistogramSelectivityEstimator,
        table: &str,
        expr: &Expression,
    ) -> f64 {
        match expr {
            Expression::BinaryOp { op, left, right, .. } => {
                // Find which side is a column reference
                if let Expression::Identifier(ident) = left.as_ref() {
                    return estimator.estimate(table, &ident.value, &Expression::BinaryOp {
                        op: op.clone(),
                        left: Box::new(Expression::Identifier(ident.clone())),
                        right: right.clone(),
                    });
                }
                // Default estimate
                0.5
            }
            Expression::InList { expr, list, negated } => {
                let base_selectivity = list.iter()
                    .map(|v| estimator.estimate(table, "column", v))
                    .sum::<f64>()
                    .min(1.0);
                if *negated { 1.0 - base_selectivity } else { base_selectivity }
            }
            _ => 0.5,
        }
    }
}

Example: Real Query Optimization

-- Table statistics (from ANALYZE):
-- users: 1,000,000 rows
-- users.balance: histogram shows 80% have balance < $100

-- Query:
SELECT * FROM users WHERE balance > 100

-- Without histogram:
-- Selectivity estimate: 0.33 (fixed guess)
-- Estimated rows: 333,333
-- Chosen plan: SeqScan (cost: 5000)

-- With histogram:
-- Selectivity estimate: 0.20 (from histogram)
-- Estimated rows: 200,000
-- Chosen plan: IndexScan (cost: 2000) ← Better!

6 Histogram Maintenance

When to Update Statistics

// src/optimizer/stats_maintenance.rs
pub struct StatsMaintenancePolicy {
    auto_analyze_threshold: u64,  // Rows modified before auto-analyze
    auto_analyze_scale_factor: f64,  // Fraction of table
    last_analyze_threshold: chrono::Duration,  // Max time since last analyze
}

impl Default for StatsMaintenancePolicy {
    fn default() -> Self {
        Self {
            auto_analyze_threshold: 50,  // PostgreSQL default
            auto_analyze_scale_factor: 0.2,  // 20% of table
            last_analyze_threshold: chrono::Duration::days(7),
        }
    }
}

pub struct StatsMaintenanceManager {
    catalog: Arc<Catalog>,
    policy: StatsMaintenancePolicy,
    modification_counts: HashMap<String, u64>,  // table → rows modified
}

impl StatsMaintenanceManager {
    pub fn check_and_analyze(&mut self, table: &str) -> Result<bool, AnalyzerError> {
        let table_stats = self.catalog.get_table_statistics(table);
        let modified_rows = self.modification_counts.get(table).copied().unwrap_or(0);

        let needs_analyze = if let Some(stats) = table_stats {
            let threshold = self.policy.auto_analyze_threshold
                + (stats.row_count as f64 * self.policy.auto_analyze_scale_factor) as u64;
            modified_rows >= threshold
        } else {
            true  // No stats, need initial analyze
        };

        if needs_analyze {
            self.analyze_table(table)?;
            self.modification_counts.insert(table.to_string(), 0);
            Ok(true)
        } else {
            Ok(false)
        }
    }

    pub fn record_modification(&mut self, table: &str, rows_affected: u64) {
        let count = self.modification_counts.entry(table.to_string()).or_insert(0);
        *count += rows_affected;
    }

    fn analyze_table(&self, table: &str) -> Result<(), AnalyzerError> {
        // Run ANALYZE on the table
        let analyzer = StatisticsAnalyzer::new(self.catalog.clone());
        let stats = analyzer.analyze_table(table)?;
        self.catalog.store_statistics(&stats)?;
        Ok(())
    }
}

Incremental Histogram Updates

Problem: Full rebuild is expensive for large tables.

Solution: Incremental updates for small changes.

impl EquiDepthHistogram {
    /// Update histogram with new values (for small changes)
    pub fn update(&mut self, new_values: &[Value], deleted_values: &[Value]) {
        // Update total row count
        self.total_rows += new_values.len() as u64;
        self.total_rows -= deleted_values.len() as u64;

        // For each new value, find its bucket and increment count
        for value in new_values {
            for bucket in &mut self.buckets {
                if value >= &bucket.lower_bound && value <= &bucket.upper_bound {
                    bucket.row_count += 1;
                    bucket.cumulative_count += 1;
                    break;
                }
            }
        }

        // For deleted values, decrement counts
        for value in deleted_values {
            for bucket in &mut self.buckets {
                if value >= &bucket.lower_bound && value <= &bucket.upper_bound {
                    bucket.row_count = bucket.row_count.saturating_sub(1);
                    bucket.cumulative_count = bucket.cumulative_count.saturating_sub(1);
                    break;
                }
            }
        }

        // If too many changes, trigger full rebuild
        let total_changes = new_values.len() + deleted_values.len();
        if total_changes as f64 / self.total_rows as f64 > 0.1 {
            // Mark for full rebuild
            self.needs_rebuild = true;
        }
    }
}

7 Challenges Building in Rust

Challenge 1: Value Comparison

Problem: SQL values can be different types (int, float, string, date).

// ❌ Doesn't work - can't compare different types
pub enum Value {
    Integer(i64),
    Float(f64),
    String(String),
    Date(chrono::NaiveDate),
}

impl Ord for Value {
    fn cmp(&self, other: &Self) -> Ordering {
        // Can't compare Integer with String!
    }
}

Solution: Type-aware comparison

// ✅ Works - handle type mismatches
impl Value {
    pub fn compare(&self, other: &Self) -> Result<Ordering, ValueError> {
        match (self, other) {
            (Value::Integer(a), Value::Integer(b)) => Ok(a.cmp(b)),
            (Value::Float(a), Value::Float(b)) => a.partial_cmp(b).ok_or(ValueError::NaN),
            (Value::String(a), Value::String(b)) => Ok(a.cmp(b)),
            (Value::Integer(a), Value::Float(b)) => (*a as f64).partial_cmp(b).ok_or(ValueError::NaN),
            (Value::Float(a), Value::Integer(b)) => a.partial_cmp(&(*b as f64)).ok_or(ValueError::NaN),
            _ => Err(ValueError::TypeMismatch),
        }
    }
}

Challenge 2: Memory Usage

Problem: Histograms for many columns can use significant memory.

// ❌ Memory explosion
// 1000 tables × 20 columns × 100 buckets × 32 bytes = 64 MB

Solution: Compress bucket bounds

// ✅ Compressed representation
pub struct CompressedBucket {
    pub lower_bound_idx: u32,  // Index into shared value pool
    pub upper_bound_idx: u32,
    pub row_count: u32,
}

pub struct CompressedHistogram {
    shared_values: Vec<Value>,  // Deduplicated values
    buckets: Vec<CompressedBucket>,
}

Challenge 3: Thread-Safe Statistics

Problem: Statistics need to be read during query planning while being updated by ANALYZE.

// ❌ Race condition
pub struct StatisticsCatalog {
    tables: HashMap<String, TableStatistics>,
}

Solution: Arc<RwLock<>> for concurrent access

// ✅ Thread-safe
pub struct StatisticsCatalog {
    tables: Arc<RwLock<HashMap<String, Arc<TableStatistics>>>>,
}

impl StatisticsCatalog {
    pub fn get_table(&self, name: &str) -> Option<Arc<TableStatistics>> {
        let tables = self.tables.read().unwrap();
        tables.get(name).cloned()
    }

    pub fn update_table(&self, stats: TableStatistics) {
        let mut tables = self.tables.write().unwrap();
        tables.insert(stats.table_name.clone(), Arc::new(stats));
    }
}

8 How AI Accelerated This

What AI Got Right

Task	AI Contribution
Histogram types	Explained equi-width vs. equi-depth trade-offs
MCV handling	Suggested separate MCV list for common values
Sampling	Reservoir sampling algorithm for streaming
Selectivity formulas	Correct range estimation within buckets

What AI Got Wrong

Issue	What Happened
Bucket boundaries	First draft had off-by-one errors in boundary calculation
Multi-column correlation	Initially assumed independence (wrong for correlated columns)
NULL handling	Forgot to track NULL fraction separately
Incremental updates	Suggested full rebuild for any change (too expensive)

Pattern: AI handles textbook cases well. Edge cases (boundaries, NULLs, incremental) need manual refinement.

Example: Debugging Selectivity

My question to AI:

“Histogram shows 20% of rows have balance > 100, but optimizer estimates 5%. Why?”

What I learned:

MCV list wasn’t being checked first
Non-MCV fraction adjustment was missing
Range estimation within bucket was inverted

Result: Fixed selectivity estimation:

pub fn estimate_selectivity(&self, op: &BinaryOperator, value: &Value) -> f64 {
    // Check MCV first
    for (mcv_val, frequency) in &self.mcv_list.values {
        if mcv_val == value {
            return match op {
                BinaryOperator::Eq => *frequency,
                BinaryOperator::Neq => 1.0 - frequency,
                _ => 0.0  // Fall through for range ops
            };
        }
    }

    // Use histogram for non-MCV values
    let histogram_sel = self.histogram.estimate_gt(value);
    let non_mcv_fraction = 1.0 - self.mcv_list.total_mcv_frequency;
    histogram_sel * non_mcv_fraction  // ← Was missing this!
}

Summary: Histogram Statistics in One Diagram

flowchart TD subgraph "Data Collection" A[Table Data] --> B{Sample or Full?} B -->|Small table| C[Full Scan] B -->|Large table| D[Reservoir Sampling] end subgraph "Histogram Building" C --> E[Sort Values] D --> E E --> F{Build Type?} F -->|Uniform data| G[Equi-Width] F -->|Skewed data| H[Equi-Depth] F -->|With duplicates| I[Compressed + MCV] end subgraph "Multi-Column" J[Correlated Columns] --> K[Multi-Dimensional Buckets] end subgraph "Usage in Optimizer" H --> L[Selectivity Estimation] I --> L K --> L L --> M[Cost Calculation] M --> N[Plan Selection] end subgraph "Maintenance" O[INSERT/UPDATE/DELETE] --> P[Track Modifications] P --> Q{Threshold Reached?} Q -->|Yes| R[Auto-ANALYZE] Q -->|No| S[Incremental Update] R --> E end style A fill:#e3f2fd,stroke:#1976d2 style N fill:#e8f5e9,stroke:#388e3c style L fill:#fff3e0,stroke:#f57c00

Key Takeaways:

Concept	Why It Matters
Equi-depth histograms	Better for skewed data than equi-width
MCV lists	Accurate for common values
Sampling	Makes ANALYZE practical for large tables
Multi-column stats	Captures column correlation
Incremental updates	Avoids full rebuild for small changes
Thread-safe access	Allows concurrent reads during updates

Further Reading:

PostgreSQL Source: src/backend/commands/analyze.c
“Random Sampling for Histogram Construction” by Vitter et al.
“Selectivity Estimation for Range Queries” by Ioannidis et al.
PostgreSQL Documentation: “Statistics Used by the Planner”
Vaultgres Repository: github.com/neoalienson/Vaultgres

Comments

Please accept the "Functionality" cookie category to view and post comments.