Building a PostgreSQL-Compatible Database in Rust: Histogram Statistics for Accurate Selectivity Estimation

  1. Why Histograms Matter
  2. Histogram Types
  3. Sampling for Large Tables
  4. Multi-Column Statistics
  5. Using Histograms in the Optimizer
  6. Histogram Maintenance
  7. Challenges of Building in Rust
  8. How AI Accelerated This Work
  9. Summary: Histogram Statistics in One Diagram

In Part 7, we built a cost-based optimizer. But there's a problem.

Our selectivity estimates were guesses:

// From Part 7 - simplified (wrong!) estimates
fn estimate_selectivity(&self, expr: &Expression) -> f64 {
    match expr {
        Expression::BinaryOp { op: BinaryOperator::Eq, .. } => 0.01,  // Always 1%? Wrong!
        Expression::BinaryOp { op: BinaryOperator::Gt, .. } => 0.33,  // Always 33%? Wrong!
        _ => 0.5,                                                     // Always 50%? Very wrong!
    }
}

Real data isn't uniform:

User balances in our database:

Balance Distribution:
$0-100:     ████████████████████████████████████████  80% of users
$100-1K:    ████████                                   15% of users
$1K-10K:    ██                                          4% of users
$10K-100K:  █                                           1% of users

Query: SELECT * FROM users WHERE balance > 100

Our estimate: 33% of rows (using fixed 0.33)
Actual: 20% of rows
→ Wrong plan chosen! Index scan would be better than seq scan.

The solution: histograms that capture the actual data distribution.

Today: implementing histogram statistics in Rust for accurate selectivity estimation.


1 Why Histograms Matter

The Uniform Distribution Fallacy

Without histograms, the optimizer assumes a uniform distribution:

-- Table: users (1,000,000 rows)
-- Column: account_type ('free', 'premium', 'enterprise')

-- Reality:
'free':       950,000 rows (95%)
'premium':     45,000 rows (4.5%)
'enterprise':   5,000 rows (0.5%)

-- Query 1:
SELECT * FROM users WHERE account_type = 'enterprise'
-- Optimizer estimate (uniform): 1,000,000 / 3 = 333,333 rows
-- Actual: 5,000 rows
-- Error: 67x overestimate!

-- Query 2:
SELECT * FROM users WHERE account_type = 'free'
-- Optimizer estimate (uniform): 333,333 rows
-- Actual: 950,000 rows
-- Error: 3x underestimate!

With histograms: we know the actual distribution.
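A quick standalone check, using the `account_type` row counts from the example above, shows how large the error gets under the uniform assumption:

```rust
// Uniform-assumption error check for the account_type example.
fn uniform_estimate(total_rows: f64, distinct_values: f64) -> f64 {
    // Without a histogram: every value assumed equally likely
    total_rows / distinct_values
}

fn error_factor(estimated: f64, actual: f64) -> f64 {
    // Symmetric "how many times off" measure
    if estimated > actual { estimated / actual } else { actual / estimated }
}

fn main() {
    let est = uniform_estimate(1_000_000.0, 3.0); // ~333,333 rows for any value
    println!("'enterprise' error: {:.0}x", error_factor(est, 5_000.0));   // ~67x off
    println!("'free' error: {:.1}x", error_factor(est, 950_000.0));       // ~3x off
}
```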


Impact on Join Ordering

SELECT u.name, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.account_type = 'enterprise'

Without histograms:

Optimizer thinks 'enterprise' = 333,333 rows
Plan: SeqScan(users) → Hash Join → SeqScan(orders)
Cost: 5000 (wrong!)

With histograms:

Histogram shows 'enterprise' = 5,000 rows
Plan: IndexScan(users) → Nested Loop → IndexScan(orders)
Cost: 100 (correct!)

The result: the query runs 50x faster.


2 Histogram Types

Equi-Width Histograms

Equal bucket ranges, varying row counts:

Data: [1, 2, 3, 4, 5, 10, 11, 12, 13, 14]

Equi-Width (3 buckets, range 1-14):
┌─────────────────────────────────────────────────────────────┐
│ Bucket 1: [1-5]     → 5 rows  │████████████████████████████│
│ Bucket 2: [6-10]    → 1 row   │██████                      │
│ Bucket 3: [11-14]   → 4 rows  │██████████████████████      │
└─────────────────────────────────────────────────────────────┘

#[derive(Debug, Clone)]
pub struct EquiWidthHistogram {
    pub min_value: Value,
    pub max_value: Value,
    pub num_buckets: usize,
    pub buckets: Vec<WidthBucket>,
}

#[derive(Debug, Clone)]
pub struct WidthBucket {
    pub lower_bound: Value,
    pub upper_bound: Value,
    pub row_count: u64,
    pub distinct_count: u64,
    pub null_count: u64,
}

impl EquiWidthHistogram {
    pub fn build(values: &[Value], num_buckets: usize) -> Result<Self, HistogramError> {
        if values.is_empty() {
            return Err(HistogramError::EmptyData);
        }

        let (min, max) = Self::find_min_max(values)?;
        let bucket_width = max.subtract(&min).divide(num_buckets as f64);

        let mut buckets = vec![WidthBucket {
            lower_bound: min.clone(),
            upper_bound: min.clone(),
            row_count: 0,
            distinct_count: 0,
            null_count: 0,
        }; num_buckets];

        // Initialize bucket bounds
        for i in 0..num_buckets {
            buckets[i].lower_bound = min.add(bucket_width.multiply(i as f64));
            buckets[i].upper_bound = min.add(bucket_width.multiply((i + 1) as f64));
        }

        // Count rows in each bucket
        for value in values {
            if value.is_null() {
                buckets[0].null_count += 1;
                continue;
            }

            let bucket_idx = Self::find_bucket_index(value, &min, &bucket_width, num_buckets);
            if bucket_idx < num_buckets {
                buckets[bucket_idx].row_count += 1;
            }
        }

        Ok(Self {
            min_value: min,
            max_value: max,
            num_buckets,
            buckets,
        })
    }

    fn find_bucket_index(value: &Value, min: &Value, width: &Value, num_buckets: usize) -> usize {
        let offset = value.subtract(min);
        let bucket = offset.divide(width).to_f64().floor() as usize;
        bucket.min(num_buckets - 1)
    }
}

Pros: simple; works well for uniform data.

Cons: poor for skewed data (most buckets end up empty while one is huge).
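To see that failure mode concretely, here is a standalone sketch over plain `f64` values (the article's `Value` arithmetic is simplified away) that buckets a skewed balance column equi-width; almost every row lands in the first bucket:

```rust
// Equi-width bucketing over f64, demonstrating skew: one outlier
// stretches the range so most buckets stay nearly empty.
fn equi_width_counts(values: &[f64], num_buckets: usize) -> Vec<u64> {
    let min = values.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = values.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let width = (max - min) / num_buckets as f64;

    let mut counts = vec![0u64; num_buckets];
    for &v in values {
        // Clamp the max value into the last bucket
        let idx = (((v - min) / width) as usize).min(num_buckets - 1);
        counts[idx] += 1;
    }
    counts
}

fn main() {
    // Skewed balances: 80% small, a couple of large outliers
    let balances = [5.0, 10.0, 20.0, 40.0, 80.0, 90.0, 95.0, 99.0, 500.0, 10_000.0];
    // Buckets span [5, 10000] in widths of ~2499: bucket 0 gets 9 rows,
    // two buckets are empty, the last gets only the outlier.
    println!("{:?}", equi_width_counts(&balances, 4));
}
```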


Equi-Depth Histograms (PostgreSQL-Style)

Equal row counts per bucket, varying ranges:

Data: [1, 2, 3, 4, 5, 10, 11, 12, 13, 14]

Equi-Depth (3 buckets, ~3-4 rows each):
┌─────────────────────────────────────────────────────────────┐
│ Bucket 1: [1-4]     → 4 rows  │████████████████████████████│
│ Bucket 2: [5-11]    → 3 rows  │████████████████████████████│
│ Bucket 3: [12-14]   → 3 rows  │████████████████████████████│
└─────────────────────────────────────────────────────────────┘

#[derive(Debug, Clone)]
pub struct EquiDepthHistogram {
    pub num_buckets: usize,
    pub total_rows: u64,
    pub buckets: Vec<DepthBucket>,
}

#[derive(Debug, Clone)]
pub struct DepthBucket {
    pub lower_bound: Value,
    pub upper_bound: Value,
    pub row_count: u64,
    pub distinct_count: u64,
    pub cumulative_count: u64,  // Rows <= upper_bound
}

impl EquiDepthHistogram {
    pub fn build(values: &[Value], num_buckets: usize) -> Result<Self, HistogramError> {
        if values.is_empty() {
            return Err(HistogramError::EmptyData);
        }

        // Sort values
        let mut sorted_values = values.to_vec();
        sorted_values.sort();

        let total_rows = sorted_values.len() as u64;
        let target_bucket_size = (total_rows as f64 / num_buckets as f64).ceil() as usize;

        let mut buckets = Vec::new();
        let mut cumulative = 0u64;

        for i in 0..num_buckets {
            let start = i * target_bucket_size;
            let end = ((i + 1) * target_bucket_size).min(sorted_values.len());

            if start >= sorted_values.len() {
                break;
            }

            let bucket_values = &sorted_values[start..end];
            let row_count = bucket_values.len() as u64;
            let distinct_count = bucket_values.iter().collect::<std::collections::HashSet<_>>().len() as u64;

            cumulative += row_count;

            buckets.push(DepthBucket {
                lower_bound: bucket_values.first().unwrap().clone(),
                upper_bound: bucket_values.last().unwrap().clone(),
                row_count,
                distinct_count,
                cumulative_count: cumulative,
            });
        }

        Ok(Self {
            num_buckets,
            total_rows,
            buckets,
        })
    }

    /// Estimate selectivity for equality predicate
    pub fn estimate_eq(&self, value: &Value) -> f64 {
        for bucket in &self.buckets {
            if value >= &bucket.lower_bound && value <= &bucket.upper_bound {
                // Assume uniform distribution within bucket
                let bucket_selectivity = 1.0 / bucket.distinct_count.max(1) as f64;
                return bucket_selectivity * (bucket.row_count as f64 / self.total_rows as f64);
            }
        }
        0.0  // Value outside histogram range
    }

    /// Estimate selectivity for range predicate (value > X)
    pub fn estimate_gt(&self, value: &Value) -> f64 {
        let mut selectivity = 0.0;

        for bucket in &self.buckets {
            if value < &bucket.lower_bound {
                // Entire bucket matches
                selectivity += bucket.row_count as f64 / self.total_rows as f64;
            } else if value >= &bucket.lower_bound && value < &bucket.upper_bound {
                // Partial bucket - assume uniform within bucket
                let bucket_range = bucket.upper_bound.subtract(&bucket.lower_bound).to_f64();
                let matching_range = bucket.upper_bound.subtract(value).to_f64();
                let fraction = (matching_range / bucket_range).clamp(0.0, 1.0);
                selectivity += fraction * (bucket.row_count as f64 / self.total_rows as f64);
            }
            // else: value >= bucket.upper_bound, no match
        }

        selectivity
    }

    /// Estimate selectivity for range predicate (value < X)
    pub fn estimate_lt(&self, value: &Value) -> f64 {
        1.0 - self.estimate_gte(value)
    }

    /// Estimate selectivity for range predicate (value >= X)
    pub fn estimate_gte(&self, value: &Value) -> f64 {
        let mut selectivity = 0.0;

        for bucket in &self.buckets {
            if value <= &bucket.lower_bound {
                // Entire bucket matches
                selectivity += bucket.row_count as f64 / self.total_rows as f64;
            } else if value > &bucket.lower_bound && value <= &bucket.upper_bound {
                // Partial bucket
                let bucket_range = bucket.upper_bound.subtract(&bucket.lower_bound).to_f64();
                let matching_range = bucket.upper_bound.subtract(value).to_f64();
                let fraction = (matching_range / bucket_range).clamp(0.0, 1.0);
                selectivity += fraction * (bucket.row_count as f64 / self.total_rows as f64);
            }
        }

        selectivity
    }
}

Pros: better for skewed data; every bucket carries equal weight.

Cons: with heavily duplicated data, bucket boundaries can be arbitrary.
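The same skewed data behaves much better equi-depth. A standalone sketch over already-sorted `f64` values (using `chunks` as a simplified stand-in for the `Value`-based builder above):

```rust
// Equi-depth bucket bounds: roughly equal row counts, unequal ranges.
fn equi_depth_bounds(sorted: &[f64], num_buckets: usize) -> Vec<(f64, f64)> {
    // Ceiling division so the bucket count never exceeds num_buckets
    let size = (sorted.len() + num_buckets - 1) / num_buckets;
    sorted
        .chunks(size)
        .map(|chunk| (chunk[0], *chunk.last().unwrap()))
        .collect()
}

fn main() {
    let balances = [5.0, 10.0, 20.0, 40.0, 80.0, 90.0, 95.0, 99.0, 500.0, 10_000.0];
    // Dense regions get narrow buckets; the sparse tail gets a wide one.
    println!("{:?}", equi_depth_bounds(&balances, 4));
}
```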


Compressed Histograms (Handling Duplicates)

Handle the most common values (MCVs) separately:

Data: [1, 1, 1, 1, 1, 2, 3, 4, 5, 10, 11, 12, 13, 14]

Compressed Histogram:
┌─────────────────────────────────────────────────────────────┐
│ Most Common Values (MCV):                                   │
│   Value 1: frequency = 5/14 = 35.7%                        │
├─────────────────────────────────────────────────────────────┤
│ Histogram (for remaining values):                           │
│   Bucket 1: [2-5]   → 4 rows                                │
│   Bucket 2: [6-10]  → 1 row                                 │
│   Bucket 3: [11-14] → 4 rows                                │
└─────────────────────────────────────────────────────────────┘

#[derive(Debug, Clone)]
pub struct CompressedHistogram {
    pub mcv_list: MCVList,
    pub histogram: EquiDepthHistogram,
    pub null_fraction: f64,
}

#[derive(Debug, Clone)]
pub struct MCVList {
    pub values: Vec<(Value, f64)>,  // (value, frequency)
    pub total_mcv_frequency: f64,
}

impl CompressedHistogram {
    pub fn build(values: &[Value], num_buckets: usize, num_mcv: usize) -> Result<Self, HistogramError> {
        // Handle NULLs
        let null_count = values.iter().filter(|v| v.is_null()).count();
        let null_fraction = null_count as f64 / values.len() as f64;

        let non_null_values: Vec<Value> = values.iter().filter(|v| !v.is_null()).cloned().collect();

        // Build MCV list
        let mcv_list = Self::build_mcv_list(&non_null_values, num_mcv);

        // Build histogram for non-MCV values
        let mcv_values: std::collections::HashSet<_> = 
            mcv_list.values.iter().map(|(v, _)| v).collect();
        let non_mcv_values: Vec<_> = non_null_values
            .iter()
            .filter(|v| !mcv_values.contains(v))
            .cloned()
            .collect();

        let histogram = if non_mcv_values.is_empty() {
            // All values are in MCV, create empty histogram
            EquiDepthHistogram {
                num_buckets: 0,
                total_rows: 0,
                buckets: Vec::new(),
            }
        } else {
            EquiDepthHistogram::build(&non_mcv_values, num_buckets)?
        };

        Ok(Self {
            mcv_list,
            histogram,
            null_fraction,
        })
    }

    fn build_mcv_list(values: &[Value], num_mcv: usize) -> MCVList {
        let mut value_counts: std::collections::HashMap<&Value, usize> = 
            std::collections::HashMap::new();

        for value in values {
            *value_counts.entry(value).or_insert(0) += 1;
        }

        let mut mcv: Vec<_> = value_counts.into_iter().collect();
        mcv.sort_by(|a, b| b.1.cmp(&a.1));  // Sort by count descending
        mcv.truncate(num_mcv);              // Keep only the top-k values

        // Frequency covered by the kept MCVs only (not all values!)
        let total_mcv_frequency = mcv.iter().map(|(_, c)| *c as f64).sum::<f64>() / values.len() as f64;

        let mcv_values = mcv
            .into_iter()
            .map(|(v, c)| (v.clone(), c as f64 / values.len() as f64))
            .collect();

        MCVList {
            values: mcv_values,
            total_mcv_frequency,
        }
    }

    /// Estimate selectivity with MCV awareness
    pub fn estimate_selectivity(&self, op: &BinaryOperator, value: &Value) -> f64 {
        // Check MCV first
        for (mcv_val, frequency) in &self.mcv_list.values {
            if mcv_val == value {
                match op {
                    BinaryOperator::Eq => return *frequency,
                    BinaryOperator::Neq => return 1.0 - frequency,
                    // A range op on an exact MCV match doesn't help much;
                    // fall through to histogram estimation below.
                    _ => break,
                }
            }
        }

        // Not in MCV, use histogram
        let histogram_selectivity = match op {
            BinaryOperator::Eq => self.histogram.estimate_eq(value),
            BinaryOperator::Gt => self.histogram.estimate_gt(value),
            BinaryOperator::Lt => self.histogram.estimate_lt(value),
            BinaryOperator::Gte => self.histogram.estimate_gte(value),
            BinaryOperator::Lte => 1.0 - self.histogram.estimate_gt(value),
            _ => 0.5,
        };

        // Adjust for non-MCV portion
        let non_mcv_fraction = 1.0 - self.mcv_list.total_mcv_frequency;
        histogram_selectivity * non_mcv_fraction
    }
}

Pros: the best of both worlds: accurate for common values, with good distribution coverage for the rest.

Cons: more complex, and uses more memory.
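The MCV side of a compressed histogram can be sketched standalone over `i64` values (a simplified, hypothetical stand-in for `build_mcv_list` above):

```rust
use std::collections::HashMap;

// Pull the top-k most common values out of a column sample;
// only the remainder would go into the histogram.
fn top_k_mcv(values: &[i64], k: usize) -> Vec<(i64, f64)> {
    let mut counts: HashMap<i64, usize> = HashMap::new();
    for &v in values {
        *counts.entry(v).or_insert(0) += 1;
    }

    let mut mcv: Vec<_> = counts.into_iter().collect();
    // Sort by count descending, then by value for deterministic output
    mcv.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    mcv.truncate(k);

    let n = values.len() as f64;
    mcv.into_iter().map(|(v, c)| (v, c as f64 / n)).collect()
}

fn main() {
    // Same data as the diagram above: value 1 dominates
    let data = [1, 1, 1, 1, 1, 2, 3, 4, 5, 10, 11, 12, 13, 14];
    // Value 1 with frequency 5/14 ≈ 0.357
    println!("{:?}", top_k_mcv(&data, 1));
}
```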


3 Sampling for Large Tables

The Problem: Full Table Scans Are Expensive

-- Table: users (100 million rows)
-- Building histogram requires sorting all values

-- Full scan approach:
SELECT balance FROM users ORDER BY balance;
-- Time: 30 minutes (!)
-- I/O: Read entire table

-- Not practical for routine ANALYZE

The Solution: Statistical Sampling

// src/optimizer/sampling.rs
use rand::seq::SliceRandom;
use rand::{thread_rng, Rng};  // Rng brings gen_range into scope

pub struct Sampler {
    sample_size: usize,
    confidence: f64,
}

impl Sampler {
    pub fn new(sample_size: usize, confidence: f64) -> Self {
        Self { sample_size, confidence }
    }

    /// Reservoir sampling for streaming data
    pub fn reservoir_sample<T: Clone>(
        &self,
        stream: impl Iterator<Item = T>,
    ) -> Vec<T> {
        let mut reservoir = Vec::with_capacity(self.sample_size);
        let mut count = 0;

        for item in stream {
            count += 1;

            if reservoir.len() < self.sample_size {
                reservoir.push(item);
            } else {
                // Replace with probability sample_size / count
                let j = thread_rng().gen_range(0..count);
                if j < self.sample_size {
                    reservoir[j] = item;
                }
            }
        }

        reservoir
    }

    /// Simple random sampling for in-memory data
    pub fn simple_random_sample<T: Clone>(&self, values: &[T]) -> Vec<T> {
        let mut rng = thread_rng();
        let mut sample: Vec<T> = values.to_vec();
        sample.shuffle(&mut rng);
        sample.truncate(self.sample_size);
        sample
    }

    /// Calculate confidence interval for estimate
    pub fn confidence_interval(&self, sample_proportion: f64, sample_size: usize) -> (f64, f64) {
        // z-score for the confidence level (float literal patterns in
        // `match` are deprecated, so use comparisons)
        let z = if self.confidence >= 0.99 {
            2.576
        } else if self.confidence >= 0.95 {
            1.96
        } else if self.confidence >= 0.90 {
            1.645
        } else {
            1.96
        };

        let standard_error = (sample_proportion * (1.0 - sample_proportion) / sample_size as f64).sqrt();
        let margin = z * standard_error;

        (sample_proportion - margin, sample_proportion + margin)
    }
}

// Usage in histogram building
impl EquiDepthHistogram {
    pub fn build_from_sample(
        values: &[Value],
        num_buckets: usize,
        sample_size: usize,
    ) -> Result<Self, HistogramError> {
        let sampler = Sampler::new(sample_size, 0.95);
        let sample = sampler.simple_random_sample(values);

        // Build histogram from sample
        let mut histogram = Self::build(&sample, num_buckets)?;

        // Scale row counts to match full table
        let scale_factor = values.len() as f64 / sample.len() as f64;
        for bucket in &mut histogram.buckets {
            bucket.row_count = (bucket.row_count as f64 * scale_factor).round() as u64;
            bucket.cumulative_count = (bucket.cumulative_count as f64 * scale_factor).round() as u64;
        }
        histogram.total_rows = values.len() as u64;

        Ok(histogram)
    }
}

Sample Size Guidelines

Table Size        Suggested Sample       Accuracy (95% CI)
< 10K rows        100% (full scan)       Exact
10K-1M rows       10%                    ±1%
1M-100M rows      1%                     ±0.1%
> 100M rows       0.1% or 100K rows      ±0.3%
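These margins come from the `confidence_interval` formula above. A standalone check at the worst case p = 0.5:

```rust
// 95% CI margin for an estimated proportion (z = 1.96),
// worst case p = 0.5, as in `confidence_interval` above.
fn margin_95(p: f64, n: usize) -> f64 {
    1.96 * (p * (1.0 - p) / n as f64).sqrt()
}

fn main() {
    // A 10,000-row sample bounds the error at roughly ±1%
    println!("n=10_000  → ±{:.2}%", 100.0 * margin_95(0.5, 10_000));
    // A 100,000-row sample gets near ±0.3%, regardless of table size
    println!("n=100_000 → ±{:.2}%", 100.0 * margin_95(0.5, 100_000));
}
```

Note the margin depends only on the sample size, not the table size, which is why a fixed 100K-row sample works even for billion-row tables.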

4 Multi-Column Statistics

The Problem: Column Correlation

-- Table: users (1,000,000 rows)
-- Columns: country, city

-- Individual column statistics:
-- country: 'US' = 50%, 'UK' = 30%, 'DE' = 20%
-- city: 'London' = 5%, 'Berlin' = 10%, 'Munich' = 5%

-- Query:
SELECT * FROM users WHERE country = 'UK' AND city = 'London'

-- Optimizer (assuming independence):
-- Selectivity = 0.30 × 0.05 = 0.015 = 1.5%
-- Estimated rows: 15,000

-- Reality (London only exists in UK):
-- Actual selectivity = 5% (all London users are in UK)
-- Actual rows: 50,000

-- Error: 3.3x underestimate!
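The arithmetic of the independence assumption is easy to verify standalone:

```rust
// The independence assumption multiplies per-column selectivities;
// correlated columns break it. Numbers from the example above.
fn independent_selectivity(column_selectivities: &[f64]) -> f64 {
    column_selectivities.iter().product()
}

fn main() {
    let total_rows = 1_000_000.0;
    // country = 'UK' (30%) AND city = 'London' (5%), assumed independent:
    let est = independent_selectivity(&[0.30, 0.05]);
    println!("estimated rows: {:.0}", est * total_rows); // ~15,000
    // Reality: London implies UK, so the true selectivity is just
    // P(city = 'London') = 5%, i.e. 50,000 rows, a 3.3x underestimate.
}
```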

Multi-Column Histograms

#[derive(Debug, Clone)]
pub struct MultiColumnHistogram {
    pub columns: Vec<String>,
    pub num_buckets: usize,
    pub total_rows: u64,
    pub buckets: Vec<MultiColumnBucket>,
}

#[derive(Debug, Clone)]
pub struct MultiColumnBucket {
    pub bounds: Vec<(Value, Value)>,  // (min, max) for each column
    pub row_count: u64,
    pub distinct_count: u64,
}

impl MultiColumnHistogram {
    pub fn build(
        data: &[Vec<Value>],  // Each row is [col1_value, col2_value, ...]
        columns: Vec<String>,
        num_buckets: usize,
    ) -> Result<Self, HistogramError> {
        if data.is_empty() {
            return Err(HistogramError::EmptyData);
        }

        // Use k-d tree style partitioning for multi-dimensional buckets
        let buckets = Self::build_kd_histogram(data, num_buckets)?;

        Ok(Self {
            columns,
            num_buckets,
            total_rows: data.len() as u64,
            buckets,
        })
    }

    fn build_kd_histogram(
        data: &[Vec<Value>],
        num_buckets: usize,
    ) -> Result<Vec<MultiColumnBucket>, HistogramError> {
        // Simplified k-d-tree-style partitioning via recursive median splits
        let target_bucket_size = data.len() / num_buckets;

        let mut buckets = Vec::new();
        Self::partition_data(data, 0, num_buckets, &mut buckets, target_bucket_size)?;

        Ok(buckets)
    }

    fn partition_data(
        data: &[Vec<Value>],
        depth: usize,
        remaining_buckets: usize,
        buckets: &mut Vec<MultiColumnBucket>,
        target_size: usize,
    ) -> Result<(), HistogramError> {
        if remaining_buckets == 1 || data.len() <= target_size {
            // Create leaf bucket
            let bucket = Self::create_bucket(data)?;
            buckets.push(bucket);
            return Ok(());
        }

        // Split dimensions round-robin by depth (a full k-d tree would
        // choose the dimension with the highest variance)
        let dim = depth % data[0].len();
        let mut sorted = data.to_vec();
        sorted.sort_by(|a, b| a[dim].cmp(&b[dim]));

        let mid = sorted.len() / 2;
        let (left, right) = sorted.split_at(mid);

        Self::partition_data(left, depth + 1, remaining_buckets / 2, buckets, target_size)?;
        Self::partition_data(right, depth + 1, remaining_buckets - remaining_buckets / 2, buckets, target_size)?;

        Ok(())
    }

    fn create_bucket(data: &[Vec<Value>]) -> Result<MultiColumnBucket, HistogramError> {
        let num_dims = data[0].len();
        let mut bounds = Vec::new();

        for dim in 0..num_dims {
            let mut values: Vec<_> = data.iter().map(|row| &row[dim]).collect();
            values.sort();
            bounds.push(((*values.first().unwrap()).clone(), (*values.last().unwrap()).clone()));
        }

        let distinct_count = data.iter().collect::<std::collections::HashSet<_>>().len() as u64;

        Ok(MultiColumnBucket {
            bounds,
            row_count: data.len() as u64,
            distinct_count,
        })
    }

    pub fn estimate_selectivity(&self, predicates: &[(usize, BinaryOperator, Value)]) -> f64 {
        let mut selectivity = 0.0;

        for bucket in &self.buckets {
            let bucket_selectivity = self.estimate_bucket_selectivity(bucket, predicates);
            selectivity += bucket_selectivity * (bucket.row_count as f64 / self.total_rows as f64);
        }

        selectivity
    }

    fn estimate_bucket_selectivity(
        &self,
        bucket: &MultiColumnBucket,
        predicates: &[(usize, BinaryOperator, Value)],
    ) -> f64 {
        let mut bucket_selectivity = 1.0;

        for (col_idx, op, value) in predicates {
            if *col_idx >= bucket.bounds.len() {
                continue;
            }

            let (min, max) = &bucket.bounds[*col_idx];

            let col_selectivity = match op {
                BinaryOperator::Eq => {
                    if value >= min && value <= max {
                        1.0 / bucket.distinct_count.max(1) as f64
                    } else {
                        0.0
                    }
                }
                BinaryOperator::Gt => {
                    if value < min {
                        1.0
                    } else if value >= max {
                        0.0
                    } else {
                        let range = max.subtract(min).to_f64();
                        let matching = max.subtract(value).to_f64();
                        (matching / range).clamp(0.0, 1.0)
                    }
                }
                // ... handle other operators
                _ => 0.5,
            };

            bucket_selectivity *= col_selectivity;
        }

        bucket_selectivity
    }
}

When to use multi-column statistics:

Scenario                                 Recommendation
Columns are independent                  Single-column histograms
Strong correlation (city → country)      Multi-column histogram
Frequently used together in WHERE        Multi-column histogram
High-cardinality combinations            May not be worth it

5 Using Histograms in the Optimizer

Integration with the Cost Model

// src/optimizer/histogram_selectivity.rs
use std::sync::Arc;

pub struct HistogramSelectivityEstimator {
    statistics: Arc<StatisticsCatalog>,
}

impl HistogramSelectivityEstimator {
    pub fn new(statistics: Arc<StatisticsCatalog>) -> Self {
        Self { statistics }
    }

    pub fn estimate(&self, table: &str, column: &str, expr: &Expression) -> f64 {
        let table_stats = match self.statistics.get_table(table) {
            Some(stats) => stats,
            None => return 0.5,  // No stats, use default
        };

        let col_stats = match table_stats.columns.get(column) {
            Some(stats) => stats,
            None => return 0.5,
        };

        match &col_stats.histogram {
            Some(Histogram::Compressed(comp)) => {
                match expr {
                    Expression::BinaryOp { op, right, .. } => {
                        comp.estimate_selectivity(op, right)
                    }
                    Expression::InList { list, .. } => {
                        // Sum selectivity for each value
                        list.iter()
                            .map(|v| comp.estimate_selectivity(&BinaryOperator::Eq, v))
                            .sum()
                    }
                    Expression::Between { low, high, .. } => {
                        let low_sel = comp.estimate_selectivity(&BinaryOperator::Gte, low);
                        let high_sel = comp.estimate_selectivity(&BinaryOperator::Lte, high);
                        (low_sel + high_sel - 1.0).max(0.0)
                    }
                    _ => 0.5,
                }
            }
            Some(Histogram::EquiDepth(equi)) => {
                match expr {
                    Expression::BinaryOp { op, right, .. } => {
                        match op {
                            BinaryOperator::Eq => equi.estimate_eq(right),
                            BinaryOperator::Gt => equi.estimate_gt(right),
                            BinaryOperator::Lt => equi.estimate_lt(right),
                            BinaryOperator::Gte => equi.estimate_gte(right),
                            BinaryOperator::Lte => 1.0 - equi.estimate_gt(right),
                            _ => 0.5,
                        }
                    }
                    _ => 0.5,
                }
            }
            None => {
                // No histogram, fall back to MCV or default
                self.estimate_from_mcv(col_stats, expr)
            }
        }
    }

    fn estimate_from_mcv(&self, col_stats: &ColumnStatistics, expr: &Expression) -> f64 {
        match expr {
            Expression::BinaryOp { op: BinaryOperator::Eq, right, .. } => {
                // Check if value is in MCV list
                for (mcv_val, frequency) in &col_stats.most_common_values {
                    if mcv_val == right {
                        return *frequency;
                    }
                }
                // Not in MCV, estimate based on distinct count
                1.0 / col_stats.distinct_count.max(1) as f64
            }
            _ => 0.5,  // Default for other operators
        }
    }
}

// Updated cost model using histograms
impl CostModel {
    pub fn estimate_selectivity_with_histograms(
        &self,
        estimator: &HistogramSelectivityEstimator,
        table: &str,
        expr: &Expression,
    ) -> f64 {
        match expr {
            Expression::BinaryOp { op, left, right, .. } => {
                // Find which side is a column reference
                if let Expression::Identifier(ident) = left.as_ref() {
                    return estimator.estimate(table, &ident.value, &Expression::BinaryOp {
                        op: op.clone(),
                        left: Box::new(Expression::Identifier(ident.clone())),
                        right: right.clone(),
                    });
                }
                // Default estimate
                0.5
            }
            Expression::InList { expr, list, negated } => {
                // Resolve the column side of the IN predicate
                let column = match expr.as_ref() {
                    Expression::Identifier(ident) => ident.value.clone(),
                    _ => return 0.5,
                };
                let base_selectivity = list.iter()
                    .map(|v| estimator.estimate(table, &column, v))
                    .sum::<f64>()
                    .min(1.0);
                if *negated { 1.0 - base_selectivity } else { base_selectivity }
            }
            _ => 0.5,
        }
    }
}

Example: Optimizing a Real Query

-- Table statistics (from ANALYZE):
-- users: 1,000,000 rows
-- users.balance: histogram shows 80% have balance < $100

-- Query:
SELECT * FROM users WHERE balance > 100

-- Without histogram:
-- Selectivity estimate: 0.33 (fixed guess)
-- Estimated rows: 333,333
-- Chosen plan: SeqScan (cost: 5000)

-- With histogram:
-- Selectivity estimate: 0.20 (from histogram)
-- Estimated rows: 200,000
-- Chosen plan: IndexScan (cost: 2000) ← Better!

6 Histogram Maintenance

When to Update Statistics

// src/optimizer/stats_maintenance.rs
use std::collections::HashMap;
use std::sync::Arc;

pub struct StatsMaintenancePolicy {
    auto_analyze_threshold: u64,  // Rows modified before auto-analyze
    auto_analyze_scale_factor: f64,  // Fraction of table
    last_analyze_threshold: chrono::Duration,  // Max time since last analyze
}

impl Default for StatsMaintenancePolicy {
    fn default() -> Self {
        Self {
            auto_analyze_threshold: 50,  // PostgreSQL default
            auto_analyze_scale_factor: 0.2,  // 20% of table (PostgreSQL's analyze default is 0.1)
            last_analyze_threshold: chrono::Duration::days(7),
        }
    }
}

pub struct StatsMaintenanceManager {
    catalog: Arc<Catalog>,
    policy: StatsMaintenancePolicy,
    modification_counts: HashMap<String, u64>,  // table → rows modified
}

impl StatsMaintenanceManager {
    pub fn check_and_analyze(&mut self, table: &str) -> Result<bool, AnalyzerError> {
        let table_stats = self.catalog.get_table_statistics(table);
        let modified_rows = self.modification_counts.get(table).copied().unwrap_or(0);

        let needs_analyze = if let Some(stats) = table_stats {
            let threshold = self.policy.auto_analyze_threshold
                + (stats.row_count as f64 * self.policy.auto_analyze_scale_factor) as u64;
            modified_rows >= threshold
        } else {
            true  // No stats, need initial analyze
        };

        if needs_analyze {
            self.analyze_table(table)?;
            self.modification_counts.insert(table.to_string(), 0);
            Ok(true)
        } else {
            Ok(false)
        }
    }

    pub fn record_modification(&mut self, table: &str, rows_affected: u64) {
        let count = self.modification_counts.entry(table.to_string()).or_insert(0);
        *count += rows_affected;
    }

    fn analyze_table(&self, table: &str) -> Result<(), AnalyzerError> {
        // Run ANALYZE on the table
        let analyzer = StatisticsAnalyzer::new(self.catalog.clone());
        let stats = analyzer.analyze_table(table)?;
        self.catalog.store_statistics(&stats)?;
        Ok(())
    }
}

Incremental Histogram Updates

The problem: full rebuilds are expensive for large tables.

The solution: incremental updates for small changes.

impl EquiDepthHistogram {
    /// Update histogram in place for small batches of changes.
    /// Returns true if the caller should schedule a full rebuild.
    pub fn update(&mut self, new_values: &[Value], deleted_values: &[Value]) -> bool {
        // Update total row count
        self.total_rows += new_values.len() as u64;
        self.total_rows = self.total_rows.saturating_sub(deleted_values.len() as u64);

        // For each new value, find its bucket and increment counts.
        // cumulative_count means "rows <= upper_bound", so every bucket
        // from the matching one onward must grow by one.
        for value in new_values {
            if let Some(idx) = self
                .buckets
                .iter()
                .position(|b| value >= &b.lower_bound && value <= &b.upper_bound)
            {
                self.buckets[idx].row_count += 1;
                for bucket in &mut self.buckets[idx..] {
                    bucket.cumulative_count += 1;
                }
            }
        }

        // For deleted values, decrement counts the same way
        for value in deleted_values {
            if let Some(idx) = self
                .buckets
                .iter()
                .position(|b| value >= &b.lower_bound && value <= &b.upper_bound)
            {
                self.buckets[idx].row_count = self.buckets[idx].row_count.saturating_sub(1);
                for bucket in &mut self.buckets[idx..] {
                    bucket.cumulative_count = bucket.cumulative_count.saturating_sub(1);
                }
            }
        }

        // If more than 10% of rows changed, ask for a full rebuild
        let total_changes = (new_values.len() + deleted_values.len()) as f64;
        total_changes / self.total_rows.max(1) as f64 > 0.1
    }
}
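The same idea can be sketched standalone over `f64` buckets (a hypothetical `Bucket` type, not the article's `DepthBucket`), which makes the rebuild signal explicit:

```rust
// Incremental update sketch: bump bucket counts for a small insert
// batch, and signal a full rebuild once churn exceeds 10%.
struct Bucket {
    lo: f64,
    hi: f64,
    rows: u64,
}

fn apply_inserts(buckets: &mut [Bucket], total_rows: &mut u64, inserts: &[f64]) -> bool {
    for &v in inserts {
        if let Some(b) = buckets.iter_mut().find(|b| v >= b.lo && v <= b.hi) {
            b.rows += 1;
        }
        *total_rows += 1;
    }
    // Caller schedules a full ANALYZE when the batch is >10% of the table
    inserts.len() as f64 / (*total_rows).max(1) as f64 > 0.1
}

fn main() {
    let mut buckets = vec![
        Bucket { lo: 0.0, hi: 100.0, rows: 80 },
        Bucket { lo: 100.0, hi: 1000.0, rows: 20 },
    ];
    let mut total_rows = 100u64;

    // Small batch: counts updated in place, no rebuild needed
    let rebuild = apply_inserts(&mut buckets, &mut total_rows, &[50.0, 250.0]);
    println!("rebuild needed: {rebuild}, bucket 0 rows: {}", buckets[0].rows);
}
```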

7 Challenges of Building in Rust

Challenge 1: Value Comparison

The problem: SQL values come in different types (int, float, string, date).

// ❌ Doesn't work - can't compare different types
pub enum Value {
    Integer(i64),
    Float(f64),
    String(String),
    Date(chrono::NaiveDate),
}

impl Ord for Value {
    fn cmp(&self, other: &Self) -> Ordering {
        // Can't compare Integer with String!
    }
}

The Solution: Type-Aware Comparison

// ✅ Works - handle type mismatches
impl Value {
    pub fn compare(&self, other: &Self) -> Result<Ordering, ValueError> {
        match (self, other) {
            (Value::Integer(a), Value::Integer(b)) => Ok(a.cmp(b)),
            (Value::Float(a), Value::Float(b)) => a.partial_cmp(b).ok_or(ValueError::NaN),
            (Value::String(a), Value::String(b)) => Ok(a.cmp(b)),
            (Value::Integer(a), Value::Float(b)) => (*a as f64).partial_cmp(b).ok_or(ValueError::NaN),
            (Value::Float(a), Value::Integer(b)) => a.partial_cmp(&(*b as f64)).ok_or(ValueError::NaN),
            _ => Err(ValueError::TypeMismatch),
        }
    }
}
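A standalone, runnable version of the same pattern (only the `Integer` and `Float` variants, with `Option<Ordering>` standing in for the `Result`):

```rust
use std::cmp::Ordering;

// Minimal stand-in for the Value enum above
#[derive(Debug)]
enum Value {
    Integer(i64),
    Float(f64),
}

fn compare(a: &Value, b: &Value) -> Option<Ordering> {
    match (a, b) {
        (Value::Integer(x), Value::Integer(y)) => Some(x.cmp(y)),
        (Value::Float(x), Value::Float(y)) => x.partial_cmp(y),
        // Cross-type comparisons promote the integer to f64
        (Value::Integer(x), Value::Float(y)) => (*x as f64).partial_cmp(y),
        (Value::Float(x), Value::Integer(y)) => x.partial_cmp(&(*y as f64)),
    }
}

fn main() {
    assert_eq!(compare(&Value::Integer(2), &Value::Float(2.5)), Some(Ordering::Less));
    // NaN yields None instead of panicking
    assert_eq!(compare(&Value::Float(f64::NAN), &Value::Float(1.0)), None);
    println!("comparisons ok");
}
```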

Challenge 2: Memory Usage

The problem: histograms across many columns can consume significant memory.

// ❌ Memory explosion
// 1000 tables × 20 columns × 100 buckets × 32 bytes = 64 MB

The Solution: Compressed Bucket Boundaries

// ✅ Compressed representation
pub struct CompressedBucket {
    pub lower_bound_idx: u32,  // Index into shared value pool
    pub upper_bound_idx: u32,
    pub row_count: u32,
}

pub struct CompactHistogram {  // renamed to avoid clashing with the MCV CompressedHistogram above
    shared_values: Vec<Value>,  // Deduplicated values
    buckets: Vec<CompressedBucket>,
}
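A quick `size_of` check confirms the packed layout: three `u32` fields occupy 12 bytes per bucket, so the 64 MB estimate from the comment above drops to roughly 24 MB:

```rust
// Packed bucket: u32 indices into a shared value pool instead of owned bounds
#[allow(dead_code)]
struct CompressedBucket {
    lower_bound_idx: u32,
    upper_bound_idx: u32,
    row_count: u32,
}

fn stats_memory_bytes(tables: u64, columns: u64, buckets: u64) -> u64 {
    tables * columns * buckets * std::mem::size_of::<CompressedBucket>() as u64
}

fn main() {
    // Three u32 fields, 4-byte alignment → 12 bytes, no padding
    assert_eq!(std::mem::size_of::<CompressedBucket>(), 12);
    // 1000 tables × 20 columns × 100 buckets → 24,000,000 bytes
    println!("{} bytes", stats_memory_bytes(1000, 20, 100));
}
```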

Challenge 3: Thread-Safe Statistics

The problem: statistics must be readable during query planning while ANALYZE updates them concurrently.

// ❌ Race condition
pub struct StatisticsCatalog {
    tables: HashMap<String, TableStatistics>,
}

The Solution: Arc<RwLock<>> for Concurrent Access

// ✅ Thread-safe
pub struct StatisticsCatalog {
    tables: Arc<RwLock<HashMap<String, Arc<TableStatistics>>>>,
}

impl StatisticsCatalog {
    pub fn get_table(&self, name: &str) -> Option<Arc<TableStatistics>> {
        let tables = self.tables.read().unwrap();
        tables.get(name).cloned()
    }

    pub fn update_table(&self, stats: TableStatistics) {
        let mut tables = self.tables.write().unwrap();
        tables.insert(stats.table_name.clone(), Arc::new(stats));
    }
}
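A standalone sketch of the same pattern (with `Arc<String>` standing in for `Arc<TableStatistics>`): planner threads clone the `Arc` under a short read lock, so a snapshot stays valid even if ANALYZE swaps in new statistics underneath them:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

// Stand-in for Arc<TableStatistics>
type Stats = Arc<String>;
type Catalog = RwLock<HashMap<String, Stats>>;

fn get_stats(catalog: &Catalog, name: &str) -> Option<Stats> {
    // Clone the Arc under the read lock, then release the lock immediately;
    // the caller keeps a consistent snapshot even if it gets replaced.
    catalog.read().unwrap().get(name).cloned()
}

fn put_stats(catalog: &Catalog, name: &str, stats: Stats) {
    catalog.write().unwrap().insert(name.to_string(), stats);
}

fn main() {
    let catalog = Arc::new(RwLock::new(HashMap::new()));
    put_stats(&catalog, "users", Arc::new("analyze v1".to_string()));

    // A planner thread reads while the main thread runs a new ANALYZE
    let reader = {
        let catalog = Arc::clone(&catalog);
        thread::spawn(move || get_stats(&catalog, "users").unwrap())
    };
    put_stats(&catalog, "users", Arc::new("analyze v2".to_string()));

    // Either version is a consistent snapshot; nothing is torn
    let snapshot = reader.join().unwrap();
    assert!(snapshot.starts_with("analyze v"));
    println!("reader saw: {snapshot}");
}
```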

8 How AI Accelerated This Work

What AI Got Right

Task                   AI Contribution
Histogram types        Explained equi-width vs. equi-depth trade-offs
MCV handling           Suggested a separate MCV list for common values
Sampling               Reservoir sampling algorithm for streaming data
Selectivity formulas   Correct formulas for within-bucket range estimation

What AI Got Wrong

Problem                     What Happened
Bucket boundaries           First draft had off-by-one errors in boundary calculations
Multi-column correlation    Initially assumed independence (wrong for correlated columns)
NULL handling               Forgot to track the NULL fraction separately
Incremental updates         Suggested a full rebuild for every change (too expensive)

The pattern: AI handles the textbook cases well. The edge cases (boundaries, NULLs, incremental updates) needed manual refinement.


Example: Debugging Selectivity

The question I asked the AI:

"The histogram shows 20% of rows have balance > 100, but the optimizer estimates 5%. Why?"

What I learned:

  1. The MCV list wasn't being checked first
  2. The non-MCV fraction adjustment was missing
  3. The within-bucket range estimate was inverted

The result: a fixed selectivity estimate:

pub fn estimate_selectivity(&self, op: &BinaryOperator, value: &Value) -> f64 {
    // Check MCV first
    for (mcv_val, frequency) in &self.mcv_list.values {
        if mcv_val == value {
            match op {
                BinaryOperator::Eq => return *frequency,
                BinaryOperator::Neq => return 1.0 - frequency,
                _ => break,  // Range ops fall through to the histogram
            }
        }
    }

    // Use histogram for non-MCV values
    let histogram_sel = self.histogram.estimate_gt(value);
    let non_mcv_fraction = 1.0 - self.mcv_list.total_mcv_frequency;
    histogram_sel * non_mcv_fraction  // ← This factor was missing!
}

Summary: Histogram Statistics in One Diagram

flowchart TD
    subgraph "Data Collection"
        A[Table Data] --> B{Sample or Full?}
        B -->|Small table| C[Full Scan]
        B -->|Large table| D[Reservoir Sampling]
    end
    subgraph "Histogram Building"
        C --> E[Sort Values]
        D --> E
        E --> F{Build Type?}
        F -->|Uniform data| G[Equi-Width]
        F -->|Skewed data| H[Equi-Depth]
        F -->|With duplicates| I[Compressed + MCV]
    end
    subgraph "Multi-Column"
        J[Correlated Columns] --> K[Multi-Dimensional Buckets]
    end
    subgraph "Usage in Optimizer"
        H --> L[Selectivity Estimation]
        I --> L
        K --> L
        L --> M[Cost Calculation]
        M --> N[Plan Selection]
    end
    subgraph "Maintenance"
        O[INSERT/UPDATE/DELETE] --> P[Track Modifications]
        P --> Q{Threshold Reached?}
        Q -->|Yes| R[Auto-ANALYZE]
        Q -->|No| S[Incremental Update]
        R --> E
    end
    style A fill:#e3f2fd,stroke:#1976d2
    style N fill:#e8f5e9,stroke:#388e3c
    style L fill:#fff3e0,stroke:#f57c00

Key takeaways:

Concept                   Why It Matters
Equi-depth histograms     Better than equi-width for skewed data
MCV lists                 Accurate estimates for common values
Sampling                  Makes ANALYZE practical on large tables
Multi-column statistics   Capture column correlations
Incremental updates       Avoid full rebuilds for small changes
Thread-safe access        Allows concurrent reads while statistics are updated
