Building a simple Query Optimizer from scratch: Part 4

Cost Modeling and Cardinality Estimation

This series

Code for this series

This is Part 4 in a 6-part series on building a query optimizer from scratch in Java. Part 3 built the rule engine and optimization rules.

In the previous post, we built a rule engine that transforms plans using structural pattern matching. Predicate pushdown, projection pushdown, and filter merge are all "always good" rules - they produce equivalent plans that are never worse and usually better. But the join reorder rule was different. It needed to compare two alternative plans and pick the cheaper one. That comparison requires a cost model.

A cost model answers the question: given this plan, roughly how expensive will it be to execute? The answer depends on two things - how many rows flow through each operator (cardinality estimation) and how much work each operator does per row (cost formulas). Together, they let the optimizer reason quantitatively about plan quality instead of relying solely on structural heuristics.

CostConfig: the knobs

Before looking at cost formulas, let's look at the constants they use. These are collected in a CostConfig object that makes them easy to tune:

class CostConfig {
    public double PAGE_COST = 1.0;        // Cost to read one page from disk
    public double TUPLE_COST = 0.01;      // Cost to process one tuple in memory
    public int PAGE_SIZE = 100;           // Tuples per page
    public double COMPARISON_COST = 0.001; // Cost of one comparison operation
    public double HASH_COST = 0.005;      // Cost to hash one tuple
}

Each constant models a different kind of work. PAGE_COST represents I/O - reading a page from disk (or from the buffer pool). TUPLE_COST is the CPU cost of touching a row. COMPARISON_COST is the cost of evaluating a predicate or join condition against a row. HASH_COST is the cost of hashing a tuple into a hash table (relevant for hash joins and hash aggregations).

The default values are somewhat arbitrary - they're rough ratios, not calibrated measurements. But their relative magnitudes are what matter. PAGE_COST being 100x TUPLE_COST says "I/O is much more expensive than CPU work," which is the right relationship for disk-based databases. On fast SSDs, you might shrink that ratio. On spinning disks, you might widen it.

Making these configurable isn't just about matching your hardware - it's a teaching tool. You can change PAGE_COST and watch the optimizer change its plan choices. We'll see an example of this at the end of the post.

SimpleCostModel: cost formulas

The SimpleCostModel implements the CostModel interface by dispatching on operator type:

public class SimpleCostModel implements CostModel {
    private final Catalog catalog;
    private final CostConfig costConfig;
    private final CardinalityEstimator cardinalityEstimator;

    public SimpleCostModel(Catalog catalog, CostConfig costConfig) {
        this.catalog = catalog;
        this.costConfig = costConfig;
        this.cardinalityEstimator = new CardinalityEstimator(catalog);
    }

    @Override
    public double estimate(LogicalNode node) {
        long cardinality = estimateCardinality(node);
        node.setEstimatedRows(cardinality);

        return switch (node) {
            case LogicalScan scan       -> estimateScanCost(scan);
            case LogicalFilter filter   -> estimateFilterCost(filter);
            case LogicalProject project -> estimateProjectCost(project);
            case LogicalJoin join       -> estimateJoinCost(join);
            case LogicalAggregate agg   -> estimateAggregateCost(agg);
            default -> 1000000.0;
        };
    }
}

Notice that estimate first computes cardinality and annotates the node with it. This means by the time the cost formulas run, every node already knows its estimated row count. The formulas are cumulative - each includes the cost of producing its input plus its own local work.

Scan cost

A table scan reads every page from disk and processes every row:

private double estimateScanCost(LogicalScan scan) {
    TableMetadata table = catalog.getTableMetadata(scan.getTableName());
    long rows = table.getRowCount();
    long pages = (rows + costConfig.PAGE_SIZE - 1) / costConfig.PAGE_SIZE;

    double ioCost = pages * costConfig.PAGE_COST;
    double cpuCost = rows * costConfig.TUPLE_COST;

    return ioCost + cpuCost;
}

The page count is computed by dividing the row count by PAGE_SIZE and rounding up. For a table with 1000 rows and PAGE_SIZE = 100, that's 10 pages. I/O cost is 10 * 1.0 = 10.0. CPU cost is 1000 * 0.01 = 10.0. Total: 20.0.

This is the simplest formula in the system. It models a sequential scan - every page, every row. If we added index support, an index scan would have a different formula: fewer pages read (only the ones containing matching rows), but potentially random I/O (more expensive per page).
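For a sense of what that would look like - a hypothetical sketch, not part of our optimizer, with RANDOM_PAGE_COST as an invented constant - an index scan formula might charge a pricier random read, but only for pages containing matching rows:

```java
// Hypothetical sketch of index scan costing. Not part of the series' code -
// our system only has sequential scans. RANDOM_PAGE_COST is an invented knob.
public class IndexScanCostSketch {
    static final double RANDOM_PAGE_COST = 4.0; // random I/O, pricier than PAGE_COST = 1.0
    static final double TUPLE_COST = 0.01;
    static final int PAGE_SIZE = 100;

    // selectivity: fraction of rows the index lookup is expected to match
    static double estimateIndexScanCost(long tableRows, double selectivity) {
        long matchingRows = Math.max(1, (long) (tableRows * selectivity));
        // Assume matching rows are scattered: roughly one random page fetch per
        // matching row, capped at the table's total page count.
        long totalPages = (tableRows + PAGE_SIZE - 1) / PAGE_SIZE;
        long pagesRead = Math.min(totalPages, matchingRows);
        return pagesRead * RANDOM_PAGE_COST + matchingRows * TUPLE_COST;
    }

    public static void main(String[] args) {
        // 1000-row table. At 0.1% selectivity the index scan reads ~1 random page
        // and beats the sequential scan's 20.0; at 1% it reads ~10 random pages
        // and loses. Arbitrating exactly this trade-off is the cost model's job.
        System.out.println(estimateIndexScanCost(1000, 0.001));
        System.out.println(estimateIndexScanCost(1000, 0.01));
    }
}
```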

Filter cost

Filtering reads its input and evaluates a predicate against every incoming row:

private double estimateFilterCost(LogicalFilter filter) {
    double childCost = estimate(filter.getChild());
    long inputRows = filter.getChild().getEstimatedRows();
    double filterCost = inputRows * costConfig.COMPARISON_COST;

    return childCost + filterCost;
}

The cost is the child's cost (producing the input) plus the cost of evaluating the predicate once per input row. The key thing to notice: the filter's own output cardinality doesn't appear in its cost formula. It doesn't matter how many rows pass the filter - what matters is how many rows the filter has to examine. But the output cardinality matters a lot to the parent operator, which is why cardinality estimation is so important.

Project cost

Projection copies selected columns from each input row:

private double estimateProjectCost(LogicalProject project) {
    double childCost = estimate(project.getChild());
    long inputRows = project.getChild().getEstimatedRows();
    double projectCost = inputRows * costConfig.TUPLE_COST;

    return childCost + projectCost;
}

This is a cheap operation - just TUPLE_COST per row. In a real system with column-oriented storage, projection can be nearly free if the selected columns are stored contiguously. In a row-store, it involves copying fields out of each row.

Join cost

This is where costs get interesting. We model nested loop join cost:

private double estimateJoinCost(LogicalJoin join) {
    double leftCost = estimate(join.getLeft());
    double rightCost = estimate(join.getRight());
    long leftRows = join.getLeft().getEstimatedRows();
    long rightRows = join.getRight().getEstimatedRows();

    double joinCost = leftRows * (rightRows * costConfig.COMPARISON_COST);

    return leftCost + rightCost + joinCost;
}

The formula: produce both inputs (leftCost + rightCost), then for every row on the left, compare against every row on the right (leftRows * rightRows * COMPARISON_COST). This is O(n²) - the hallmark of nested loop join.

The quadratic term is why predicate pushdown matters so much. If the left input has 1000 rows and the right has 500, the join examines 500,000 row pairs. But if a filter reduces the left to 100 rows, it drops to 50,000 - a 10x improvement. The cost model quantifies this improvement, which is what the join reorder rule uses to decide whether swapping inputs is worthwhile.

The code comments note that a hash join would have a different formula: (leftRows + rightRows) * HASH_COST - linear instead of quadratic, because the hash join builds a hash table on one side and probes it with the other. We'll see this difference in action when we implement physical operators in Part 6.
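As a rough sketch of that linear formula (hypothetical - hash join isn't part of our system yet, and this ignores memory limits and the build/probe asymmetry):

```java
// Hypothetical sketch comparing the two join cost formulas, using the
// series' default constants. Real hash join costing must also model memory.
public class HashJoinCostSketch {
    static final double HASH_COST = 0.005;
    static final double COMPARISON_COST = 0.001;

    static double nestedLoopCost(long leftRows, long rightRows) {
        // Compare every left row against every right row: quadratic.
        return leftRows * rightRows * COMPARISON_COST;
    }

    static double hashJoinCost(long leftRows, long rightRows) {
        // Build a hash table on one side, probe with the other: linear in both inputs.
        return (leftRows + rightRows) * HASH_COST;
    }

    public static void main(String[] args) {
        // 1000 x 500: nested loop examines 500,000 pairs (cost 500.0);
        // hash join touches 1,500 rows (cost 7.5).
        System.out.println(nestedLoopCost(1000, 500));
        System.out.println(hashJoinCost(1000, 500));
    }
}
```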

Aggregate cost

Aggregation hashes input rows into groups and computes aggregate values:

private double estimateAggregateCost(LogicalAggregate aggregate) {
    double childCost = estimate(aggregate.getChild());
    long inputRows = aggregate.getChild().getEstimatedRows();
    long outputRows = estimateCardinality(aggregate);

    double aggregationCost = inputRows * costConfig.HASH_COST
                           + outputRows * costConfig.TUPLE_COST;

    return childCost + aggregationCost;
}

The hash cost comes from inserting each input row into the hash table (bucketed by the GROUP BY key). The tuple cost comes from materializing the output rows (one per group).

The recursive pattern

All five formulas follow the same pattern: cost(node) = cost(children) + local_work(node). This means estimate is implicitly recursive - calling estimate(join) calls estimate(join.getLeft()) and estimate(join.getRight()), which may themselves be joins or filters or scans. The total cost at the root represents the end-to-end cost of the entire plan. This recursive structure mirrors how the plan will actually execute: the root operator pulls from its children, which pull from their children, all the way down to the scans.
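Tracing the pattern by hand for a filter over the 1000-row scan from earlier (a hand computation with the default constants, not code from the optimizer):

```java
// Worked example of cumulative costing: cost(node) = cost(children) + local work.
public class CumulativeCostExample {
    static final double PAGE_COST = 1.0, TUPLE_COST = 0.01, COMPARISON_COST = 0.001;
    static final int PAGE_SIZE = 100;

    static double scanCost(long rows) {
        long pages = (rows + PAGE_SIZE - 1) / PAGE_SIZE;
        return pages * PAGE_COST + rows * TUPLE_COST;
    }

    // Filter over a scan: the child's cumulative cost plus one comparison per input row.
    static double filterOverScanCost(long rows) {
        return scanCost(rows) + rows * COMPARISON_COST;
    }

    public static void main(String[] args) {
        System.out.println(scanCost(1000));           // 10 pages * 1.0 + 1000 * 0.01 = 20.0
        System.out.println(filterOverScanCost(1000)); // 20.0 + 1000 * 0.001 = 21.0
    }
}
```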

CardinalityEstimator: predicting row counts

Cost formulas need to know how many rows flow through each operator. That's the cardinality estimator's job. It dispatches on operator type, just like the cost model:

public class CardinalityEstimator {
    private final Catalog catalog;

    public long estimate(LogicalNode node) {
        return switch (node) {
            case LogicalScan scan       -> estimateScan(scan);
            case LogicalFilter filter   -> estimateFilter(filter);
            case LogicalProject project -> estimateProject(project);
            case LogicalJoin join       -> estimateJoin(join);
            case LogicalAggregate agg   -> estimateAggregate(agg);
            default -> 1000;
        };
    }
}

Scan cardinality

The simplest case - a scan produces as many rows as the table has:

private long estimateScan(LogicalScan scan) {
    TableMetadata table = catalog.getTableMetadata(scan.getTableName());
    return table.getRowCount();
}

Filter cardinality

A filter reduces its input by a factor called selectivity:

private long estimateFilter(LogicalFilter filter) {
    long inputRows = estimate(filter.getChild());
    double selectivity = estimateSelectivity(filter.getPredicate(), filter.getChild());
    return Math.max(1, (long) (inputRows * selectivity));
}

A selectivity of 0.1 means "10% of rows pass the filter." If the input is 1000 rows and selectivity is 0.1, the filter outputs 100 rows. The Math.max(1, ...) prevents zero-row estimates, which would break downstream cost calculations (a join with zero rows on one side would appear free).

Project cardinality

Projection doesn't change row counts - it only changes the width of each row:

private long estimateProject(LogicalProject project) {
    return estimate(project.getChild());
}

Join cardinality

This is where cardinality estimation gets interesting. For an equality join (a.id = b.fk), there's a standard formula from database theory:

|A ⋈ B| = (|A| × |B|) / max(NDV(a.id), NDV(b.fk))

The intuition: if a.id has 100 distinct values and b.fk has 50, then on average each fk value matches |A| / NDV(a.id) rows from A. Summing that over all |B| rows of B gives the formula above, with the larger of the two NDVs in the denominator.

private long estimateJoin(LogicalJoin join) {
    long leftRows = estimate(join.getLeft());
    long rightRows = estimate(join.getRight());

    Expression condition = join.getCondition();
    if (isEqualityJoin(condition)) {
        long ndv = estimateJoinNDV(condition, join.getLeft(), join.getRight());
        if (ndv > 0) {
            return Math.max(1, (leftRows * rightRows) / ndv);
        }
    }

    // Fallback for non-equality joins: assume 10% selectivity
    return Math.max(1, (long) (leftRows * rightRows * 0.1));
}

estimateJoinNDV extracts the column references from both sides of the equality, finds the NDV for each via the catalog, and returns the maximum. Using max(NDV_left, NDV_right) in the denominator is the standard heuristic - it works well when one column is a primary key (NDV = row count) and the other is a foreign key.

For non-equality joins (which we rarely encounter in our restricted SQL), the fallback is a flat 10% selectivity on the cross product. This is conservative but avoids wildly wrong estimates.

private long estimateJoinNDV(Expression condition,
                             LogicalNode left, LogicalNode right) {
    if (!(condition instanceof Expression.BinaryOp binary)) return 0;

    Expression.ColumnRef leftCol = null;
    Expression.ColumnRef rightCol = null;

    if (binary.left() instanceof Expression.ColumnRef)
        leftCol = (Expression.ColumnRef) binary.left();
    if (binary.right() instanceof Expression.ColumnRef)
        rightCol = (Expression.ColumnRef) binary.right();

    if (leftCol == null || rightCol == null) return 0;

    long leftNDV = getColumnNDV(leftCol, left);
    long rightNDV = getColumnNDV(rightCol, right);

    return Math.max(leftNDV, rightNDV);
}
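A quick worked example makes the formula concrete (the row counts and NDVs are invented for illustration):

```java
// Worked example of the equi-join cardinality formula:
// |A join B| = (|A| * |B|) / max(NDV(a.id), NDV(b.fk))
public class JoinCardinalityExample {
    static long estimateEquiJoin(long leftRows, long rightRows, long leftNDV, long rightNDV) {
        long ndv = Math.max(leftNDV, rightNDV);
        return Math.max(1, (leftRows * rightRows) / ndv);
    }

    public static void main(String[] args) {
        // customers(id): 100 rows, id is a primary key, so NDV = 100.
        // orders(customer_id): 1000 rows, 80 distinct customers placed orders.
        // Estimate: (100 * 1000) / max(100, 80) = 1000 - i.e. each order joins
        // with exactly one customer, which is what a PK/FK join actually produces.
        System.out.println(estimateEquiJoin(100, 1000, 100, 80)); // 1000
    }
}
```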

Aggregate cardinality

Aggregation reduces rows to the number of distinct groups:

private long estimateAggregate(LogicalAggregate aggregate) {
    long inputRows = estimate(aggregate.getChild());

    if (aggregate.getGroupByColumns().isEmpty()) {
        return 1;  // No GROUP BY - single output row
    }

    // Heuristic: assume 10% of input rows are distinct groups
    return Math.max(1, Math.min(inputRows, inputRows / 10));
}

With no GROUP BY, there's exactly one output row (the aggregate over the entire input). With GROUP BY, we use a heuristic - 10% of the input size. A more accurate approach would multiply the NDVs of the group-by columns, but for our purposes the heuristic works.
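Here's a sketch of that NDV-based approach (hypothetical - the NDV map stands in for the catalog lookups our estimator would need, and the product assumes the group-by columns are independent):

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: estimate GROUP BY output as the product of the
// group-by columns' NDVs, capped at the input row count.
public class GroupByCardinalitySketch {
    static long estimateGroups(long inputRows, List<String> groupBy, Map<String, Long> ndv) {
        long groups = 1;
        for (String column : groupBy) {
            groups *= ndv.getOrDefault(column, 10L); // small fallback guess without stats
            if (groups >= inputRows) return inputRows; // can't have more groups than rows
        }
        return Math.max(1, groups);
    }

    public static void main(String[] args) {
        Map<String, Long> ndv = Map.of("city", 50L, "status", 4L);
        // GROUP BY city, status over 10,000 rows: at most 50 * 4 = 200 groups.
        System.out.println(estimateGroups(10_000, List.of("city", "status"), ndv)); // 200
        // Over 100 rows, the cap kicks in: 100 groups at most.
        System.out.println(estimateGroups(100, List.of("city", "status"), ndv)); // 100
    }
}
```

Like the AND selectivity formula, the product overestimates when the columns are correlated (every city pairs with only a few statuses, say), which is why the cap on input rows matters.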

Selectivity estimation

Selectivity is the heart of cardinality estimation. It answers: what fraction of input rows will satisfy a given predicate? The estimator handles different predicate types with different heuristics.

Equality predicates: 1/NDV

For column = value, the classic estimate is 1 divided by the number of distinct values:

private double estimateEqualitySelectivity(
        Expression.BinaryOp predicate, LogicalNode input) {
    Expression.ColumnRef column = null;

    if (predicate.left() instanceof Expression.ColumnRef)
        column = (Expression.ColumnRef) predicate.left();
    else if (predicate.right() instanceof Expression.ColumnRef)
        column = (Expression.ColumnRef) predicate.right();

    if (column != null) {
        String tableName = findTableForColumn(column, input);
        if (tableName != null) {
            TableMetadata table = catalog.getTableMetadata(tableName);
            ColumnStats stats = table.getColumnStats(column.columnName());
            if (stats != null && stats.numDistinctValues() > 0) {
                return 1.0 / stats.numDistinctValues();
            }
        }
    }

    return 0.1; // Default when stats unavailable
}

If a column has 5 distinct values, equality selectivity is 0.2. This assumes uniform distribution - each value appears equally often. That's often wrong (some cities have more customers than others), but it's the best we can do without histograms. (We'll add histograms in Part 5.)

The method checks both sides of the equality for a column reference, because the predicate could be city = 'Seattle' or 'Seattle' = city - both are valid SQL.

Range predicates: the 0.33 heuristic

For column > value, column < value, and similar:

private double estimateRangeSelectivity(
        Expression.BinaryOp predicate, LogicalNode input,
        boolean inclusive, boolean greaterThan) {
    Expression.ColumnRef column = null;
    Object value = null;

    // Extract column and literal from either side
    if (predicate.left() instanceof Expression.ColumnRef) {
        column = (Expression.ColumnRef) predicate.left();
        if (predicate.right() instanceof Expression.Literal)
            value = ((Expression.Literal<?>) predicate.right()).value();
    } else if (predicate.right() instanceof Expression.ColumnRef) {
        column = (Expression.ColumnRef) predicate.right();
        if (predicate.left() instanceof Expression.Literal)
            value = ((Expression.Literal<?>) predicate.left()).value();
        greaterThan = !greaterThan; // Flip direction
    }

    if (column != null && value != null) {
        String tableName = findTableForColumn(column, input);
        if (tableName != null) {
            TableMetadata table = catalog.getTableMetadata(tableName);
            Histogram<?> histogram = table.getHistogram(column.columnName());
            if (histogram != null) {
                return greaterThan
                    ? histogram.estimateGreaterThan(value, inclusive)
                    : histogram.estimateLessThan(value, inclusive);
            }
        }
    }

    return 0.33; // Default heuristic
}

Without histograms, we default to 0.33 - "roughly a third of the rows." This is a well-known heuristic used by many database systems. It's crude but avoids the extremes: 0.5 would overestimate too often, and 0.1 would underestimate. The method already has the histogram integration path - it checks for a histogram first and uses it if available. We'll implement histograms in the next post.

There's a subtle detail in the column-on-right case: when the predicate is 100 < price (instead of the more natural price > 100), the method flips the greaterThan flag to account for the reversed comparison direction.

Compound predicates: AND and OR

For AND, selectivities multiply (independence assumption):

case AND:
    double leftSel = estimateSelectivity(binary.left(), input);
    double rightSel = estimateSelectivity(binary.right(), input);
    return leftSel * rightSel;

For OR, the inclusion-exclusion formula:

case OR:
    double leftSelOr = estimateSelectivity(binary.left(), input);
    double rightSelOr = estimateSelectivity(binary.right(), input);
    return leftSelOr + rightSelOr - (leftSelOr * rightSelOr);

The AND formula assumes predicates are independent - knowing that city = 'Seattle' doesn't change the probability that age > 30. This is often wrong (cities correlate with age demographics), but it's simple and widely used. The OR formula correctly avoids double-counting rows that satisfy both predicates.

In canonical form, AND rarely appears in filter predicates (they've been split into separate filters). But it can appear after filter merge recombines them, or in join conditions.
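Plugging in numbers from earlier in the post - 0.2 for an equality on a 5-value column, 0.33 for the range default:

```java
// Worked example of compound-predicate selectivity under the independence assumption.
public class CompoundSelectivityExample {
    static double andSelectivity(double a, double b) { return a * b; }
    static double orSelectivity(double a, double b)  { return a + b - a * b; }

    public static void main(String[] args) {
        double citySel  = 0.2;  // city = 'Seattle', 5 distinct cities -> 1/5
        double totalSel = 0.33; // total > 100, range default
        System.out.println(andSelectivity(citySel, totalSel)); // 0.2 * 0.33 = 0.066
        System.out.println(orSelectivity(citySel, totalSel));  // 0.2 + 0.33 - 0.066 = 0.464
    }
}
```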

Finding which table a column belongs to

Several estimation methods need to resolve a column reference to a table - findTableForColumn walks the plan tree to find a LogicalScan whose table contains the named column:

private String findTableForColumn(Expression.ColumnRef column, LogicalNode node) {
    if (node instanceof LogicalScan scan) {
        var table = catalog.getTableMetadata(scan.getTableName());
        if (table.getSchema().hasColumn(column.columnName())) {
            return scan.getTableName();
        }
    }
    for (LogicalNode child : node.getChildren()) {
        String tableName = findTableForColumn(column, child);
        if (tableName != null) return tableName;
    }
    return null;
}

This is necessary because column references in expressions may or may not be table-qualified. Even when they are qualified (e.g., c.city), the qualifier is the alias, not the table name. The method resolves this by checking which scan node's schema contains the column.

Cost propagation through the tree

When the demo annotates a plan with costs, it uses a post-order traversal - children before parents:

private static void annotatePlanWithCosts(LogicalNode node, CostModel costModel) {
    for (LogicalNode child : node.getChildren()) {
        annotatePlanWithCosts(child, costModel);
    }
    node.setEstimatedRows(costModel.estimateCardinality(node));
    node.setEstimatedCost(costModel.estimate(node));
}

After this runs, every node in the tree carries its estimated row count and cumulative cost. The pretty printer displays these annotations:

BEFORE Optimization:
└── Project[name, total] [rows=2, cost=0.11]
    └── Filter[(o.total > 100)] [rows=2, cost=0.10]
        └── Filter[(c.city = 'Seattle')] [rows=8, cost=0.10]
            └── Join[INNER, (c.id = o.customer_id)] [rows=10, cost=0.10]
                ├── Scan[customers] [rows=8, cost=1.08]
                └── Scan[orders] [rows=10, cost=1.10]

AFTER Optimization:
└── Project[name, total] [rows=2, cost=0.03]
    └── Join[INNER, (c.id = o.customer_id)] [rows=6, cost=0.03]
        ├── Filter[(c.city = 'Seattle')] [rows=2, cost=1.09]
        │   └── Scan[customers] [rows=8, cost=1.08]
        └── Filter[(o.total > 100)] [rows=3, cost=1.11]
            └── Scan[orders] [rows=10, cost=1.10]

The cost drop at the join is dramatic. Before optimization, the join sees all 8 customers × 10 orders = 80 pair comparisons. After pushdown, it sees 2 filtered customers × 3 filtered orders = 6 pair comparisons. The cost model quantifies exactly this difference.

Demonstrating cost sensitivity

One of the benefits of configurable cost constants is the ability to show how different hardware assumptions change optimizer behavior. The demo includes a cost sensitivity analysis:

// Default configuration
CostModel defaultModel = new SimpleCostModel(catalog); // single-arg overload: default CostConfig

// Simulate expensive I/O (spinning disks)
CostModel.CostConfig expensiveIO = new CostModel.CostConfig();
expensiveIO.PAGE_COST = 2.0;
CostModel expensiveIOModel = new SimpleCostModel(catalog, expensiveIO);

// Simulate expensive CPU
CostModel.CostConfig expensiveCPU = new CostModel.CostConfig();
expensiveCPU.TUPLE_COST = 0.02;
CostModel expensiveCPUModel = new SimpleCostModel(catalog, expensiveCPU);

With PAGE_COST = 2.0, scan costs double, making the optimizer favor plans that read fewer pages. With TUPLE_COST = 0.02, CPU costs double, making the optimizer favor plans that process fewer tuples. In a system with both index scans and full table scans, the PAGE_COST ratio would directly influence which access path the optimizer chooses.

For our current system (which only has full table scans), the effect is more visible on join ordering: higher COMPARISON_COST makes the optimizer more aggressive about putting smaller relations on the inner side of nested loop joins.
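You can see the knobs at work by hand-computing the scan formula under each configuration (an illustration of the arithmetic, not the demo's actual output):

```java
// Hand computation: how the scan cost formula responds to each cost knob.
public class CostSensitivityExample {
    static double scanCost(long rows, double pageCost, double tupleCost, int pageSize) {
        long pages = (rows + pageSize - 1) / pageSize;
        return pages * pageCost + rows * tupleCost;
    }

    public static void main(String[] args) {
        long rows = 1000;
        // Default: I/O and CPU each contribute 10.0 to a 1000-row scan.
        System.out.println(scanCost(rows, 1.0, 0.01, 100)); // 20.0
        // Expensive I/O (PAGE_COST = 2.0): the I/O term doubles.
        System.out.println(scanCost(rows, 2.0, 0.01, 100)); // 30.0
        // Expensive CPU (TUPLE_COST = 0.02): the CPU term doubles instead.
        System.out.println(scanCost(rows, 1.0, 0.02, 100)); // 30.0
    }
}
```

The totals happen to match in the last two cases, but they penalize different plan shapes: the first punishes plans that read many pages, the second punishes plans that touch many rows.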

What production systems do differently

Our cost model captures the essential ideas, but production systems invest much more sophistication in several areas.

Multi-dimensional costs. Real optimizers often track multiple cost dimensions separately - CPU time, I/O time, memory usage, network transfer (for distributed queries). A plan might be cheap on CPU but expensive on I/O, and the optimizer needs to balance these. Our single scalar cost simplifies this trade-off away.

Memory-aware costing. Hash join cost depends heavily on whether the build side fits in memory. If it does, the cost is linear. If it doesn't, the algorithm spills to disk, and cost jumps dramatically. Our cost model doesn't model memory at all.

Correlated predicates. Our AND selectivity formula assumes independence between predicates. Real data often violates this - state = 'California' AND city = 'San Francisco' are highly correlated. Production systems may use multi-column statistics or column group statistics to handle this.

Estimation error propagation. Cardinality estimation errors compound through the plan tree. If a filter's selectivity estimate is off by 2x, every operator above it inherits a 2x error in its input size - and when both inputs of a join carry 2x errors, they multiply into a 4x error in the join's estimated size and cost. Production systems sometimes use techniques like estimation sanity checks, plan robustness analysis, or adaptive query execution to cope with this.

Histograms. The 0.33 default for range predicates is a blunt instrument. Histograms provide much better estimates by capturing the actual data distribution. We'll implement equi-depth histograms in the next post, which will dramatically improve range predicate estimation.

What's next

The cost model and cardinality estimator are functional, but we've seen their limitations - particularly the 0.33 default for range predicates and the greedy join ordering heuristic. In the next post, we'll address both: equi-depth histograms for dramatically better selectivity estimation, dynamic programming join ordering (the System R algorithm) for optimal join plans, and hardware-aware cost calibration to make cost constants meaningful.

Next up: Part 5 - Going Deeper: Histograms, DP Join Ordering, and Cost Calibration