Building a simple Query Optimizer from scratch: Part 1
Architecture & Foundation
This series
- Part 1: Designing a Simple Query Optimizer: Architecture & Foundation (this post)
- Part 2: From SQL to Canonical Logical Plans
- Part 3: A Composable Rule Engine and Logical Rewrites
- Part 4: Cost Modeling and Cardinality Estimation
- Part 5: Going Deeper: Histograms, DP Join Ordering, and Cost Calibration
- Part 6: Physical Execution: From Plans to Results
Every time you write a SQL query, something remarkable happens before a single row is read. The database's query optimizer takes your declarative request - what data you want - and figures out how to get it efficiently. Should it scan the whole table or use an index? Which table should it read first in a join? Can it filter rows earlier to avoid unnecessary work? These decisions can mean the difference between a query that finishes in milliseconds and one that takes hours.
In this series, we'll build a query optimizer from scratch in Java. Not a toy that optimizes nothing, and not a production system with every bell and whistle - something in between. A system with clean architecture, real optimization rules, a cost model that makes meaningful choices, and an execution engine that runs queries end-to-end.
By the end of the series, you'll understand how the major pieces of a query optimizer fit together, and you'll have a codebase you can extend in any direction that interests you.
What we're building
Here is the pipeline that a SQL query will flow through in our optimizer:
SQL String
|
v
+---------+
| Parser | --> Abstract Syntax Tree (AST)
+---------+
|
v
+-------------------+
| LogicalPlanBuilder| --> Logical Plan (relational algebra tree)
+-------------------+
|
v
+------------------+
| Rule Engine | --> Optimized Logical Plan
| (+ Cost Model) |
+------------------+
|
v
+--------------------+
| PhysicalPlanBuilder| --> Physical Plan (with algorithm choices)
+--------------------+
|
v
+--------------+
| Executor | --> Result Tuples
+--------------+
Supporting all of this is a Catalog - the central registry of table schemas, row data, and column statistics that every other component consults.
Each layer has a clear responsibility. The parser turns SQL text into a structured tree. The logical planner converts that tree into relational algebra operators. The rule engine transforms those operators into more efficient arrangements. The physical planner picks concrete algorithms (nested loop join vs. hash join, for example). And the executor actually runs the plan and returns rows.
This layering isn't accidental. It's how production systems like Apache Calcite, PostgreSQL, and CockroachDB are organized, because each concern is different enough to warrant its own abstraction. In this first post, we'll build the foundation that all the other layers rest on: the catalog, the expression system, and the core interfaces.
Package layout
The project is organized into packages that mirror the pipeline:
src/main/java/org/query/optimizer/
├── catalog/ Schema, TableMetadata, ColumnStats, DataType
├── parser/ SQL parser, AST classes, LogicalPlanBuilder
├── logical/ Expression, LogicalNode (abstract base)
├── rules/ PredicatePushdown, ProjectionPushdown, ...
├── physical/ PhysicalNode, PhysicalScan, PhysicalHashJoin, ...
├── executor/ Iterator interface, Executor
├── Rule.java Optimization rule interface
├── RuleEngine.java Fixpoint rule application
├── SimpleCostModel.java
├── CardinalityEstimator.java
└── ...
The catalog/ package knows nothing about optimization. The rules/ package knows nothing about physical execution. Each package depends on the layers below it but not above. This means you could swap out the rule engine without touching the parser, or plug in a different cost model without changing the executor.
The Catalog: where metadata lives
Before we can optimize anything, we need to know what data looks like. How many rows does a table have? How many distinct values are in a column? What's the minimum and maximum? The catalog is the single source of truth for all of this.
Data types and schemas
We support three data types - enough to be useful without drowning in type-system complexity:
public enum DataType {
INTEGER,
FLOAT,
VARCHAR;
public Object parse(String value) {
if (value == null || value.trim().isEmpty()) return null;
return switch (this) {
case INTEGER -> Integer.parseInt(value.trim());
case FLOAT -> Float.parseFloat(value.trim());
case VARCHAR -> value.trim();
};
}
}
A schema is an ordered list of named, typed columns:
public class Schema {
public record Column(String name, DataType type) {}
private final List<Column> columns;
private final Map<String, Integer> columnIndex;
public Schema(List<Column> columns) {
this.columns = List.copyOf(columns);
this.columnIndex = new HashMap<>();
for (int i = 0; i < columns.size(); i++) {
columnIndex.put(columns.get(i).name().toLowerCase(), i);
}
}
    public Column getColumn(String name) {
        Integer idx = columnIndex.get(name.toLowerCase());
        if (idx == null) {
            throw new IllegalArgumentException("Unknown column: " + name);
        }
        return columns.get(idx);
    }
    public boolean hasColumn(String name) {
        return columnIndex.containsKey(name.toLowerCase());
    }
public int columnCount() { return columns.size(); }
}
The columnIndex map gives us O(1) column lookups by name, which we'll use constantly - in expression evaluation, predicate pushdown, schema resolution, and more.
Column statistics
For the optimizer to make good decisions, it needs to know something about the data distribution. We capture four things per column:
public record ColumnStats(
String columnName,
long numDistinctValues, // NDV - key for equality selectivity
Object minValue,
Object maxValue,
long numNulls
) {
public static class Builder {
// fluent builder for constructing stats
}
}
The NDV (number of distinct values) is especially important. If a column has 5 distinct values and you filter on equality (WHERE city = 'Seattle'), the optimizer can estimate that roughly 1/5 of the rows will match. That's a crude but surprisingly useful heuristic, and it's what production systems fall back on when they don't have histogram data.
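As a back-of-the-envelope sketch of that heuristic (the class and method names here are illustrative, not part of the series codebase):

```java
public class SelectivityDemo {
    // Equality selectivity under the uniform-distribution assumption:
    // a predicate like `city = 'Seattle'` matches roughly 1/NDV of the rows.
    static double equalitySelectivity(long numDistinctValues) {
        return numDistinctValues > 0 ? 1.0 / numDistinctValues : 1.0;
    }

    static long estimateMatchingRows(long rowCount, long ndv) {
        return Math.round(rowCount * equalitySelectivity(ndv));
    }

    public static void main(String[] args) {
        // 1,000 rows, 5 distinct cities -> expect about 200 matching rows
        System.out.println(estimateMatchingRows(1000, 5));
    }
}
```

The uniformity assumption is wrong for skewed data - which is exactly the gap histograms close in Part 5.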
TableMetadata
Each table bundles its schema, statistics, and the actual row data together:
public class TableMetadata {
private final String tableName;
private final Schema schema;
private final long rowCount;
private final Map<String, ColumnStats> columnStats;
private final List<Map<Schema.Column, Object>> data;
// ... constructor, getters ...
public void addColumnStats(ColumnStats stats) {
columnStats.put(stats.columnName().toLowerCase(), stats);
}
public ColumnStats getColumnStats(String columnName) {
return columnStats.get(columnName.toLowerCase());
}
}
Storing the actual data in memory alongside the metadata is a deliberate simplification. A real database would have a storage engine with pages, buffer pools, and disk I/O. For our purposes, having the data right there lets us focus on the optimizer itself without building an entire storage layer.
The Catalog class
The Catalog ties everything together. It's a registry of tables and the entry point for loading data:
public class Catalog {
private final Map<String, TableMetadata> tables;
public Catalog() {
this.tables = new HashMap<>();
}
public TableMetadata loadTableFromCSV(String tableName, String csvPath)
throws IOException {
// 1. Parse header: "id:INTEGER,name:VARCHAR,price:FLOAT"
// 2. Parse data rows
// 3. Create TableMetadata
// 4. Collect statistics
// 5. Register in catalog
}
public TableMetadata getTableMetadata(String tableName) {
TableMetadata table = tables.get(tableName.toLowerCase());
if (table == null) {
throw new IllegalArgumentException("Table not found: " + tableName);
}
return table;
}
}
The CSV format uses a typed header - id:INTEGER,name:VARCHAR,price:FLOAT - so the catalog can automatically build a schema and parse values into the correct Java types. After loading rows, the catalog computes column statistics by making a single pass over the data, collecting distinct values, min/max values, and null counts for every column.
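The loading code itself is elided above; a minimal sketch of just the header-parsing step might look like this (parseHeader is my own name, and the nested enum and record are stand-ins for the real DataType and Schema.Column so the snippet compiles on its own):

```java
import java.util.ArrayList;
import java.util.List;

public class HeaderParser {
    // Minimal stand-ins for the DataType enum and Schema.Column record
    // defined earlier in this post.
    enum DataType { INTEGER, FLOAT, VARCHAR }
    record Column(String name, DataType type) {}

    // Parses a typed CSV header like "id:INTEGER,name:VARCHAR,price:FLOAT".
    static List<Column> parseHeader(String headerLine) {
        List<Column> columns = new ArrayList<>();
        for (String field : headerLine.split(",")) {
            String[] parts = field.trim().split(":");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Bad header field: " + field);
            }
            columns.add(new Column(parts[0].trim(),
                    DataType.valueOf(parts[1].trim().toUpperCase())));
        }
        return columns;
    }

    public static void main(String[] args) {
        var cols = parseHeader("id:INTEGER,name:VARCHAR,price:FLOAT");
        System.out.println(cols.size() + " " + cols.get(2).type()); // prints: 3 FLOAT
    }
}
```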
Here's the core of the statistics collection:
private void collectStatistics(TableMetadata table) {
    Schema schema = table.getSchema();
    List<Map<Schema.Column, Object>> data = table.getData();
    for (int colIdx = 0; colIdx < schema.columnCount(); colIdx++) {
Schema.Column column = schema.getColumn(colIdx);
Set<Object> distinctValues = new HashSet<>();
Object minValue = null, maxValue = null;
long nullCount = 0;
for (Map<Schema.Column, Object> row : data) {
Object value = row.get(column);
if (value == null) { nullCount++; continue; }
distinctValues.add(value);
// update min/max based on column type...
}
ColumnStats stats = new ColumnStats.Builder(column.name())
.setNumDistinctValues(distinctValues.size())
.setMinValue(minValue)
.setMaxValue(maxValue)
.setNumNulls(nullCount)
.build();
table.addColumnStats(stats);
}
}
This is straightforward: one pass, one HashSet for distinct values, and manual min/max tracking. It's the same idea behind ANALYZE in PostgreSQL or COMPUTE STATISTICS in SQL Server - compute summaries of the data so the optimizer doesn't have to look at actual rows during planning.
The Expression system
Expressions appear everywhere in a query optimizer. The WHERE clause is an expression. Join conditions are expressions. The SELECT list contains expressions. Rather than having each layer define its own representation, we define a single Expression interface that's shared across logical plans, physical plans, and the executor.
public interface Expression {
Object evaluate(Tuple row, Schema schema);
String toSQLString();
}
Every expression can do two things: evaluate itself against a row to produce a value, and render itself as a SQL string for debugging. There are three concrete kinds:
Column references
record ColumnRef(String tableName, String columnName) implements Expression {
public static ColumnRef from(String columnName) {
return new ColumnRef(null, columnName);
}
@Override
public Object evaluate(Tuple row, Schema schema) {
return row.find(schema.getColumn(columnName));
}
@Override
public String toSQLString() {
return tableName != null ? tableName + "." + columnName : columnName;
}
}
A ColumnRef can be qualified (customers.id) or unqualified (id). Qualified references are essential once joins introduce multiple tables - the optimizer needs to know which table's id column a predicate refers to in order to decide where to push it.
Literals
record Literal<T extends Comparable<? super T>>(T value) implements Expression {
@Override
public Object evaluate(Tuple row, Schema schema) {
return value;
}
@Override
public String toSQLString() {
if (value instanceof String) return "'" + value + "'";
return value.toString();
}
}
Binary operations
record BinaryOp(Operator operator, Expression left, Expression right)
implements Expression {
public enum Operator {
EQ("="), NEQ("!="), GT(">"), GTE(">="), LT("<"), LTE("<="),
AND("AND"), OR("OR");
// ...
}
@Override
public Object evaluate(Tuple row, Schema schema) {
Object leftVal = left.evaluate(row, schema);
Object rightVal = right.evaluate(row, schema);
return switch (operator()) {
case EQ -> leftVal.equals(rightVal);
case GT -> compare(leftVal, rightVal) > 0;
case AND -> (Boolean) leftVal && (Boolean) rightVal;
// ... other operators
};
}
}
BinaryOp handles both comparisons (price > 100) and logical connectives (AND, OR). This means a complex predicate like category = 'Electronics' AND price > 100 is simply a tree of BinaryOp nodes:
            AND
           /   \
         EQ     GT
        /  \   /  \
category 'Electronics'  price  100
The evaluate method is recursive: evaluating an AND evaluates both children first, then combines the results. This tree structure will become important when the optimizer needs to analyze predicates - for instance, determining which tables a predicate references to decide if it can be pushed below a join.
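As one example of such analysis, here is a hedged sketch of that kind of recursive walk (collectColumns and the nested stand-in records are mine, not the series' actual API):

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class PredicateAnalysis {
    // Minimal stand-ins for the expression records from this post.
    sealed interface Expr permits ColumnRef, Literal, BinaryOp {}
    record ColumnRef(String table, String column) implements Expr {}
    record Literal(Object value) implements Expr {}
    record BinaryOp(String op, Expr left, Expr right) implements Expr {}

    // Recursively collect every column a predicate references. The optimizer
    // needs exactly this kind of walk to decide whether a filter mentions
    // only one side of a join and can therefore be pushed below it.
    static Set<String> collectColumns(Expr e) {
        Set<String> out = new LinkedHashSet<>();
        switch (e) {
            case ColumnRef c -> out.add(c.column());
            case Literal ignored -> { /* no columns in a literal */ }
            case BinaryOp b -> {
                out.addAll(collectColumns(b.left()));
                out.addAll(collectColumns(b.right()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // category = 'Electronics' AND price > 100
        Expr pred = new BinaryOp("AND",
            new BinaryOp("=", new ColumnRef(null, "category"), new Literal("Electronics")),
            new BinaryOp(">", new ColumnRef(null, "price"), new Literal(100)));
        System.out.println(collectColumns(pred)); // prints: [category, price]
    }
}
```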
Using Java records for all three types gives us immutability, automatic equals/hashCode, and destructuring in pattern matches - all useful properties when transforming plan trees.
Core interfaces
With the catalog and expression system in place, we can now define the contracts that the rest of the optimizer will be built around. These are deliberately thin - each one captures a single idea.
LogicalNode
The base class for all logical plan operators. Logical plans represent the what of a query - scan this table, filter on this predicate, join these two inputs - without specifying how to do it.
public abstract class LogicalNode {
private final Map<String, Object> annotations = new HashMap<>();
public abstract List<LogicalNode> getChildren();
public abstract LogicalNode withChildren(List<LogicalNode> children);
public abstract String describe();
/* --- Annotations for optimizer metadata --- */
public void setEstimatedRows(long rows) {
annotations.put("est_rows", rows);
}
public long getEstimatedRows() {
Object val = annotations.get("est_rows");
return val != null ? (Long) val : -1;
}
public void setEstimatedCost(double cost) {
annotations.put("est_cost", cost);
}
// ... more annotation methods ...
}
Three things to note about this design:
withChildren enables immutable transformations. The optimizer doesn't mutate plan trees in place. When a rule wants to push a filter below a join, it creates new nodes with rearranged children. withChildren makes this easy - you build new subtrees and hand them back. This means you can always compare the plan before and after optimization without worrying about aliasing.
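To make that concrete, here's a sketch of a generic bottom-up rewrite helper built on withChildren (the Node interface, Op record, and transformUp are illustrative stand-ins; the real rule machinery arrives in Part 3):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

public class TreeTransform {
    // Minimal stand-in for LogicalNode: just children plus withChildren.
    interface Node {
        List<Node> getChildren();
        Node withChildren(List<Node> children);
    }

    // A tiny concrete operator so the sketch runs.
    record Op(String name, List<Node> children) implements Node {
        public List<Node> getChildren() { return children; }
        public Node withChildren(List<Node> cs) { return new Op(name, cs); }
    }

    // Rewrite bottom-up: transform the children first, rebuild this node
    // via withChildren, then let the rewrite see the rebuilt node.
    // Nothing is mutated, so the original tree stays intact for comparison.
    static Node transformUp(Node node, UnaryOperator<Node> rewrite) {
        List<Node> newChildren = new ArrayList<>();
        for (Node child : node.getChildren()) {
            newChildren.add(transformUp(child, rewrite));
        }
        return rewrite.apply(node.withChildren(newChildren));
    }

    public static void main(String[] args) {
        Node plan = new Op("Join", List.of(
                new Op("Scan[a]", List.of()), new Op("Scan[b]", List.of())));
        // Replace every scan; the join is rebuilt around the new children.
        Node rewritten = transformUp(plan, n ->
                n instanceof Op op && op.name().startsWith("Scan")
                        ? new Op("IndexScan" + op.name().substring(4), op.children())
                        : n);
        System.out.println(rewritten);
    }
}
```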
Annotations carry optimizer metadata. Estimated row counts, estimated costs, selectivity - all of this gets attached to plan nodes as the optimizer works. The key insight is that this metadata doesn't belong in the node's constructor. A LogicalFilter shouldn't require a cost estimate to be created - that estimate only becomes available later, when the cost model runs. Annotations are the "write later, read later" solution.
Pretty printing uses annotations. The toPrettyString method walks the tree and renders each node with its annotations:
public String toPrettyString() {
StringBuilder sb = new StringBuilder();
toPrettyString(sb, "", true);
return sb.toString();
}
private void toPrettyString(StringBuilder sb, String prefix, boolean isTail) {
sb.append(prefix).append(isTail ? "└── " : "├── ");
sb.append(describe());
if (hasAnnotation("est_rows")) {
sb.append(" [rows=").append(getEstimatedRows());
if (hasAnnotation("est_cost")) {
sb.append(", cost=").append(String.format("%.2f", getEstimatedCost()));
}
sb.append("]");
}
sb.append("\n");
// recurse into children...
}
This produces output like:
└── Scan[products] [rows=7, cost=0.07]
Once optimization rules and the cost model are in place, the same printer will show richer trees with cost annotations at every level - making it easy to see the impact of each transformation.
Rule
The interface for optimization transformations. Each rule does one thing: it matches a pattern in the plan tree and replaces it with something better.
public interface Rule {
String getName();
boolean matches(LogicalNode node);
LogicalNode apply(LogicalNode node);
}
matches is the pattern-matching phase - it returns true if this rule can do something with this node. apply performs the actual transformation. This separation matters because the rule engine calls matches on every node in the tree and only calls apply when there's a hit. It also makes rules easy to test: you can check that matches returns true for a given tree shape and false for others, independently of whether apply produces the right output.
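The real RuleEngine is a Part 3 topic, but the matches-then-apply contract already suggests a simple fixpoint driver. A sketch under assumed names, with a toy node and rule so it runs:

```java
import java.util.List;

public class FixpointSketch {
    interface Node {}                       // stand-in for LogicalNode
    interface Rule {
        boolean matches(Node node);
        Node apply(Node node);
    }

    // Toy node: a stack of `count` adjacent filters over a scan.
    record FilterChain(int count) implements Node {}

    // Toy rule: merge two adjacent filters into one.
    static final Rule MERGE_FILTERS = new Rule() {
        public boolean matches(Node n) { return n instanceof FilterChain c && c.count() > 1; }
        public Node apply(Node n) { return new FilterChain(((FilterChain) n).count() - 1); }
    };

    // Apply rules until a full pass changes nothing (a fixpoint), with a
    // pass limit as a safety valve. A real engine also recurses into
    // children; this shows only the top-level loop and the
    // matches-before-apply split.
    static Node optimize(Node plan, List<Rule> rules, int maxPasses) {
        for (int pass = 0; pass < maxPasses; pass++) {
            boolean changed = false;
            for (Rule rule : rules) {
                if (rule.matches(plan)) {
                    Node rewritten = rule.apply(plan);
                    if (!rewritten.equals(plan)) {
                        plan = rewritten;
                        changed = true;
                    }
                }
            }
            if (!changed) return plan;
        }
        return plan;
    }

    public static void main(String[] args) {
        // Three stacked filters collapse to one, then no rule fires.
        System.out.println(optimize(new FilterChain(3), List.of(MERGE_FILTERS), 10));
    }
}
```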
We'll implement concrete rules - predicate pushdown, projection pushdown, filter merge, and join reorder - in Part 3.
CostModel
The interface for estimating how expensive a plan will be to execute:
public interface CostModel {
double estimate(LogicalNode node);
long estimateCardinality(LogicalNode node);
CostConfig getConfig();
}
estimate returns an overall cost (in arbitrary units). estimateCardinality returns the estimated number of output rows. The CostConfig makes the model's constants tunable:
class CostConfig {
public double PAGE_COST = 1.0; // Cost to read one page
public double TUPLE_COST = 0.01; // Cost to process one tuple
public int PAGE_SIZE = 100; // Tuples per page
public double COMPARISON_COST = 0.001; // Cost of one comparison
public double HASH_COST = 0.005; // Cost to hash one tuple
}
Making these configurable isn't just for tidiness. It lets you demonstrate how changing PAGE_COST (which models I/O-bound workloads vs. fast SSDs) changes the optimizer's plan choices. We'll explore this in Part 4.
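For instance, one plausible full-scan cost formula over these constants - the formula is an illustration here; the actual SimpleCostModel is the subject of Part 4:

```java
public class ScanCostDemo {
    // One way to cost a full table scan with the constants above:
    // I/O for each page read, plus CPU for each tuple processed.
    static double scanCost(long rowCount, double pageCost,
                           double tupleCost, int pageSize) {
        long pages = (rowCount + pageSize - 1) / pageSize; // ceiling division
        return pages * pageCost + rowCount * tupleCost;
    }

    public static void main(String[] args) {
        // 10,000 rows at the defaults: 100 pages * 1.0 + 10,000 * 0.01 ~= 200
        System.out.println(scanCost(10_000, 1.0, 0.01, 100));
        // Cheaper I/O (think fast SSD) shifts the balance toward CPU cost
        System.out.println(scanCost(10_000, 0.1, 0.01, 100));
    }
}
```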
Iterator
The Volcano-model execution interface. Every physical operator implements this to produce tuples one at a time:
public interface Iterator {
void open(); // Initialize, allocate resources
Tuple next(); // Get next tuple, or null if done
void close(); // Clean up resources
String describe(); // For debugging
}
The beauty of this model is composability. A filter operator calls next() on its child, checks the predicate, and either returns the tuple or calls next() again. A join operator calls next() on both children. The executor just calls next() on the root until it returns null. Each operator is a self-contained unit that knows nothing about the overall plan structure. We'll implement physical operators in Part 6.
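As a taste of what's coming, here is a hedged sketch of a filter operator against this interface. Tuple and the in-memory source are simplified stand-ins, and describe() is omitted:

```java
import java.util.function.Predicate;

public class FilterIteratorSketch {
    record Tuple(Object[] values) {}

    // Stand-in for the Iterator interface above (describe() omitted).
    interface Iterator {
        void open();
        Tuple next(); // null when exhausted
        void close();
    }

    // A Volcano-style filter: pull from the child until a tuple passes the
    // predicate; propagate null (exhaustion) upward unchanged.
    static Iterator filter(Iterator child, Predicate<Tuple> pred) {
        return new Iterator() {
            public void open() { child.open(); }
            public Tuple next() {
                Tuple t;
                while ((t = child.next()) != null) {
                    if (pred.test(t)) return t;
                }
                return null;
            }
            public void close() { child.close(); }
        };
    }

    // An in-memory leaf operator, standing in for a table scan.
    static Iterator fromArray(Tuple... tuples) {
        return new Iterator() {
            private int pos;
            public void open() { pos = 0; }
            public Tuple next() { return pos < tuples.length ? tuples[pos++] : null; }
            public void close() {}
        };
    }

    public static void main(String[] args) {
        Iterator it = filter(
                fromArray(new Tuple(new Object[]{50}), new Tuple(new Object[]{150})),
                t -> (Integer) t.values()[0] > 100);
        it.open();
        for (Tuple t = it.next(); t != null; t = it.next()) {
            System.out.println(t.values()[0]); // only the 150 tuple survives
        }
        it.close();
    }
}
```

Note that the filter knows nothing about its child beyond the interface - it composes identically over a scan, a join, or another filter.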
PhysicalNode
Similar to LogicalNode, but for physical plans - the layer where algorithm choices have been made:
public abstract class PhysicalNode {
private final Map<String, Object> annotations = new HashMap<>();
public abstract List<PhysicalNode> getChildren();
public abstract String describe();
public abstract double estimateCost();
// Same annotation system as LogicalNode...
}
The main difference is estimateCost() - a physical node knows its own cost because it knows which algorithm it represents.
Putting it together: a hardcoded plan
We don't have a parser yet (that's Part 2), but we can still exercise the foundation by building a plan by hand. This is exactly what the FoundationDemo does:
// Load tables into the catalog
Catalog catalog = new Catalog();
TableMetadata customers = catalog.loadTableFromCSV("customers", "customers.csv");
TableMetadata products = catalog.loadTableFromCSV("products", "products.csv");
// Query metadata
System.out.println("Customers: " + customers.getRowCount() + " rows");
System.out.println("Schema: " + customers.getSchema());
ColumnStats cityStats = customers.getColumnStats("city");
System.out.println("City NDV: " + cityStats.numDistinctValues());
For the plan itself, we create a simple scan node (a stub, since the real LogicalScan comes in the next milestone) and annotate it manually:
LogicalNode plan = new SimpleLogicalScan("products");
long estimatedRows = products.getRowCount() / 3; // assume 1/3 selectivity
plan.setEstimatedRows(estimatedRows);
plan.setEstimatedCost(estimatedRows * 0.01);
System.out.println(plan.toPrettyString());
// └── Scan[products] [rows=2, cost=0.02]
And we can build an expression and render it:
Expression expr = new Expression.BinaryOp(
Expression.BinaryOp.Operator.EQ,
new Expression.ColumnRef("products", "category"),
new Expression.Literal<>("Electronics")
);
System.out.println(expr.toSQLString());
// (products.category = 'Electronics')
This might seem like a small amount of output for a first milestone, but the value is in what it proves: the catalog loads data and computes statistics, expressions can be built and rendered, plan nodes carry annotations and print themselves. Every layer above this - parsing, optimization, execution - will rely on these foundations.
What production systems do differently
It's worth pausing to acknowledge where our catalog is simplified compared to real systems:
Metadata versioning and caching. In PostgreSQL, the catalog is itself a set of system tables (pg_class, pg_attribute, pg_stats) that can be queried and that evolve with DDL operations. Statistics are computed asynchronously by the autovacuum daemon and cached. Our catalog recomputes everything on load - fine for small CSV files, impractical for a real database.
Type system and semantic analysis. Production optimizers have a rich type system with implicit casts, function overloading, type inference, and constraint propagation. Our three types and simple parse method skip all of that.
Expression complexity. Real SQL supports arithmetic in expressions (price * 1.1), function calls (UPPER(name)), CASE expressions, subqueries, and more. We deliberately restrict to column references, literals, and comparisons. Adding these later is straightforward because the Expression interface is extensible - you'd just add new record types.
Distributed metadata. In systems like CockroachDB or TiDB, the catalog is replicated across nodes with consistency guarantees. Our single HashMap works because we run in a single JVM.
These simplifications are the right trade-offs for a teaching project. The architecture - a central catalog that other components query, an expression tree shared across layers, annotated plan nodes - is exactly what production systems use. The sophistication grows within these same abstractions.
What's next
We have a catalog that knows about tables and their statistics, an expression system that can represent predicates and evaluate them, and a set of interfaces that define the contracts between layers. In the next post, we'll build the parser and the logical plan builder, and see our first SQL queries transformed into relational algebra trees.