अभिव्यक्ति

Building a simple Query Optimizer from scratch: Part 2

From SQL to Canonical Logical Plans

This series

Code for this series

This is Part 2 in a 6-part series on building a query optimizer from scratch in Java. Part 1 covered the catalog, expression system, and core interfaces.

In the previous post we built the foundation: a catalog that knows about tables and their statistics, an expression system for representing predicates, and the abstract interfaces that the optimizer will be built around. But we tested everything with a hardcoded plan - there was no way to turn actual SQL into a plan tree.

In this post, we'll build a parser and a logical plan builder. By the end, you'll be able to hand the system a SQL string like SELECT name, price FROM products WHERE category = 'Electronics' AND price > 100 and get back a clean, canonical logical plan tree that's ready for optimization.

Defining the SQL subset

Before writing a parser, you have to decide what it will accept. This is where a teaching project differs most sharply from a production system. We deliberately support a narrow slice of SQL:

SELECT column1, column2, COUNT(*), SUM(column3)
FROM table1
  INNER JOIN table2 ON table1.id = table2.fk
  INNER JOIN table3 ON table2.id = table3.fk
WHERE column1 = 'value'
  AND column2 > 100
  AND table1.id = table2.id
GROUP BY column1, column2

What's explicitly not supported (documented as future work): subqueries, UNION, DISTINCT, outer joins, complex expressions (arithmetic, functions, CASE), ORDER BY, LIMIT, HAVING, and joins over more than three tables.

Narrowing the scope like this isn't cutting corners - it's a deliberate trade-off. Every unsupported feature would add complexity to the parser, the AST, the logical plan builder, the rule engine, and the executor. By keeping the surface small, we can focus on getting the core right: the architecture, the canonical form, and the optimization pipeline.

The Abstract Syntax Tree

The parser's job is to turn a SQL string into a structured tree - the AST - that captures the syntactic structure of the query without any semantic interpretation. The AST doesn't know about table schemas or column types; it just faithfully represents what the user wrote.

We define the AST as a set of Java records nested inside an AST interface. Records are a natural fit: AST nodes are immutable data containers, and records give us structural equality and destructuring for free.

The top-level structure

public interface AST {
    record SelectStmt(
        List<SelectItem> selectItems,
        FromClause from,
        Expr whereClause,
        List<String> groupByColumns
    ) implements AST {
        public boolean hasWhere() { return whereClause != null; }
        public boolean hasGroupBy() { return !groupByColumns.isEmpty(); }
    }
}

A SelectStmt is our root node. It captures the four clauses of our supported SQL: what to select, where to read from, how to filter, and how to group.

SELECT items

The SELECT list can contain plain column references or aggregate function calls. These two cases get their own types:

interface SelectItem extends AST {
    String getAlias();
}

record ColumnSelectItem(String tableName, String columnName) implements SelectItem {
    public static ColumnSelectItem from(String columnName) {
        return new ColumnSelectItem(null, columnName);
    }

    @Override
    public String getAlias() { return columnName; }
}

record AggregateSelectItem(AggFunc function, String columnName) implements SelectItem {
    public enum AggFunc { COUNT, SUM, AVG, MIN, MAX }

    @Override
    public String getAlias() {
        return function.name().toLowerCase() + "_" + columnName;
    }
}

ColumnSelectItem can be qualified (c.name, with tableName = "c") or unqualified (name, with tableName = null). The getAlias method provides a default output name - this becomes the column name in the logical plan's Project node.

FROM clause

The FROM clause is either a single table or a chain of joins:

interface FromClause extends AST {
    List<String> getTableNames();
}

record TableRef(String tableName, String alias) implements FromClause {
    public String getEffectiveName() {
        return alias != null ? alias : tableName;
    }

    @Override
    public List<String> getTableNames() {
        return Collections.singletonList(tableName);
    }
}

record JoinClause(
    FromClause left, FromClause right,
    JoinType joinType, Expr condition
) implements FromClause {
    public enum JoinType { INNER }

    @Override
    public List<String> getTableNames() {
        List<String> names = new ArrayList<>();
        names.addAll(left.getTableNames());
        names.addAll(right.getTableNames());
        return names;
    }
}

JoinClause is recursive: its left and right are themselves FromClause nodes, so a three-way join like FROM a JOIN b ON ... JOIN c ON ... becomes a left-deep tree of JoinClause nodes. The getTableNames method recursively collects all table names - useful later when the optimizer needs to determine which tables participate in a subtree.
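To make the recursion concrete, here's a minimal standalone sketch of the FromClause hierarchy (stripped of aliases, join types, and conditions, so the names are simplified versions of the post's types) showing how getTableNames walks a left-deep tree:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class FromDemo {
    interface FromClause { List<String> getTableNames(); }

    record TableRef(String tableName) implements FromClause {
        public List<String> getTableNames() { return Collections.singletonList(tableName); }
    }

    record JoinClause(FromClause left, FromClause right) implements FromClause {
        public List<String> getTableNames() {
            // Recursively gather names from both subtrees, left first
            List<String> names = new ArrayList<>(left.getTableNames());
            names.addAll(right.getTableNames());
            return names;
        }
    }

    public static void main(String[] args) {
        // FROM a JOIN b ON ... JOIN c ON ...  =>  left-deep tree
        FromClause from = new JoinClause(
            new JoinClause(new TableRef("a"), new TableRef("b")),
            new TableRef("c"));
        System.out.println(from.getTableNames()); // prints [a, b, c]
    }
}
```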

Expressions

The AST has its own expression types, separate from the Expression interface in the logical layer:

interface Expr extends AST {}

record ColumnExpr(String tableName, String columnName) implements Expr {
    public static Expr from(String columnName) {
        return new ColumnExpr(null, columnName);
    }
}

record LiteralExpr<T extends Comparable<? super T>>(T value) implements Expr {}

record BinaryExpr(Op operator, Expr left, Expr right) implements Expr {
    public enum Op {
        EQ("="), NEQ("!="), GT(">"), GTE(">="), LT("<"), LTE("<="),
        AND("AND"), OR("OR");
        // ...
    }
}

You might wonder why we have two parallel expression hierarchies - AST.ColumnExpr and Expression.ColumnRef, AST.BinaryExpr and Expression.BinaryOp. The reason is separation of concerns. The AST types capture syntax: they're what the parser produces. The Expression types capture semantics: they're what the optimizer and executor work with. The LogicalPlanBuilder converts between the two. In a larger system, the semantic layer would do type checking, resolve column references against schemas, and handle implicit casts - things that don't belong in the parser.

The parser

We use a hand-written recursive descent parser. For our narrow SQL subset, this is simpler and more transparent than a parser generator like ANTLR. The parser has two phases: tokenization and recursive descent.

Tokenization

The tokenizer uses a single regex to split the SQL string into tokens:

private List<String> tokenize(String sql) {
    List<String> tokens = new ArrayList<>();
    Pattern pattern = Pattern.compile(
        "'[^']*'|" +              // Single-quoted strings
        "\\d+\\.?\\d*|" +         // Numbers (int or float)
        "[a-zA-Z_][a-zA-Z0-9_]*|" + // Identifiers
        "!=|>=|<=|" +             // Two-char operators
        "[().,=<>*]"              // Single-char operators/delimiters
    );

    Matcher matcher = pattern.matcher(sql);
    while (matcher.find()) {
        tokens.add(matcher.group());
    }
    return tokens;
}

Each alternative in the regex matches a different kind of token. The order matters: '[^']*' must come first so that keywords inside quoted strings aren't mistakenly matched as identifiers. Two-character operators (!=, >=, <=) come before single-character ones so that >= isn't split into > and =.

A regex-based tokenizer has limitations - it can't handle escaped quotes, nested strings, or Unicode identifiers - but for our subset, it works cleanly and produces a flat list of tokens that the parser can walk through.
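To see the alternation order in action, here's the tokenizer wrapped in a small runnable class (the class name is just for the demo). Note how the quoted string survives as one token, spaces and all, and `>=` stays whole:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenizerDemo {
    static List<String> tokenize(String sql) {
        List<String> tokens = new ArrayList<>();
        Pattern pattern = Pattern.compile(
            "'[^']*'|" +                // single-quoted strings, matched first
            "\\d+\\.?\\d*|" +           // numbers (int or float)
            "[a-zA-Z_][a-zA-Z0-9_]*|" + // identifiers and keywords
            "!=|>=|<=|" +               // two-char operators before single-char ones
            "[().,=<>*]");              // single-char operators/delimiters

        Matcher matcher = pattern.matcher(sql);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("WHERE price >= 10.5 AND name = 'USB Cable'"));
        // prints [WHERE, price, >=, 10.5, AND, name, =, 'USB Cable']
    }
}
```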

Parser infrastructure

The parser maintains a position into the token list and provides four helper methods:

private List<String> tokens;
private int position;

private String peek() {
    return position < tokens.size() ? tokens.get(position) : null;
}

private String consume() {
    if (position >= tokens.size())
        throw new IllegalStateException("Unexpected end of query");
    return tokens.get(position++);
}

private boolean match(String expected) {
    String token = peek();
    return token != null && token.equalsIgnoreCase(expected);
}

private void expect(String expected) {
    String token = consume();
    if (!token.equalsIgnoreCase(expected))
        throw new IllegalArgumentException(
            "Expected '" + expected + "' but got '" + token + "'");
}

peek looks at the next token without consuming it - essential for decisions like "is this a column reference or an aggregate function?". match is a convenience for checking the next token against an expected value. expect consumes and asserts - used for tokens that must be there (like FROM after the select list).
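A quick walk-through of the four helpers on a token list shows the key behavioral details, especially that match and expect are case-insensitive while peek never advances the cursor (the class name here is just for the demo):

```java
import java.util.List;

public class CursorDemo {
    private final List<String> tokens;
    private int position;

    CursorDemo(List<String> tokens) { this.tokens = tokens; }

    String peek() { return position < tokens.size() ? tokens.get(position) : null; }

    String consume() {
        if (position >= tokens.size())
            throw new IllegalStateException("Unexpected end of query");
        return tokens.get(position++);
    }

    boolean match(String expected) {
        String token = peek();
        return token != null && token.equalsIgnoreCase(expected);
    }

    void expect(String expected) {
        String token = consume();
        if (!token.equalsIgnoreCase(expected))
            throw new IllegalArgumentException(
                "Expected '" + expected + "' but got '" + token + "'");
    }

    public static void main(String[] args) {
        CursorDemo c = new CursorDemo(List.of("SELECT", "name", "FROM", "products"));
        c.expect("select");                  // case-insensitive; consumes SELECT
        System.out.println(c.consume());     // prints name
        System.out.println(c.match("FROM")); // prints true (FROM not yet consumed)
        c.expect("FROM");
        System.out.println(c.peek());        // prints products
    }
}
```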

Parsing SELECT

The entry point mirrors the SQL grammar:

private AST.SelectStmt parseSelect() {
    expect("SELECT");
    List<AST.SelectItem> selectItems = parseSelectList();

    expect("FROM");
    AST.FromClause from = parseFrom();

    AST.Expr whereClause = null;
    if (match("WHERE")) {
        consume();
        whereClause = parseExpression();
    }

    List<String> groupBy = new ArrayList<>();
    if (match("GROUP")) {
        consume(); // GROUP
        expect("BY");
        groupBy = parseColumnList();
    }

    return new AST.SelectStmt(selectItems, from, whereClause, groupBy);
}

This reads almost like the grammar itself: expect SELECT, parse a select list, expect FROM, parse the from clause, optionally parse WHERE, optionally parse GROUP BY. Each parse* method returns the corresponding AST node.

Parsing select items

Each item in the select list is either an aggregate function call (COUNT(*), SUM(price)) or a column reference (name, c.name):

private AST.SelectItem parseSelectItem() {
    String token = peek();

    if (isAggregateFunction(token)) {
        String funcName = consume();
        expect("(");
        String column = consume();  // column name or *
        expect(")");
        AST.AggregateSelectItem.AggFunc func =
            AST.AggregateSelectItem.AggFunc.valueOf(funcName.toUpperCase());
        return new AST.AggregateSelectItem(func, column);
    }

    // Regular column reference
    String first = consume();
    if (match(".")) {
        consume();
        String column = consume();
        return new AST.ColumnSelectItem(first, column);
    }
    return AST.ColumnSelectItem.from(first);
}

The aggregate check has to come first: when we see COUNT, we need to decide immediately that this is a function call, not a column named "count". The isAggregateFunction helper checks against the five supported function names.

Parsing FROM and JOINs

The FROM clause starts with a table reference and can be followed by zero or more INNER JOIN clauses. Each join gets folded into the left side, producing a left-deep tree:

private AST.FromClause parseFrom() {
    AST.FromClause left = parseTableRef();

    while (match("INNER") || match("JOIN")) {
        if (match("INNER")) consume();
        expect("JOIN");

        AST.FromClause right = parseTableRef();
        expect("ON");
        AST.Expr condition = parseExpression();

        left = new AST.JoinClause(left, right,
            AST.JoinClause.JoinType.INNER, condition);
    }
    return left;
}

For a query like FROM a JOIN b ON ... JOIN c ON ..., the first iteration creates JoinClause(TableRef("a"), TableRef("b"), ...), and the second iteration wraps that as JoinClause(JoinClause(...), TableRef("c"), ...). This left-deep structure matches how most SQL implementations process joins.

Table references can optionally have aliases:

private AST.TableRef parseTableRef() {
    String tableName = consume();
    String alias = null;

    if (!match("INNER") && !match("JOIN") && !match("ON") &&
            !match("WHERE") && !match("GROUP") && !match(",")
            && peek() != null) {
        String next = peek();
        if (!isKeyword(next)) {
            alias = consume();
        }
    }
    return new AST.TableRef(tableName, alias);
}

The alias logic is a bit of a lookahead puzzle: after reading a table name, we check whether the next token is a keyword or delimiter. If it's not, it must be an alias. This is one of those places where a hand-written parser requires a little care, but it's still straightforward compared to the alternatives.

Parsing expressions

Expressions are parsed with the classic recursive descent approach, where each precedence level gets its own method:

private AST.Expr parseExpression() {
    return parseOrExpression();
}

private AST.Expr parseOrExpression() {
    AST.Expr left = parseAndExpression();
    while (match("OR")) {
        consume();
        AST.Expr right = parseAndExpression();
        left = new AST.BinaryExpr(AST.BinaryExpr.Op.OR, left, right);
    }
    return left;
}

private AST.Expr parseAndExpression() {
    AST.Expr left = parseComparisonExpression();
    while (match("AND")) {
        consume();
        AST.Expr right = parseComparisonExpression();
        left = new AST.BinaryExpr(AST.BinaryExpr.Op.AND, left, right);
    }
    return left;
}

OR has the lowest precedence, then AND, then comparisons. This means a = 1 AND b = 2 OR c = 3 parses as (a = 1 AND b = 2) OR c = 3, which matches standard SQL semantics.
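You can see the precedence fall out of the method nesting with a stripped-down version of these two methods that builds a fully parenthesized string instead of an AST (the comparison parser here blindly consumes three tokens, which is a simplification for the demo):

```java
import java.util.List;

public class PrecedenceDemo {
    private final List<String> tokens;
    private int position;

    PrecedenceDemo(List<String> tokens) { this.tokens = tokens; }

    private String peek() { return position < tokens.size() ? tokens.get(position) : null; }
    private String consume() { return tokens.get(position++); }

    String parseOr() {
        String left = parseAnd();               // AND binds tighter: parsed first
        while ("OR".equalsIgnoreCase(peek())) {
            consume();
            left = "(" + left + " OR " + parseAnd() + ")";
        }
        return left;
    }

    private String parseAnd() {
        String left = parseComparison();
        while ("AND".equalsIgnoreCase(peek())) {
            consume();
            left = "(" + left + " AND " + parseComparison() + ")";
        }
        return left;
    }

    private String parseComparison() {
        // Simplified: assume every comparison is exactly  operand op operand
        return "(" + consume() + " " + consume() + " " + consume() + ")";
    }

    public static void main(String[] args) {
        var p = new PrecedenceDemo(
            List.of("a", "=", "1", "AND", "b", "=", "2", "OR", "c", "=", "3"));
        System.out.println(p.parseOr());
        // prints (((a = 1) AND (b = 2)) OR (c = 3))
    }
}
```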

The comparison parser handles the six comparison operators:

private AST.Expr parseComparisonExpression() {
    AST.Expr left = parsePrimaryExpression();
    String op = peek();

    if (op != null) {
        AST.BinaryExpr.Op operator = switch (op) {
            case "="  -> AST.BinaryExpr.Op.EQ;
            case "!=" -> AST.BinaryExpr.Op.NEQ;
            case ">"  -> AST.BinaryExpr.Op.GT;
            case ">=" -> AST.BinaryExpr.Op.GTE;
            case "<"  -> AST.BinaryExpr.Op.LT;
            case "<=" -> AST.BinaryExpr.Op.LTE;
            default   -> null;  // not a comparison; leave the token for the caller
        };

        if (operator != null) {
            consume();
            AST.Expr right = parsePrimaryExpression();
            return new AST.BinaryExpr(operator, left, right);
        }
    }
    return left;
}

And the primary expression parser handles the leaf cases - literals, column references, and parenthesized sub-expressions:

private AST.Expr parsePrimaryExpression() {
    String token = peek();
    if (token == null)
        throw new IllegalStateException("Unexpected end of expression");

    if (match("(")) {
        consume();
        AST.Expr expr = parseExpression();
        expect(")");
        return expr;
    }

    if (token.startsWith("'")) {
        consume();
        String value = token.substring(1, token.length() - 1);
        return new AST.LiteralExpr<>(value);
    }

    if (token.matches("\\d+\\.\\d+")) {
        consume();
        return new AST.LiteralExpr<>(Float.parseFloat(token));
    }

    if (token.matches("\\d+")) {
        consume();
        return new AST.LiteralExpr<>(Integer.parseInt(token));
    }

    // Must be a column reference
    String first = consume();
    if (match(".")) {
        consume();
        String column = consume();
        return new AST.ColumnExpr(first, column);
    }
    return AST.ColumnExpr.from(first);
}

The ordering of checks matters: quoted strings are checked first (they start with '), then floats (contain a dot), then integers, and finally identifiers. A qualified column like c.name is distinguished from a bare column by peeking for a . after consuming the first identifier.

Logical operators

With the parser producing ASTs, we need the logical plan nodes that the plan builder will assemble. Each one is a concrete subclass of the LogicalNode base class we defined in Part 1.

LogicalScan

The leaf of every plan tree - it reads all rows from a named table:

public class LogicalScan extends LogicalNode {
    private final String tableName;

    public LogicalScan(String tableName) {
        this.tableName = tableName;
    }

    @Override
    public List<LogicalNode> getChildren() {
        return Collections.emptyList();
    }

    @Override
    public LogicalNode withChildren(List<LogicalNode> children) {
        if (!children.isEmpty())
            throw new IllegalArgumentException("Scan cannot have children");
        return this;
    }

    @Override
    public String describe() { return "Scan[" + tableName + "]"; }
}

A scan has no children - it's where data enters the plan.

LogicalFilter

Applies a single predicate to its input:

public class LogicalFilter extends LogicalNode {
    private final Expression predicate;
    private final LogicalNode child;

    public LogicalFilter(Expression predicate, LogicalNode child) {
        this.predicate = predicate;
        this.child = child;
    }

    @Override
    public List<LogicalNode> getChildren() {
        return Collections.singletonList(child);
    }

    @Override
    public LogicalNode withChildren(List<LogicalNode> children) {
        if (children.size() != 1)
            throw new IllegalArgumentException("Filter must have exactly one child");
        return new LogicalFilter(predicate, children.get(0));
    }

    @Override
    public String describe() {
        return "Filter[" + predicate.toSQLString() + "]";
    }
}

One invariant is worth stating up front: in canonical form, each Filter node holds exactly one predicate. This is a critical design choice that we'll discuss shortly.

LogicalJoin

Combines two inputs on a condition:

public class LogicalJoin extends LogicalNode {
    public enum JoinType { INNER }

    private final LogicalNode left;
    private final LogicalNode right;
    private final JoinType joinType;
    private final Expression condition;

    @Override
    public List<LogicalNode> getChildren() {
        return Arrays.asList(left, right);
    }

    @Override
    public LogicalNode withChildren(List<LogicalNode> children) {
        if (children.size() != 2)
            throw new IllegalArgumentException("Join must have exactly two children");
        return new LogicalJoin(children.get(0), children.get(1), joinType, condition);
    }

    @Override
    public String describe() {
        return "Join[" + joinType + ", " + condition.toSQLString() + "]";
    }
}

Joins have two children and preserve their condition and type through withChildren. This matters when the rule engine rearranges a join's children - the join condition stays attached.

LogicalProject

Selects specific columns from its input:

public class LogicalProject extends LogicalNode {
    private final List<Expression> projections;
    private final List<String> columnNames;
    private final LogicalNode child;

    // ... constructor, getters ...

    @Override
    public String describe() {
        StringBuilder sb = new StringBuilder("Project[");
        for (int i = 0; i < columnNames.size(); i++) {
            if (i > 0) sb.append(", ");
            sb.append(columnNames.get(i));
        }
        sb.append("]");
        return sb.toString();
    }
}

The projections list contains expressions (usually ColumnRefs) and columnNames provides the output names. Both lists are parallel - projections.get(i) computes the value for columnNames.get(i).

LogicalAggregate

Handles GROUP BY with aggregate functions:

public class LogicalAggregate extends LogicalNode {
    public enum AggFunction { COUNT, SUM, AVG, MIN, MAX }

    public record AggregateOp(
        AggFunction function, String inputColumn, String outputColumn
    ) {}

    private final List<String> groupByColumns;
    private final List<AggregateOp> aggregateOps;
    private final LogicalNode child;

    @Override
    public String describe() {
        StringBuilder sb = new StringBuilder("Aggregate[");
        if (!groupByColumns.isEmpty()) {
            sb.append("GROUP BY: ").append(String.join(", ", groupByColumns));
        }
        if (!aggregateOps.isEmpty()) {
            if (!groupByColumns.isEmpty()) sb.append("; ");
            sb.append("AGGS: ");
            for (int i = 0; i < aggregateOps.size(); i++) {
                if (i > 0) sb.append(", ");
                sb.append(aggregateOps.get(i));
            }
        }
        sb.append("]");
        return sb.toString();
    }
}

The AggregateOp record packages each aggregate function call: the function itself (COUNT, SUM, etc.), the input column, and the output column name. For COUNT(*), the input column is literally "*".

The LogicalPlanBuilder

The plan builder is where the AST meets the logical layer. It walks the AST and constructs a tree of logical operators, enforcing canonical form along the way.

The build method

The top-level build method follows the standard SQL evaluation order - FROM first, then WHERE, then GROUP BY, then SELECT:

public class LogicalPlanBuilder {
    private final Catalog catalog;

    public LogicalPlanBuilder(Catalog catalog) {
        this.catalog = catalog;
    }

    public LogicalNode build(AST.SelectStmt stmt) {
        // Step 1: Build FROM clause (scans and joins)
        LogicalNode plan = buildFrom(stmt.from());

        // Step 2: Add WHERE predicates
        if (stmt.hasWhere()) {
            plan = buildFilters(stmt.whereClause(), plan);
        }

        // Step 3: Add aggregation if present
        if (stmt.hasGroupBy() || hasAggregates(stmt.selectItems())) {
            plan = buildAggregate(stmt, plan);
        }

        // Step 4: Add projection for SELECT list
        plan = buildProject(stmt.selectItems(), plan);

        return plan;
    }
}

Each step wraps the plan built so far with another operator: scans at the bottom, then filters, then aggregation, then projection on top. This mirrors the conceptual evaluation order of SQL - even though the optimizer will later rearrange things.

Building FROM

The buildFrom method recursively converts the FROM clause AST into scan and join nodes:

private LogicalNode buildFrom(AST.FromClause from) {
    if (from instanceof AST.TableRef tableRef) {
        return new LogicalScan(tableRef.tableName());
    } else if (from instanceof AST.JoinClause join) {
        LogicalNode left = buildFrom(join.left());
        LogicalNode right = buildFrom(join.right());
        Expression condition = convertExpression(join.condition());
        return new LogicalJoin(left, right, LogicalJoin.JoinType.INNER, condition);
    }
    throw new IllegalArgumentException("Unknown FROM clause type");
}

This is where the AST's Expr types get converted to the logical layer's Expression types. The convertExpression method handles the translation:

private Expression convertExpression(AST.Expr expr) {
    return switch (expr) {
        case AST.ColumnExpr col -> {
            if (col.tableName() != null)
                yield new Expression.ColumnRef(col.tableName(), col.columnName());
            else
                yield Expression.ColumnRef.from(col.columnName());
        }
        case AST.LiteralExpr<?> lit ->
            new Expression.Literal(lit.value());
        case AST.BinaryExpr binary -> {
            Expression left = convertExpression(binary.left());
            Expression right = convertExpression(binary.right());
            Expression.BinaryOp.Operator op = convertOperator(binary.operator());
            yield new Expression.BinaryOp(op, left, right);
        }
        default -> throw new IllegalArgumentException("Unknown expression type");
    };
}

This uses Java's pattern matching switch to dispatch on the AST expression type, creating the corresponding Expression nodes. It's a straightforward structural transformation - the tree shape is preserved, only the node types change.

Canonical form: splitting AND predicates

This is the most important design decision in the plan builder. When the WHERE clause contains AND, we don't create a single LogicalFilter with a compound predicate. Instead, we split the AND into separate LogicalFilter nodes, one per conjunct:

private LogicalNode buildFilters(AST.Expr whereExpr, LogicalNode child) {
    List<AST.Expr> predicates = splitAndPredicates(whereExpr);

    LogicalNode result = child;
    for (AST.Expr predicate : predicates) {
        Expression expr = convertExpression(predicate);
        result = new LogicalFilter(expr, result);
    }
    return result;
}

private List<AST.Expr> splitAndPredicates(AST.Expr expr) {
    List<AST.Expr> result = new ArrayList<>();

    if (expr instanceof AST.BinaryExpr binary) {
        if (binary.operator() == AST.BinaryExpr.Op.AND) {
            result.addAll(splitAndPredicates(binary.left()));
            result.addAll(splitAndPredicates(binary.right()));
            return result;
        }
    }

    result.add(expr);
    return result;
}

So WHERE city = 'Seattle' AND age > 30 AND name = 'Alice' becomes three stacked LogicalFilter nodes, not one. This is what we mean by canonical form.
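The splitting logic is easy to exercise in isolation. Here's a self-contained sketch using a stripped-down two-variant expression type (the names And and Leaf are demo stand-ins for the post's AST.BinaryExpr and its other Expr nodes):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {
    sealed interface Expr permits And, Leaf {}
    record And(Expr left, Expr right) implements Expr {}
    record Leaf(String predicate) implements Expr {}

    static List<Expr> splitAndPredicates(Expr expr) {
        List<Expr> result = new ArrayList<>();
        if (expr instanceof And and) {
            // Recurse into both sides so nested ANDs flatten completely
            result.addAll(splitAndPredicates(and.left()));
            result.addAll(splitAndPredicates(and.right()));
        } else {
            // OR trees and comparisons stay intact as a single conjunct
            result.add(expr);
        }
        return result;
    }

    public static void main(String[] args) {
        // (city = 'Seattle' AND age > 30) AND name = 'Alice'
        // -- left-deep, exactly as the parser builds it
        Expr where = new And(
            new And(new Leaf("city = 'Seattle'"), new Leaf("age > 30")),
            new Leaf("name = 'Alice'"));
        System.out.println(splitAndPredicates(where).size()); // prints 3
    }
}
```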

Why does this matter? Consider what the predicate pushdown rule needs to do. It looks for a Filter above a Join and decides whether the filter's predicate references only the left side, only the right side, or both. If we had a single compound filter with city = 'Seattle' AND o.total > 100, the rule would have to tear apart the AND, analyze each piece, push some parts down and keep others. That's complex and error-prone.

With canonical form, each filter has exactly one predicate. The pushdown rule can examine each filter independently - one clean pattern match, one simple decision. Filters that can be pushed are pushed; those that can't stay put. Later, after all pushing is done, a separate FilterMerge rule recombines adjacent filters back into AND predicates for efficient execution. The split-then-merge lifecycle keeps each rule simple.

Here's what a query with two predicates looks like after plan building:

SQL: SELECT name, age FROM customers WHERE city = 'Seattle' AND age > 30

|-- Project[name, age]
    |-- Filter[(age > 30)]
        |-- Filter[(city = 'Seattle')]
            |-- Scan[customers]

Two separate Filter nodes, each with one predicate. Compare this to what a non-canonical form would look like:

|-- Project[name, age]
    |-- Filter[(city = 'Seattle') AND (age > 30)]
        |-- Scan[customers]

The canonical form has more nodes, but each node is simpler, and - critically - each node can be moved independently by optimization rules.

Building projections and aggregates

The remaining build methods are straightforward. The projection builder extracts column references and output names from the select list:

private LogicalNode buildProject(List<AST.SelectItem> selectItems, LogicalNode child) {
    List<Expression> projections = new ArrayList<>();
    List<String> columnNames = new ArrayList<>();

    for (AST.SelectItem item : selectItems) {
        if (item instanceof AST.ColumnSelectItem colItem) {
            Expression.ColumnRef colRef;
            if (colItem.tableName() != null)
                colRef = new Expression.ColumnRef(colItem.tableName(), colItem.columnName());
            else
                colRef = Expression.ColumnRef.from(colItem.columnName());
            projections.add(colRef);
            columnNames.add(colItem.getAlias());
        } else if (item instanceof AST.AggregateSelectItem aggItem) {
            Expression.ColumnRef colRef = Expression.ColumnRef.from(aggItem.getAlias());
            projections.add(colRef);
            columnNames.add(aggItem.getAlias());
        }
    }

    return new LogicalProject(projections, columnNames, child);
}

Aggregate items in the SELECT list are handled specially: the actual aggregation happens in the LogicalAggregate node below, so the projection just references the aggregate's output column by its alias.

The aggregate builder extracts GROUP BY columns and aggregate operations from the AST:

private LogicalNode buildAggregate(AST.SelectStmt stmt, LogicalNode child) {
    List<String> groupByColumns = stmt.groupByColumns();
    List<LogicalAggregate.AggregateOp> aggOps = new ArrayList<>();

    for (AST.SelectItem item : stmt.selectItems()) {
        if (item instanceof AST.AggregateSelectItem aggItem) {
            LogicalAggregate.AggFunction function =
                convertAggFunction(aggItem.function());
            aggOps.add(new LogicalAggregate.AggregateOp(
                function, aggItem.columnName(), aggItem.getAlias()));
        }
    }

    return new LogicalAggregate(groupByColumns, aggOps, child);
}

End-to-end examples

Let's trace a few queries through the full pipeline to see what comes out.

Simple filter and projection

SQL: SELECT name, price FROM products WHERE category = 'Electronics'

|-- Project[name, price]
    |-- Filter[(category = 'Electronics')]
        |-- Scan[products]

One scan, one filter, one projection - clean and direct.

Join with multiple predicates

SQL: SELECT c.name, o.total FROM customers c
     INNER JOIN orders o ON c.id = o.customer_id
     WHERE c.city = 'Seattle' AND o.total > 100

|-- Project[name, total]
    |-- Filter[(o.total > 100)]
        |-- Filter[(c.city = 'Seattle')]
            |-- Join[INNER, (c.id = o.customer_id)]
                |-- Scan[customers]
                |-- Scan[orders]

The two WHERE predicates become separate Filter nodes above the Join. This is exactly the setup the predicate pushdown rule needs: in Part 3, Filter[(c.city = 'Seattle')] will be pushed below the join to sit directly above Scan[customers], and Filter[(o.total > 100)] will be pushed to sit above Scan[orders].

Multi-way join

SQL: SELECT p.name, o.total, c.name
     FROM products p
     INNER JOIN orders o ON p.id = o.product_id
     INNER JOIN customers c ON o.customer_id = c.id
     WHERE p.category = 'Electronics'

|-- Project[name, total, name]
    |-- Filter[(p.category = 'Electronics')]
        |-- Join[INNER, (o.customer_id = c.id)]
            |-- Join[INNER, (p.id = o.product_id)]
            |   |-- Scan[products]
            |   |-- Scan[orders]
            |-- Scan[customers]

The left-deep join structure is visible: products joins orders first, and that result joins customers. The filter on p.category sits above everything - the optimizer will push it all the way down to just above Scan[products].

Aggregation

SQL: SELECT category, COUNT(*), AVG(price) FROM products GROUP BY category

|-- Project[category, count_*, avg_price]
    |-- Aggregate[GROUP BY: category; AGGS: COUNT(*) AS count_*, AVG(price) AS avg_price]
        |-- Scan[products]

The Aggregate node sits between the Scan and the Project, computing groups and aggregates. The Project on top simply selects the output columns.

What production systems do differently

The gap between our parser/planner and a production system is significant, but the architecture is the same. Here's where real systems invest more:

Full SQL support with semantic analysis. A production planner resolves column names against schemas (is id from the customers table or orders?), checks types (can you compare an integer to a string?), expands views, and handles dozens of SQL features we've skipped. Apache Calcite's SqlValidatorImpl alone is thousands of lines.

Type checking and implicit casts. When you write WHERE price > 100, a production system figures out that price is FLOAT and 100 is INTEGER, inserts an implicit cast, and propagates the result type. Our system sidesteps this by doing runtime comparison in the Expression.BinaryOp.evaluate method.

Query rewrite at the SQL level. Before even building a logical plan, production systems often rewrite the query: expand views, unnest subqueries, simplify IN lists, fold constant expressions. These are essentially optimizations that operate on the AST rather than the logical plan.

Plan validation. A production planner validates that the plan is semantically correct: all referenced columns exist, aggregate functions appear only with GROUP BY, join conditions reference the right tables. Our plan builder trusts the input and will produce garbage if given invalid SQL.

Despite these differences, the core idea is identical: parse SQL into a tree, convert that tree into a canonical logical form, and hand it off to the optimizer. The canonical form - especially the split-AND pattern - is the key insight that makes rule-based optimization tractable.

What's next

We now have a complete front end: SQL goes in, canonical logical plans come out. The plans are correct but unoptimized - filters sit above joins instead of below them, and the join order is whatever the user wrote in the FROM clause. In the next post, we'll build the rule engine that transforms these plans into more efficient arrangements, starting with predicate pushdown, projection pushdown, and greedy join reordering.

Next up: Part 3 - A Composable Rule Engine and Logical Rewrites