Query Optimizer

Query optimizer is a database component that optimizes the execution of the query plans according to heuristic or cost based rules.

There are two high-level strategies for query optimization.

  • Static or Heuristics rules : They match portions of the query with known patterns to assemble a plan. These rules transform the query to remove inefficiencies. Although these rules may require consultation of the catalog to understand the structure of the data, they never need to examine the data itself.

  • Perform filters as early as possible (predicate pushdown).

  • Select only those columns which are necessary (projection pushdown).

  • Split sub-queries into individual composable groups and focus on optimizing those (Subquery optimization).

  • Nested sub-queries rewrite. Basically flatting the sub-query to remove the additional level of nesting.

  • Cost based rules : As opposed to rules based, what better you can do is determine the cost of each plan depending on various factors like data distribution of your records, cardinality, etc to find out the best optimal plan on the fly.

Query Plan architecture

Logical vs Physical Plans:

  • Logical : The optimizer generates a mapping of a logical algebra expression to the optimal equivalent physical algebra
    expression. The logical plan is roughly equivalent to the relational algebra expressions in the query.
  • Physical : Physical operators define a specific execution strategy using an access path for the different operators in
    the query plan. Physical plans may depend on the physical format of the data that is processed (i.e. sorting,
    compression).
  • There does not always exist a one-to-one mapping from logical to physical plans.

Cost Estimations

  • Database systems use cost models to estimate the cost of executing a plan.
  • The cost of each query plan is evaluated based on few metrics :
    • CPU : small cost, but tough to estimate.
    • Disk I/O : number of random/sequential page accesses.
    • Memory : amount of DRAM/RAM (buffer pool usage) used.
    • Network : number of msgs sent over the wire.

Plan Enumeration

The query optimizer evaluates different plans for a query and selects the most optimal one after exhausting all the plans or after some timeout.

Single-Relation Query Plans

Picking the best access method for accessing the records/tuples :

  • Sequential Scan
  • Binary Search (clustered indexes)
  • Index Scan (Use BTree indexes to retrieve the records instead of doing random I/Os)

Multi-Relation Query Plans

When the SQL query is referring to multiple tables/joins.

  • Bottom-up (System R) :

  • Use static rules to perform the initial optimisation. Then use DP to determine the best join order for tables using a divide-and-conquer search method.

  • Break query up into blocks and generate the logical operators for each block.

  • For each logical operator, generate a set of physical operators that implement it

  • Then, iteratively construct a ”left-deep” tree that minimizes the estimated amount of work to execute the plan.

  • Examples : IBM System R, DB2, MySQL, Postgres, most open-source DBMSs.

  • Top-Down (Volcano Model) :

  • Start with a logical plan of what we want the query to be. Perform a branch and bound search to traverse the plan tree by covering logical operators into physical operators.

  • Keep track of global best plan during search.

Appendix