/* This file is part of VoltDB.
 * Copyright (C) 2008-2017 VoltDB Inc.
 *
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU Affero General Public License as
 * published by the Free Software Foundation, either version 3 of the
 * License, or (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU Affero General Public License for more details.
 *
 * You should have received a copy of the GNU Affero General Public License
 * along with VoltDB. If not, see <http://www.gnu.org/licenses/>.
 */

/**
 * <h1>What is the package org.voltdb.planner?</h1>
 * <p>This package contains the cost-based, flexible, distributed
 * planner/optimizer for VoltDB.</p>
 *
 * <p>
 * The execution of any query is some kind of scan. By <em>scan</em> we mean a loop,
 * or perhaps a nested loop. The most naive, and slowest, scan is a sequential scan of a table:
 * each row of the table is fetched, evaluated and written to the output or else skipped, depending
 * on the filters of the join conditions or the WHERE clause. A more sophisticated method is
 * to scan an index, which visits only a fraction of the rows. But it is still a scan.</p>
 *
 * <p>
 * The planner creates <em>plans</em>. A plan is a description of the execution of
 * a query. It may help to think of a plan as an executable program file, say an ELF or
 * a.out file. The plan contains the instructions needed to do the computation, but the
 * plan doesn't have any resources to do any calculations. The plan is just a data structure.</p>
 *
 * <p>
 * Each plan is a tree of <em>plan nodes</em>. Each plan node describes
 * some computation which operates on tables and produces a single table. The result table
 * and the argument tables may be persistent tables or they may be temporary tables. Temporary
 * tables created during the execution of a plan are destroyed after the execution of the
 * plan.</p>
 *
 * <p>
 * For example, a SEQUENTIAL_SCAN node describes a computation which operates on a
 * table, reads each row of the table and sends it to the output, possibly filtering
 * with a predicate or applying a limit or an offset. A NESTED_LOOP plan node describes
 * a join computation, either an inner or an outer join.
 * This join computation is given two tables which are scanned
 * in two nested loops. For each row of the left hand table, which we call the <em>outer</em>
 * table, we look at each row of the right hand table, which we call the <em>inner</em>
 * table. If the pair of rows satisfies the join condition, we write the row to the
 * output table.<br>
 * Note: The computation for NESTED_LOOP is somewhat more complicated
 * than this, as we may be able to skip the scan of the inner table, and there
 * may be other optimizations.<br>
 * Note also: The tree formed by a plan is not very similar to the parse tree formed
 * from a SQL or DDL statement. The plan relates to the SQL text approximately the
 * same way an assembly language program relates to a string of C++ program text.</p>
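 *
 * <p>
 * As a concrete illustration, the following Java-style sketch shows the
 * essence of the inner NESTED_LOOP computation just described. All of the
 * names here (Table, Row, joinPredicate and so on) are hypothetical, purely
 * for illustration; the real computation lives in the EE's C++ executor
 * classes and includes optimizations this sketch omits.</p>
 * <pre>{@code
 * // A naive inner nested loop join over two input tables.
 * Table output = new Table();
 * for (Row outerRow : outerTable) {          // the outer (left hand) table
 *     for (Row innerRow : innerTable) {      // the inner (right hand) table
 *         if (joinPredicate.test(outerRow, innerRow)) {
 *             // The pair satisfies the join condition: emit a combined row.
 *             output.addRow(Row.concat(outerRow, innerRow));
 *         }
 *     }
 * }
 * }</pre>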
 *
 * <p>
 * We say a plan node <em>describes</em> a computation because the plan node does
 * not actually have the code to do the computation. Plan nodes are generated by
 * the planner in the Java code in the VoltDB server. Plans are transmitted to the
 * Execution Engine, or EE, in the form of JSON objects, represented as strings.
 * The EE has its own definitions for all the plan nodes in C++. The
 * EE can reconstruct the plan in a C++ data structure from the JSON. The EE also
 * has a hierarchy of classes called <em>executor classes</em> which is parallel to
 * the plan node hierarchy. Executor class objects contain the code which does
 * the real computations of the plan nodes. Most dynamic data, such as table
 * data, are stored in the executor objects.</p>
 *
 * <p>
 * The children of a plan
 * node describe computations which produce temporary tables. The parents of a plan
 * node describe computations which consume the result of the plan node. So, child
 * nodes could also be called <em>producer</em> nodes, and parent nodes could also
 * be called <em>consumer</em> nodes.<br>
 * Note: A node may have only one parent node. A node may, therefore, produce only
 * one result table.</p>
 *
 * <p>The output schema of a plan node gives the set of columns which the computation
 * of the plan node will create. For each row which is not rejected, some computation
 * is done to the row, and the results are put into an output row.</p>
 *
 * <p>
 * Some operations can be computed on a row-wise basis. For example, if
 * a table has N columns, but only M &lt; N columns are used in the output, the projection
 * from N columns to M columns happens independently for each row.
 * Similarly, an output column can be a function of input columns. These
 * functions can be computed row-wise. Finally, most
 * aggregate functions can be computed row-wise. For example, the SUM
 * aggregate function needs only a running sum and values from the current row
 * to update its state. Not all operations can be computed row-wise;
 * a sort, for example, requires an entire table.
 * For row-wise operations, we avoid creating extra temporary
 * tables by combining scans. The row-by-row operations are placed in <em>inline</em>
 * nodes in a scan node. The inline nodes are applied to each row of
 * the scan which passes the filters during the scan. The inline nodes have
 * their own output columns, as discussed above.</p>
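 *
 * <p>
 * As a sketch of why inlining pays off, consider a sequential scan with an
 * inline PROJECTION and an inline LIMIT. The names below are hypothetical;
 * the point is that all of the row-wise work happens in one pass, with no
 * intermediate temporary table between the scan and its inline nodes.</p>
 * <pre>{@code
 * int emitted = 0;
 * Table output = new Table();
 * for (Row row : inputTable) {
 *     if (!scanPredicate.test(row)) {
 *         continue;                           // rejected by the scan's filter
 *     }
 *     Row projected = projection.apply(row);  // inline PROJECTION, row-wise
 *     output.addRow(projected);
 *     if (++emitted >= limit) {               // inline LIMIT, just a counter
 *         break;
 *     }
 * }
 * }</pre>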
 *
 * <p>
 * The classes of plan nodes are organized in a hierarchy. The root of the
 * hierarchy is the AbstractPlanNode class.
 * </p>
 *
 * <h1>How Does The Planner Work In Detail?</h1>
 * <p>
 * The planner takes a data structure describing a DML or DQL SQL
 * statement. It produces a plan of execution which can be used to
 * evaluate the statement. The Execution Engine executes the plan.
 * Note that DDL is processed entirely differently, and is generally
 * not part of this discussion. DDL which has embedded DQL or DML
 * will need to be planned, as detailed here. For example, materialized
 * views, single statement procedure definitions and delete statement
 * limit plans are DDL statements which have plannable elements.</p>
 *
 * <h2>The Input</h2>
 * <p>
 * The input to the planner is originally a SQL statement, which is a
 * text string. The HSQL front end translates the user's text into an
 * HSQL data structure which represents the statement. This data
 * structure is translated into a second data structure called
 * VoltXML, which is similar to XML. VoltXML has elements, called
 * VoltXMLElements, and elements have string valued attributes, just
 * like XML. There is no support for namespaces in VoltXML, though it
 * is hardly needed. This VoltXML is translated again into another
 * data structure called AbstractParsedStatement and its subclasses.
 * This AbstractParsedStatement, along with a catalog defining the
 * tables, indexes and other DDL objects, is the input to the
 * planner.</p>
 *
 * <p>
 * An AbstractParsedStatement object contains several pieces:
 * <ol>
 * <li>All the parts of the user's original SQL statement, including
 *   <ol>
 *   <li>The display list,</li>
 *   <li>The list of joined tables, including
 *     <ol>
 *     <li>Joined tables,</li>
 *     <li>Derived tables, and</li>
 *     <li>Expressions used in WHERE and ON conditions,</li>
 *     </ol>
 *   </li>
 *   <li>The list of GROUP BY expressions,</li>
 *   <li>The HAVING expression, if any,</li>
 *   <li>Any LIMIT or OFFSET constants.</li>
 *   </ol>
 * </li>
 * <li>An indication of whether the statement is a SELECT, DELETE,
 * INSERT, UPSERT or UPDATE statement, or a set operation like
 * UNION or INTERSECT. As we said
 * before, DDL is generally not planned, but some parts of
 * DDL statements may require planning.</li>
 * </ol>
 * Note that a joined or derived table is associated with a tablescan,
 * and that the planner generally works with tablescans rather than
 * tables. For example, in this SQL statement:
 * <pre>
 * select *
 * from t as left join t as right
 * on left.a = right.a;
 * </pre>
 * the aliases <code>left</code> and <code>right</code> name
 * the table T, which is scanned twice independently as the
 * left and right scans of the join.</p>
 *
 * <p>
 * The SQL standard and HSQL refer to tablescans as range
 * variables, but the planner calls them tablescans. These are
 * essentially a table with an alias to distinguish between different
 * references to the same table in a self join.</p>
 *
 * <h2>The Plan Nodes In Detail</h2>
 * <p>
 * Plan nodes compute some operation on one or more input tables. For
 * example, an ORDERBY plan node sorts its only input table. A
 * NESTEDLOOPJOIN node takes two tables as input and computes a join
 * in a nested loop, scanning the outer table in the outer loop, and
 * the inner table in the inner loop.</p>
 *
 * <p>
 * There are about 24 different plan nodes, organized in five categories. The categories are:
 * <ul>
 * <li>Operator nodes, for executing DML operations such as DELETE, INSERT and UPDATE,</li>
 * <li>Scan nodes, for scanning tables and indexes,</li>
 * <li>Join nodes, for calculating different kinds of joins,</li>
 * <li>Communication nodes, for sending result tables to the
 * coordinator or receiving results from distributed executions,</li>
 * <li>Miscellaneous nodes. These include
 *   <ul>
 *   <li>Nodes to compute aggregates and GROUP BY. The aggregation algorithms include
 *     <ul>
 *     <li>Serial aggregation, which is aggregation based on sorting,</li>
 *     <li>Hash aggregation, which is aggregation based on hash tables,</li>
 *     <li>Partial aggregation, which is a form of parallelized
 *     serial aggregation.</li>
 *     </ul>
 *   </li>
 *   <li>Nodes to compute set operations, like UNION or INTERSECT,</li>
 *   <li>Nodes to compute ORDERBY,</li>
 *   <li>Nodes to compute PROJECTION,</li>
 *   <li>Nodes to compute MATERIALIZE,</li>
 *   <li>Nodes to compute LIMIT and OFFSET,</li>
 *   <li>Nodes to compute WINDOWFUNCTION.</li>
 *   </ul>
 * </li>
 * </ul>
 * </p>
 *
 * <h2>The Execution Model</h2>
 * <h3>Plans and Fragments</h3>
 * <p>
 * The plan nodes are connected in one or two fragments. Each
 * fragment is a plan node tree.
 * <ul>
 * <li>A one-fragment plan is a single partition plan. It is
 * scheduled on a particular partition, and the resulting table is
 * sent to the user.</li>
 * <li>A two-fragment plan is a multi partition plan. The two
 * fragments are called the coordinator fragment and the distributed
 * fragment.</li>
 * <li>The distributed fragment is scheduled to be sent to all the
 * partitions and executed in parallel on the partitions.</li>
 * <li>The results of the executions of the distributed fragment on
 * each of the partitions are sent to the coordinator node, which
 * executes the coordinator fragment. The coordinator fragment
 * has a RECEIVE or a MERGERECEIVE node which knows how to
 * combine the distributed results into a single table.</li>
 * <li>Plan node execution is similar for both the distributed and
 * coordinator fragments.</li>
 * </ul>
 * </p>
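 *
 * <p>
 * Schematically, the simplest combination the coordinator can perform is a
 * plain concatenation of the per-partition result tables, as in the RECEIVE
 * case; the sorted MERGERECEIVE case is sketched below. The types here are
 * hypothetical stand-ins, not EE classes, and real combination involves more
 * bookkeeping than this.</p>
 * <pre>{@code
 * // Each partition ran the distributed fragment and produced one table.
 * Table combine(List<Table> partitionResults) {
 *     Table combined = new Table();
 *     for (Table partitionResult : partitionResults) {
 *         for (Row row : partitionResult) {
 *             combined.addRow(row);
 *         }
 *     }
 *     return combined;
 * }
 * }</pre>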
 *
 * <p>
 * Plan nodes have different kinds of children. A plan node can have
 * child nodes which are out-of-line or inline. The latter are used for
 * computations which can be performed row-wise, in a single pass through
 * the input table along with the parent node. For example, a LIMIT
 * or OFFSET node is often made inline, since it just counts how many
 * initial rows to allow or ignore. PROJECTION nodes can also be
 * inlined, since they operate only row-wise. An ORDERBY node can
 * generally not be inlined, since the entire table must be present to
 * sort it. But if the ORDERBY node's input table comes from a
 * RECEIVE node, and the planner can tell that the input is sorted,
 * then the RECEIVE node can be replaced with a MERGERECEIVE node and the
 * ORDERBY node can be inlined in the MERGERECEIVE node. The
 * MERGERECEIVE node gets sorted input tables from distributed
 * fragments. The MERGERECEIVE node merges the input tables by
 * repeatedly choosing the smallest of the first rows of all its input tables.
 * Indeed, the MERGERECEIVE node must have an inline ORDERBY node to
 * carry the ordering expressions.</p>
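 *
 * <p>
 * A minimal sketch of that merge, again with hypothetical types: each
 * per-partition table arrives already sorted, so the coordinator only has to
 * keep taking the smallest head row, and the output is globally sorted
 * without ever materializing and re-sorting the whole table.</p>
 * <pre>{@code
 * // The Comparator stands in for the ordering expressions carried by
 * // the inline ORDERBY node.
 * Table merge(List<Deque<Row>> sortedInputs, Comparator<Row> order) {
 *     Table merged = new Table();
 *     while (true) {
 *         Deque<Row> best = null;
 *         for (Deque<Row> input : sortedInputs) {
 *             if (!input.isEmpty()
 *                     && (best == null
 *                         || order.compare(input.peekFirst(), best.peekFirst()) < 0)) {
 *                 best = input;              // input with the smallest head row
 *             }
 *         }
 *         if (best == null) {
 *             break;                         // every input is exhausted
 *         }
 *         merged.addRow(best.pollFirst());
 *     }
 *     return merged;
 * }
 * }</pre>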
 *
 * <h3>Fragment Initialization</h3>
 * <p>
 * The plan nodes have a method for initialization. They first
 * initialize their children, then they initialize themselves.</p>
 *
 * <h3>Fragment Execution</h3>
 * <p>
 * A fragment tree executes the operations of each of its plan nodes,
 * starting with the leaves. At each node the execution function reads rows from
 * the input tables, computes a function into a single row, perhaps
 * alters that row with inline plan nodes and writes the output row to
 * a single output table. A plan node can only have a single parent and a
 * single output table.</p>
 *
 * <h2>The Planner's Passes</h2>
 * <p>
 * The planner doesn't have passes in the way a compiler might. It's
 * more of a pass itself. But it does have several phases.
 * <ol>
 * <li>First, the planner calculates some join orders. The
 * statement has a set of tablescans, and the efficiency of a
 * plan depends on the order in which the tablescans are
 * scanned. There is a large amount of confusing code devoted
 * to this, but essentially each possible order of the
 * tablescans is generated. There is a limit to the size of
 * this order set: if there are more than 5 tables, the order
 * in the statement is used, to avoid exponential explosion.
 * A Java stored procedure may also have an explicit
 * user-provided join order specification.</li>
 * <li>Each join order results in a join tree, whose Java type is
 * JoinNode.
 *   <ol>
 *   <li>The subtypes of JoinNode are SubqueryLeafNode, BranchNode
 *   and TableLeafNode. These represent derived tables (which
 *   we often call subqueries), joined tables and tablescans
 *   respectively.</li>
 *   <li>We do some analysis of the join predicates, to
 *   distribute them to the proper JoinNode nodes. For example,
 *     <ul>
 *     <li>If an expression contains only references to a single
 *     tablescan, that expression can be placed along with
 *     the TableLeafNode for the tablescan, to know if the
 *     expression can be used with an index.</li>
 *     <li>Similarly, if the tablescan is the outer scan of a
 *     join, the expression can also be added to an outer-only list in
 *     the join node. In this case we will know to evaluate
 *     these outer-only expressions when scanning and not scan the
 *     inner table at all if the expression's evaluation is
 *     false.</li>
 *     <li>Expressions which reference both branches of a join
 *     might be outer expressions of a higher level
 *     join.</li>
 *     <li>For OUTER joins, the differing semantics of ON clauses
 *     and WHERE clauses also factor into where in the join
 *     tree a predicate should apply.</li>
 *     </ul>
 *   </li>
 *   <li>The planner calculates a set of access paths for the tablescans.
 *   Note that this does not depend on the join order.
 *     <ol>
 *     <li>These access paths will tell how the tablescan will be
 *     scanned. It may be scanned directly or through an index.</li>
 *     <li>The scan associated with the access path may provide
 *     ordering for a statement level order by or a window
 *     function. This is stored in the access path as well.</li>
 *     </ol>
 *   </li>
 *   <li>For each join order, the planner creates a subplan in
 *   SubPlanAssembler. This class traverses the join tree
 *   and, consulting the access paths, turns the
 *   TableLeafNode and SubqueryLeafNode join nodes into some
 *   kind of AbstractScanPlanNode, and turns the BranchNodes
 *   into some kind of AbstractJoinPlanNode. The result is a
 *   set of subplans. These are just trees of scans and
 *   joins. The other parts of the statement (order by,
 *   group by, limit, offset, having, and the DML operations of
 *   table insertion, update and deletion) are not represented
 *   here. In particular, if the plan is a multi-partition
 *   plan, it needs a SEND/RECEIVE pair to designate the
 *   boundary between the coordinator and distributed plans.
 *   But this pair will not be part of the subplan.</li>
 *   <li>For each subplan, the plan assembler in PlanAssembler
 *   adds new nodes on the top of the subplan tree, to form
 *   the complete tree. These may include:
 *     <ul>
 *     <li>A SEND node on the top of the plan, to send the
 *     result to the server,</li>
 *     <li>SEND/RECEIVE pairs in multi-partition queries,</li>
 *     <li>Aggregate nodes, to calculate GROUP BY groups and
 *     aggregate functions,</li>
 *     <li>Window function nodes, to calculate window
 *     functions,</li>
 *     <li>ORDERBY nodes, to implement sorting,</li>
 *     <li>PROJECTION nodes, to calculate display lists.</li>
 *     </ul>
 *   All but the first two are optional.</li>
 *   </ol>
 * </li>
 * </ol>
 * </p>
 *
 * <p>
 * Note that the transformation from ORDERBY->RECEIVE to
 * MERGERECEIVE(ORDERBY inlined) discussed above is not done in the
 * plan assembler. Also, sometimes nodes which could be made inline
 * are not made so by the plan assembler. These and other final
 * transformations could be performed in the plan assembler, but it
 * seems simpler to do them locally, after the plan tree is formed.</p>
 *
 * <h2>The Microoptimizer</h2>
 * <p>
 * The microoptimizer takes a plan and applies various transformations. The current set of transformations includes:
 * <ol>
 * <li>Pushing limit and offset nodes into distributed fragments and
 * making them inline. If a statement has LIMIT 100, for example,
 * there is no reason for a distributed fragment to generate more
 * than 100 rows, since at most 100 of them can appear in the final
 * result.</li>
 * <li>Transforming plans with ORDERBY, LIMIT and OFFSET into plans
 * with MERGERECEIVE and inline ORDERBY, LIMIT and OFFSET plan
 * nodes.</li>
 * <li>Making aggregate nodes inline (a sketch of serial aggregation
 * follows this list).
 *   <ol>
 *   <li>A hash aggregate node reads each input row and calculates the group
 *   by keys and a hash value from this calculation. The hash
 *   aggregate node puts the group by key value into a hash
 *   table. It also updates the aggregate function values. For
 *   example, if GBV1 and GBV2 are group by keys, then the
 *   expression MIN(GBV1 * GBV2) would compute the minimum of
 *   the product of GBV1 and GBV2. This is a row-wise
 *   operation.</li>
 *   <li>A serial aggregate node reads each input row and calculates
 *   the group by keys. The input table will be sorted, so if
 *   the group by key values change we know we have seen an
 *   entire group and can output a row. In any case we update the
 *   aggregate values, like the MIN(GBV1*GBV2) expression above.
 *   So serial aggregate nodes can be calculated row-wise.</li>
 *   <li>Partial aggregation is a kind of serial aggregation.</li>
 *   </ol>
 * So, the aggregate node's calculation can be done row-wise. This
 * means the aggregate node can be inlined in some cases.</li>
 * <li>Replacing an aggregate plan node with an index count plan node or
 * a table count plan node.</li>
 * <li>Replacing an aggregate plan node with an index scan plan node with
 * an inline limit node.</li>
 * </ol>
 * </p>
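 *
 * <p>
 * As promised in the list above, here is a sketch of serial aggregation with
 * hypothetical types. Because the input arrives sorted on the group by keys,
 * a change in the key values marks the end of a group, so only the state of
 * the current group has to be kept:</p>
 * <pre>{@code
 * // Sketch of SELECT GBV1, GBV2, MIN(GBV1 * GBV2) ... GROUP BY GBV1, GBV2
 * // over input sorted on (GBV1, GBV2): one pass, row-wise, constant state.
 * Table output = new Table();
 * GroupKey currentKey = null;
 * long currentMin = Long.MAX_VALUE;
 * for (Row row : sortedInput) {
 *     GroupKey key = GroupKey.of(row);       // the group by key values
 *     if (currentKey != null && !currentKey.equals(key)) {
 *         // The keys changed, so the previous group is complete.
 *         output.addRow(Row.of(currentKey, currentMin));
 *         currentMin = Long.MAX_VALUE;
 *     }
 *     currentKey = key;
 *     currentMin = Math.min(currentMin, row.gbv1() * row.gbv2());
 * }
 * if (currentKey != null) {
 *     output.addRow(Row.of(currentKey, currentMin));  // the last group
 * }
 * }</pre>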
 */
package org.voltdb.planner;