Page History

...

Code Block
EXPLAIN [EXTENDED\|CBO\|AST\|DEPENDENCY\|AUTHORIZATION\|LOCKS\|VECTORIZATION\|ANALYZE] query

AUTHORIZATION is supported from HIVE 0.14.0 via HIVE-5961. VECTORIZATION is supported from Hive 2.3.0 via HIVE-11394. LOCKS is supported from Hive 3.2.0 via HIVE-17683.

...

A mapping from table alias to Map Operator Tree – This mapping tells the mappers which operator tree to call in order to process the rows from a particular table or result of a previous map/reduce stage. In Stage-1 in the above example, the rows from src table are processed by the operator tree rooted at a Reduce Output Operator. Similarly, in Stage-2 the rows of the results of Stage-1 are processed by another operator tree rooted at another Reduce Output Operator. Each of these Reduce Output Operators partitions the data to the reducers according to the criteria shown in the metadata.
A Reduce Operator Tree – This is the operator tree which processes all the rows on the reducer of the map/reduce job. In Stage-1 for example, the Reducer Operator Tree is carrying out a partial aggregation whereas the Reducer Operator Tree in Stage-2 computes the final aggregation from the partial aggregates computed in Stage-1.

The

...

CBO Clause

Outputs the query's Abstract Syntax Tree.

Example:

Code Block
EXPLAIN AST FROM src INSERT OVERWRITE TABLE dest_g1 SELECT src.key, sum(substr(src.value,4)) GROUP BY src.key;

...

The CBO clause outputs the plan generated by Calcite optimizer. It can optionally include information about the cost of the plan using Calcite default cost model and cost model used for join reordering. Since Hive release 4.0.0 (HIVE-17503 / HIVE-21184).

Syntax: EXPLAIN [FORMATTED] CBO [COST|JOINCOST]

COST option prints the plan and cost calculated using Calcite default cost model.
JOINCOST option prints the plan and cost calculated using the cost model used for join reordering.

For example, we can execute the following statement:

Code Block

EXPLAIN CBO
WITH customer_total_return AS
(SELECT sr_customer_sk AS ctr_customer_sk,
  sr_store_sk AS ctr_store_sk,
  SUM(SR_FEE) AS ctr_total_return
  FROM store_returns, date_dim
  WHERE sr_returned_date_sk = d_date_sk
    AND d_year =2000
  GROUP BY sr_customer_sk, sr_store_sk)
SELECT c_customer_id
FROM customer_total_return ctr1, store, customer
WHERE ctr1.ctr_total_return > (SELECT AVG(ctr_total_return)*1.2
FROM customer_total_return ctr2
WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk)
  AND s_store_sk = ctr1.ctr_store_sk
  AND s_state = 'NM'
  AND ctr1.ctr_customer_sk = c_customer_sk
ORDER BY c_customer_id
LIMIT 100

The query will be optimized and Hive produces the following output:

Code Block

CBO PLAN:
HiveSortLimit(sort0=[$0], dir0=[ASC], fetch=[100])
  HiveProject(c_customer_id=[$1])
    HiveJoin(condition=[AND(=($3, $7), >($4, $6))], joinType=[inner], algorithm=[none], cost=[not available])
      HiveJoin(condition=[=($2, $0)], joinType=[inner], algorithm=[none], cost=[not available])
        HiveProject(c_customer_sk=[$0], c_customer_id=[$1])
          HiveFilter(condition=[IS NOT NULL($0)])
            HiveTableScan(table=[[default, customer]], table:alias=[customer])
        HiveJoin(condition=[=($3, $1)], joinType=[inner], algorithm=[none], cost=[not available])
          HiveProject(sr_customer_sk=[$0], sr_store_sk=[$1], $f2=[$2])
            HiveAggregate(group=[{1, 2}], agg#0=[sum($3)])
              HiveJoin(condition=[=($0, $4)], joinType=[inner], algorithm=[none], cost=[not available])
                HiveProject(sr_returned_date_sk=[$0], sr_customer_sk=[$3], sr_store_sk=[$7], sr_fee=[$14])
                  HiveFilter(condition=[AND(IS NOT NULL($0), IS NOT NULL($7), IS NOT NULL($3))])
                    HiveTableScan(table=[[default, store_returns]], table:alias=[store_returns])
                HiveProject(d_date_sk=[$0])
                  HiveFilter(condition=[AND(=($6, 2000), IS NOT NULL($0))])
                    HiveTableScan(table=[[default, date_dim]], table:alias=[date_dim])
          HiveProject(s_store_sk=[$0])
            HiveFilter(condition=[AND(=($24, _UTF-16LE'NM'), IS NOT NULL($0))])
              HiveTableScan(table=[[default, store]], table:alias=[store])
      HiveProject(_o__c0=[*(/($1, $2), 1.2)], ctr_store_sk=[$0])
        HiveAggregate(group=[{1}], agg#0=[sum($2)], agg#1=[count($2)])
          HiveProject(sr_customer_sk=[$0], sr_store_sk=[$1], $f2=[$2])
            HiveAggregate(group=[{1, 2}], agg#0=[sum($3)])
              HiveJoin(condition=[=($0, $4)], joinType=[inner], algorithm=[none], cost=[not available])
                HiveProject(sr_returned_date_sk=[$0], sr_customer_sk=[$3], sr_store_sk=[$7], sr_fee=[$14])
                  HiveFilter(condition=[AND(IS NOT NULL($0), IS NOT NULL($7))])
                    HiveTableScan(table=[[default, store_returns]], table:alias=[store_returns])
                HiveProject(d_date_sk=[$0])
                  HiveFilter(condition=[AND(=($6, 2000), IS NOT NULL($0))])
                    HiveTableScan(table=[[default, date_dim]], table:alias=[date_dim])

In turn, we can execute the following command:

Code Block

EXPLAIN CBO COST
WITH customer_total_return AS
(SELECT sr_customer_sk AS ctr_customer_sk,
  sr_store_sk AS ctr_store_sk,
  SUM(SR_FEE) AS ctr_total_return
  FROM store_returns, date_dim
  WHERE sr_returned_date_sk = d_date_sk
    AND d_year =2000
  GROUP BY sr_customer_sk, sr_store_sk)
SELECT c_customer_id
FROM customer_total_return ctr1, store, customer
WHERE ctr1.ctr_total_return > (SELECT AVG(ctr_total_return)*1.2
FROM customer_total_return ctr2
WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk)
  AND s_store_sk = ctr1.ctr_store_sk
  AND s_state = 'NM'
  AND ctr1.ctr_customer_sk = c_customer_sk
ORDER BY c_customer_id
LIMIT 100

It will produce a similar plan, but the cost for each operator will be embedded next to the operator descriptors:

Code Block

CBO PLAN:
HiveSortLimit(sort0=[$0], dir0=[ASC], fetch=[100]): rowcount = 100.0, cumulative cost = {2.395588892021712E26 rows, 1.197794434438787E26 cpu, 0.0 io}, id = 1683
  HiveProject(c_customer_id=[$1]): rowcount = 1.1977944344387866E26, cumulative cost = {2.395588892021712E26 rows, 1.197794434438787E26 cpu, 0.0 io}, id = 1681
    HiveJoin(condition=[AND(=($3, $7), >($4, $6))], joinType=[inner], algorithm=[none], cost=[not available]): rowcount = 1.1977944344387866E26, cumulative cost = {1.1977944575829254E26 rows, 4.160211553874922E10 cpu, 0.0 io}, id = 1679
      HiveJoin(condition=[=($2, $0)], joinType=[inner], algorithm=[none], cost=[not available]): rowcount = 2.3144135067474273E18, cumulative cost = {2.3144137967122499E18 rows, 1.921860676139634E10 cpu, 0.0 io}, id = 1663
        HiveProject(c_customer_sk=[$0], c_customer_id=[$1]): rowcount = 7.2E7, cumulative cost = {2.24E8 rows, 3.04000001E8 cpu, 0.0 io}, id = 1640
          HiveFilter(condition=[IS NOT NULL($0)]): rowcount = 7.2E7, cumulative cost = {1.52E8 rows, 1.60000001E8 cpu, 0.0 io}, id = 1638
            HiveTableScan(table=[[default, customer]], table:alias=[customer]): rowcount = 8.0E7, cumulative cost = {8.0E7 rows, 8.0000001E7 cpu, 0.0 io}, id = 1055
        HiveJoin(condition=[=($3, $1)], joinType=[inner], algorithm=[none], cost=[not available]): rowcount = 2.1429754692105807E11, cumulative cost = {2.897408225471977E11 rows, 1.891460676039634E10 cpu, 0.0 io}, id = 1661
          HiveProject(sr_customer_sk=[$0], sr_store_sk=[$1], $f2=[$2]): rowcount = 6.210443022113779E9, cumulative cost = {7.544327346205959E10 rows, 1.891460312135634E10 cpu, 0.0 io}, id = 1685
            HiveAggregate(group=[{1, 2}], agg#0=[sum($3)]): rowcount = 6.210443022113779E9, cumulative cost = {6.92328304399458E10 rows, 2.8327405501500005E8 cpu, 0.0 io}, id = 1654
              HiveJoin(condition=[=($0, $4)], joinType=[inner], algorithm=[none], cost=[not available]): rowcount = 6.2104430221137794E10, cumulative cost = {6.2246082040067795E10 rows, 2.8327405501500005E8 cpu, 0.0 io}, id = 1652
                HiveProject(sr_returned_date_sk=[$0], sr_customer_sk=[$3], sr_store_sk=[$7], sr_fee=[$14]): rowcount = 4.198394835000001E7, cumulative cost = {1.4155904670000002E8 rows, 2.8311809440000004E8 cpu, 0.0 io}, id = 1645
                  HiveFilter(condition=[AND(IS NOT NULL($0), IS NOT NULL($7), IS NOT NULL($3))]): rowcount = 4.198394835000001E7, cumulative cost = {9.957509835000001E7 rows, 1.15182301E8 cpu, 0.0 io}, id = 1643
                    HiveTableScan(table=[[default, store_returns]], table:alias=[store_returns]): rowcount = 5.759115E7, cumulative cost = {5.759115E7 rows, 5.7591151E7 cpu, 0.0 io}, id = 1040
                HiveProject(d_date_sk=[$0]): rowcount = 9861.615, cumulative cost = {92772.23000000001 rows, 155960.615 cpu, 0.0 io}, id = 1650
                  HiveFilter(condition=[AND(=($6, 2000), IS NOT NULL($0))]): rowcount = 9861.615, cumulative cost = {82910.615 rows, 146099.0 cpu, 0.0 io}, id = 1648
                    HiveTableScan(table=[[default, date_dim]], table:alias=[date_dim]): rowcount = 73049.0, cumulative cost = {73049.0 rows, 73050.0 cpu, 0.0 io}, id = 1043
          HiveProject(s_store_sk=[$0]): rowcount = 230.04000000000002, cumulative cost = {2164.08 rows, 3639.04 cpu, 0.0 io}, id = 1659
            HiveFilter(condition=[AND(=($24, _UTF-16LE'NM'), IS NOT NULL($0))]): rowcount = 230.04000000000002, cumulative cost = {1934.04 rows, 3409.0 cpu, 0.0 io}, id = 1657
              HiveTableScan(table=[[default, store]], table:alias=[store]): rowcount = 1704.0, cumulative cost = {1704.0 rows, 1705.0 cpu, 0.0 io}, id = 1050
      HiveProject(_o__c0=[*(/($1, $2), 1.2)], ctr_store_sk=[$0]): rowcount = 6.900492246793088E8, cumulative cost = {8.537206083312463E10 rows, 2.2383508777352882E10 cpu, 0.0 io}, id = 1677
        HiveAggregate(group=[{1}], agg#0=[sum($2)], agg#1=[count($2)]): rowcount = 6.900492246793088E8, cumulative cost = {8.468201160844533E10 rows, 2.1003410327994267E10 cpu, 0.0 io}, id = 1675
          HiveProject(sr_customer_sk=[$0], sr_store_sk=[$1], $f2=[$2]): rowcount = 6.900492246793088E9, cumulative cost = {8.381945007759619E10 rows, 2.1003410327994267E10 cpu, 0.0 io}, id = 1686
            HiveAggregate(group=[{1, 2}], agg#0=[sum($3)]): rowcount = 6.900492246793088E9, cumulative cost = {7.69189578308031E10 rows, 3.01933587615E8 cpu, 0.0 io}, id = 1673
              HiveJoin(condition=[=($0, $4)], joinType=[inner], algorithm=[none], cost=[not available]): rowcount = 6.900492246793088E10, cumulative cost = {6.915590405316087E10 rows, 3.01933587615E8 cpu, 0.0 io}, id = 1671
                HiveProject(sr_returned_date_sk=[$0], sr_customer_sk=[$3], sr_store_sk=[$7], sr_fee=[$14]): rowcount = 4.66488315E7, cumulative cost = {1.50888813E8 rows, 3.01777627E8 cpu, 0.0 io}, id = 1667
                  HiveFilter(condition=[AND(IS NOT NULL($0), IS NOT NULL($7))]): rowcount = 4.66488315E7, cumulative cost = {1.042399815E8 rows, 1.15182301E8 cpu, 0.0 io}, id = 1665
                    HiveTableScan(table=[[default, store_returns]], table:alias=[store_returns]): rowcount = 5.759115E7, cumulative cost = {5.759115E7 rows, 5.7591151E7 cpu, 0.0 io}, id = 1040
                HiveProject(d_date_sk=[$0]): rowcount = 9861.615, cumulative cost = {92772.23000000001 rows, 155960.615 cpu, 0.0 io}, id = 1650
                  HiveFilter(condition=[AND(=($6, 2000), IS NOT NULL($0))]): rowcount = 9861.615, cumulative cost = {82910.615 rows, 146099.0 cpu, 0.0 io}, id = 1648
                    HiveTableScan(table=[[default, date_dim]], table:alias=[date_dim]): rowcount = 73049.0, cumulative cost = {73049.0 rows, 73050.0 cpu, 0.0 io}, id = 1043

The AST Clause

Outputs the query's Abstract Syntax Tree.

Example:

Code Block
EXPLAIN AST FROM src INSERT OVERWRITE TABLE dest_g1 SELECT src.key, sum(substr(src.value,4)) GROUP BY src.key;

Outputs:

Code Block

ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF src)) (TOK_INSERT (TOK_DESTINATION (TOK_TAB dest_g1)) (TOK_SELECT (TOK_SELEXPR (TOK_COLREF src key)) (TOK_SELEXPR (TOK_FUNCTION sum (TOK_FUNCTION substr (TOK_COLREF src value) 4)))) (TOK_GROUPBY (TOK_COLREF src key))))

The DEPENDENCY Clause

The use of DEPENDENCY in the EXPLAIN statement produces extra information about the inputs in the plan. It shows various attributes for the inputs. For example, for a query like:

Code Block
EXPLAIN DEPENDENCY SELECT key, count(1) FROM srcpart WHERE ds IS NOT NULL GROUP BY key

the following output is produced:

Code Block

{"input_partitions":[{"partitionName":"default<at:var at:name="srcpart" />ds=2008-04-08/hr=11"},{"partitionName":"default<at:var at:name="srcpart" />ds=2008-04-08/hr=12"},{"partitionName":"default<at:var at:name="srcpart" />ds=2008-04-09/hr=11"},{"partitionName":"default<at:var at:name="srcpart" />ds=2008-04-09/hr=12"}],"input_tables":[{"tablename":"default@srcpart","tabletype":"MANAGED_TABLE"}]}

The inputs contain both the tables and the partitions. Note that the table is present even if none of the partitions is accessed in the query.

The dependencies show the parents in case a table is accessed via a view. Consider the following queries:

Code Block
CREATE VIEW V1 AS SELECT key, value from src; EXPLAIN DEPENDENCY SELECT * FROM V1;

The following output is produced:

Code Block
{"input_partitions":[],"input_tables":[{"

Code Block

ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF src)) (TOK_INSERT (TOK_DESTINATION (TOK_TAB dest_g1)) (TOK_SELECT (TOK_SELEXPR (TOK_COLREF src key)) (TOK_SELEXPR (TOK_FUNCTION sum (TOK_FUNCTION substr (TOK_COLREF src value) 4)))) (TOK_GROUPBY (TOK_COLREF src key))))

...

The use of DEPENDENCY in the EXPLAIN statement produces extra information about the inputs in the plan. It shows various attributes for the inputs. For example, for a query like:

Code Block
EXPLAIN DEPENDENCY SELECT key, count(1) FROM srcpart WHERE ds IS NOT NULL GROUP BY key

the following output is produced:

Code Block

{"input_partitions":[{"partitionName":"default<at:var at:name="srcpart" />ds=2008-04-08/hr=11"},{"partitionName":"default<at:var at:name="srcpart" />ds=2008-04-08/hr=12"},{"partitionName":"default<at:var at:name="srcpart" />ds=2008-04-09/hr=11"},{"partitionName":"default<at:var at:name="srcpart" />ds=2008-04-09/hr=12"}],"input_tables":[{"tablename":"default@srcpart","tabletype":"MANAGED_TABLE"}]}

The inputs contain both the tables and the partitions. Note that the table is present even if none of the partitions is accessed in the query.

The dependencies show the parents in case a table is accessed via a view. Consider the following queries:

Code Block
CREATE VIEW V1 AS SELECT key, value from src; EXPLAIN DEPENDENCY SELECT * FROM V1;

The following output is produced:

Code Block
{"input_partitions":[],"input_tables":[{"tablename":"default@v1","tabletype":"VIRTUAL_VIEW"},{"tablename":"default@src","tabletype":"MANAGED_TABLE","tableParents":"[default@v1]"}]}

...

With the FORMATTED keyword, it will be returned in JSON format.

...

Code Block

"OUTPUTS":["hdfs://localhost:9000/tmp/.../-mr-10000"],"INPUTS":["default@srcpart","default@src","default@srcpart@ds=2008-04-08/hr=11","default@srcpart@ds=2008-04-08/hr=12","default@srcpart@ds=2008-04-09/hr=11","default@srcpart@ds=2008-04-09/hr=12"],"OPERATION":"QUERY","CURRENT_USER":"navis","AUTHORIZATION_FAILURES":["Permission denied: Principal [name=navis, type=USER] does not have following privileges for operation QUERY [[SELECT] on Object [type=TABLE_OR_VIEW, name=default.src], [SELECT] on Object [type=TABLE_OR_VIEW, name=default.srcpart]]"]}

The LOCKS Clause

This is useful to understand what locks the system will acquire to run the specified query. Since Hive release 3.2.0 (HIVE-17683).

For example

Code Block
EXPLAIN LOCKS UPDATE target SET b = 1 WHERE p IN (SELECT t.q1 FROM source t WHERE t.a1=5)

Will produce output like this.

Code Block

LOCK INFORMATION:
default.source -> SHARED_READ
default.target.p=1/q=2 -> SHARED_READ
default.target.p=1/q=3 -> SHARED_READ
default.target.p=2/q=2 -> SHARED_READ
default.target.p=2/q=2 -> SHARED_WRITE
default.target.p=1/q=3 -> SHARED_WRITE
default.target.p=1/q=2 -> SHARED_WRITE

Code Block
EXPLAIN FORMATTED LOCKS <sql>

is also supported which will produce JSON encoded output.

The VECTORIZATION Clause

Adds detail to the EXPLAIN output showing why Map and Reduce work is not vectorized. Since Hive release 2.3.0 (HIVE-11394).

Syntax: EXPLAIN VECTORIZATION [ONLY] [SUMMARY|OPERATOR|EXPRESSION|DETAIL]

ONLY option suppresses most non-vectorization elements.
SUMMARY (default) shows vectorization information for the PLAN (is vectorization enabled) and a summary of Map and Reduce work.
OPERATOR shows vectorization information for operators. E.g. Filter Vectorization. Includes all information of SUMMARY.
EXPRESSION shows vectorization information for expressions. E.g. predicateExpression. Includes all information of SUMMARY and OPERATOR.
DETAIL shows detail-level vectorization information. It includes all information of SUMMARY, OPERATOR, and EXPRESSION.

The optional clause defaults are not ONLY and SUMMARY.

...

["Permission denied: Principal [name=navis, type=USER] does not have following privileges for operation QUERY [[SELECT] on Object [type=TABLE_OR_VIEW, name=default.src], [SELECT] on Object [type=TABLE_OR_VIEW, name=default.srcpart]]"]}

The LOCKS Clause

This is useful to understand what locks the system will acquire to run the specified query. Since Hive release 3.2.0 (HIVE-17683).

For example

Code Block
EXPLAIN LOCKS UPDATE target SET b = 1 WHERE p IN (SELECT t.q1 FROM source t WHERE t.a1=5)

Will produce output like this.

Code Block

LOCK INFORMATION:
default.source -> SHARED_READ
default.target.p=1/q=2 -> SHARED_READ
default.target.p=1/q=3 -> SHARED_READ
default.target.p=2/q=2 -> SHARED_READ
default.target.p=2/q=2 -> SHARED_WRITE
default.target.p=1/q=3 -> SHARED_WRITE
default.target.p=1/q=2 -> SHARED_WRITE

Code Block
EXPLAIN FORMATTED LOCKS <sql>

is also supported which will produce JSON encoded output.

The VECTORIZATION Clause

Adds detail to the EXPLAIN output showing why Map and Reduce work is not vectorized. Since Hive release 2.3.0 (HIVE-11394).

Syntax: EXPLAIN VECTORIZATION [ONLY] [SUMMARY|OPERATOR|EXPRESSION|DETAIL]

ONLY option suppresses most non-vectorization elements.
SUMMARY (default) shows vectorization information for the PLAN (is vectorization enabled) and a summary of Map and Reduce work.
OPERATOR shows vectorization information for operators. E.g. Filter Vectorization. Includes all information of SUMMARY.
EXPRESSION shows vectorization information for expressions. E.g. predicateExpression. Includes all information of SUMMARY and OPERATOR.
DETAIL shows detail-level vectorization information. It includes all information of SUMMARY, OPERATOR, and EXPRESSION.

The optional clause defaults are not ONLY and SUMMARY.

See HIVE-11394 for more details and examples.

The ANALYZE Clause

Annotates the plan with actual row counts. Since in Hive 2.2.0 (HIVE-14362)

Format is: (estimated row count) / (actual row count)

Example:

For the below tablescan; the estimation was 500 rows; but actually the scan only yielded 13 rows.

Code Block
[...] TableScan [TS_13] (rows=500/13 width=178) Output:["key","value"] [...]

User-level Explain Output

...

Space shortcuts

Child pages

Versions Compared

Old Version 27

New Version Current

Key

The

CBO Clause

The AST Clause

The DEPENDENCY Clause

The LOCKS Clause

The VECTORIZATION Clause

The LOCKS Clause

The VECTORIZATION Clause

The ANALYZE Clause

User-level Explain Output