Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: add links to HIVE-16091

Problem

Currently HIVE Hive doesn't support subqueries in a SELECT statement e.g. , for example, the following query will not run on HIVEHive:

Code Block
languagesql
SELECT customer.customer_num,
	(SELECT SUM(ship_charge) 
		FROM orders
		WHERE customer.customer_num = orders.customer_num
	) AS total_ship_chg
FROM customer 

Recently a lot of work has been done to extend support for subqueries (HIVE-15456). But this work primarily targeted to extend extending subquery support in WHERE and HAVING clauseclauses. We plan to continue the work done in HIVE-15456 to support subqueries in SELECT lista select list (see HIVE-16091).

 

Assumptions

We plan to limit the scope with the following assumption/assumptions and limitations.

  • Subqueries could only be top-level expressions in select i.e. SELECT. That is, subqueries in complex expressions, aggregates, UDFs, etc. will not be supported for now. e.g. For example the following queries will not run on HIVEHive:

    Code Block
    languagesql
    titleNot Supported
    -- subquery in non-simple expression
    SELECT 1 + (SELECT SUM(ship_charge) FROM orders), customer.customer_num FROM customer
     
    -- subquery in CASE
    SELECT CASE WHEN (select count(*) from store_sales 
                      where ss_quantity between 1 and 20) > 409437
                THEN (select avg(ss_ext_list_price) 
                      from store_sales 
                      where ss_quantity between 1 and 20) 
                ELSE (select avg(ss_net_paid_inc_tax)
                      from store_sales
                      where ss_quantity between 1 and 20) end bucket1
    FROM reason
    WHERE r_reason_sk = 1
  • Scalar subquery subqueries can only return at most one row. HIVE Hive will check for this case at runtime and throw an error if not satisfied. e.g. For example the following query is invalid:

    Code Block
    languagesql
    titleNot Supported
    SELECT customer.customer_num,
    	(SELECT ship_charge 
    		FROM orders
    		WHERE customer.customer_num = orders.customer_num
    	) AS total_ship_chg
    FROM customer 
  • Scalar subquery subqueries can only have one column. HIVE Hive will check for this case during compilation and throw an error. e.g For example the following query is invalid:

    Code Block
    languagesql
    titleNot Supported
    SELECT customer.customer_num,
    	(SELECT ship_charge, customer_num
    		FROM orders LIMIT 1
    	) AS total_ship_chg
    FROM customer
  • Correlated variables are only permitted in filter i.e. WHERE and a filter, that is, a WHERE or HAVING clause. e.g. For example the following query is invalid:

    Code Block
    languagesql
    titleNot Supported
    SELECT customer.customer_num,
    	(SELECT customer.customer_num 
    		FROM orders
    		WHERE customer.customer_num = orders.customer_num
    	) AS total_ship_chg
    FROM customer 
  • Subqueries with DISTINCT is are not allowed. Since DISTINCT <expression> will be evaluated as GROUP BY <expression>, subqueries with DISTINCT is are disallowed for now.

Design

Given the assumptions above, the following kind of subqueries could be used in select SELECT. 

  • Scalar subqueries e.g., for example: 

    Code Block
    languagesql
    SELECT customer.customer_num,
    	(SELECT SUM(ship_charge) 
    		FROM orders
    		WHERE customer.customer_num = orders.customer_num
    	) AS total_ship_chg
    FROM customer 
  • IN subqueries, for example:

    Code Block
    languagesql
    SELECT p_size IN (
    		SELECT MAX(p_size) FROM part)
    FROM part
  • EXISTS subqueries, for example:

    Code Block
    languagesql
    SELECT EXISTS(SELECT p_size FROM part)
    FROM part

...

Design for this will be similar to the work done in HIVEin HIVE-15456 15456.

  • genLogicalPlan will go over the select list to do the following:
    • If subquery is not a top-level expression, throw an error.
    • Otherwise, generate an appropriate plan by using RexSubquery to represent the subquery.
  • HiveSubqueryRemoveRule will then be applied to remove the RexSubquery node and rewrite the query into a join.
  • HiveRelDecorrelator::decorrelateQuery will then be used to decorrelate correlated queries.

 HIVE-16091 covers the initial work for supporting subqueries in SELECT.