You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 5 Next »

Problem

Currently HIVE doesn't support subqueries in SELECT statement e.g. following query will not run on HIVE

SELECT customer.customer_num,
	(SELECT SUM(ship_charge) 
		FROM orders
		WHERE customer.customer_num = orders.customer_num
	) AS total_ship_chg
FROM customer 

Recently a lot of work has been done to extend support for subqueries(HIVE-15456). But this work primarily targeted to extend subquery support in WHERE and HAVING clause. We plan to continue the work done in HIVE-15456 to support subqueries in SELECT list.

 

Assumptions

We plan to limit the scope with following assumption/limitations

  • Subqueries could only be top level expressions in select i.e. subqueries in complex expressions, aggregates, UDFs etc will not be supported for now. e.g. following queries will not run on HIVE

    -- subquery in non-simple expression
    SELECT 1 + (SELECT SUM(ship_charge) FROM orders), customer.customer_num FROM customer
     
    -- subquery in CASE
    SELECT CASE WHEN (select count(*) from store_sales 
                      where ss_quantity between 1 and 20) > 409437
                THEN (select avg(ss_ext_list_price) 
                      from store_sales 
                      where ss_quantity between 1 and 20) 
                ELSE (select avg(ss_net_paid_inc_tax)
                      from store_sales
                      where ss_quantity between 1 and 20) end bucket1
    FROM reason
    WHERE r_reason_sk = 1
  • Scalar subquery can only return at most one row. HIVE will check for this case at runtime and throw an error if not satisfied. e.g. following query is invalid

    SELECT customer.customer_num,
    	(SELECT ship_charge 
    		FROM orders
    		WHERE customer.customer_num = orders.customer_num
    	) AS total_ship_chg
    FROM customer 
  • Scalar subquery can only have one column. HIVE will check for this case during compilation and throw an error. e.g following query is invalid

    SELECT customer.customer_num,
    	(SELECT ship_charge, customer_num
    		FROM orders LIMIT 1
    	) AS total_ship_chg
    FROM customer

Design

Given the assumptions above following kind of subqueries could be used in select 

  • Scalar subqueries e.g. 

    SELECT customer.customer_num,
    	(SELECT SUM(ship_charge) 
    		FROM orders
    		WHERE customer.customer_num = orders.customer_num
    	) AS total_ship_chg
    FROM customer 
  • IN subqueries

    SELECT p_size IN (
    		SELECT MAX(p_size) FROM part)
    FROM part
  • EXISTS subqueries

    SELECT EXISTS(SELECT p_size FROM part)
    FROM part

All of the above queries could be correlated or uncorrelated.

 

  • No labels