Problem
Currently HIVE doesn't support subqueries in SELECT statement e.g. following query will not run on HIVE
SELECT customer.customer_num, (SELECT SUM(ship_charge) FROM orders WHERE customer.customer_num = orders.customer_num ) AS total_ship_chg FROM customer
Recently a lot of work has been done to extend support for subqueries(HIVE-15456). But this work primarily targeted to extend subquery support in WHERE and HAVING clause. We plan to continue the work done in HIVE-15456 to support subqueries in SELECT list.
Assumptions
We plan to limit the scope with following assumption/limitations
Subqueries could only be top level expressions in select i.e. subqueries in complex expressions, aggregates, UDFs etc will not be supported for now. e.g. following queries will not run on HIVE
-- subquery in non-simple expression SELECT 1 + (SELECT SUM(ship_charge) FROM orders), customer.customer_num FROM customer -- subquery in CASE SELECT CASE WHEN (select count(*) from store_sales where ss_quantity between 1 and 20) > 409437 THEN (select avg(ss_ext_list_price) from store_sales where ss_quantity between 1 and 20) ELSE (select avg(ss_net_paid_inc_tax) from store_sales where ss_quantity between 1 and 20) end bucket1 FROM reason WHERE r_reason_sk = 1
Scalar subquery can only return at most one row. HIVE will check for this case at runtime and throw an error if not satisfied. e.g. following query is invalid
SELECT customer.customer_num, (SELECT ship_charge FROM orders WHERE customer.customer_num = orders.customer_num ) AS total_ship_chg FROM customer
Scalar subquery can only have one column. HIVE will check for this case during compilation and throw an error. e.g following query is invalid
SELECT customer.customer_num, (SELECT ship_charge, customer_num FROM orders LIMIT 1 ) AS total_ship_chg FROM customer
Design
Given the assumptions above following kind of subqueries could be used in select
Scalar subqueries e.g.
SELECT customer.customer_num, (SELECT SUM(ship_charge) FROM orders WHERE customer.customer_num = orders.customer_num ) AS total_ship_chg FROM customer
IN subqueries
SELECT p_size IN ( SELECT MAX(p_size) FROM part) FROM part
EXISTS subqueries
SELECT EXISTS(SELECT p_size FROM part) FROM part
All of the above queries could be correlated or uncorrelated.
Design for this will be similar to the work done in HIVE-15456
- genLogicalPlan will go over select list to do the following:
- If subquery is not top level expression throw an error
- Otherwise, generate appropriate plan by using RexSubquery to represent subquery
- HiveSubqueryRemoveRule will then be applied to remove the RexSubquery node and rewrite the query into join.
- HiveRelDecorrelator::decorrelateQuery will then be used to decorrelate correlated queries.