Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Added sample monolithic query.

...

Hive is widely applied as solution to numerous distinct problem types in the domain of big data. Quite clearly it often used for the adhoc ad hoc querying of large datasets. However it is also used to implement ETL type processes. Unlike ad hoc queries, the HQL written for ETLs has some distinct attributes:

...

Unlike column level logic, it is much less obvious how best to encapsulate and compose collections of set based logic. Options include: . Consider the following example of a single complex query comprising joins, groupings, and column projections:

Code Block
languagesql
titleMonolithic query
SELECT ... FROM (                  -- Query 1
  SELECT ... FROM (                --  Query 2
    SELECT ... FROM (              --   Query 3
      SELECT ... FROM a WHERE ...  --    Query 4
    ) A LEFT JOIN (                --   Query 3
      SELECT ... FROM b            --    Query 5
    ) B ON (...)                   --   Query 3 
  ) ab FULL OUTER JOIN (           --  Query 2
    SELECT ... FROM c WHERE ...    --   Query 6
  ) C ON (...)                     --  Query 2
) abc LEFT JOIN (                  -- Query 1
  SELECT ... FROM d WHERE ...      --  Query 7
) D ON (...)                       -- Query 1
GROUP BY ...;                      -- Query 1

This query has a very broad set of responsibilities which cannot be easily verified in isolation. On closer inspection it appears that it is in fact formed of at least 7 distinct queries. To effectively unit test the process that this query represents an approach must be applied that separates and encapsulates each of the subqueries so that they can be tested independently. Possibly approaches to this include: VIEWs, sequential execution of components with intermediate (possibly TEMPORARY) tables, and even variable substitution of query fragments.

...