Page History

...

Hive is widely applied as solution to numerous distinct problem types in the domain of big data. Quite clearly it often used for the adhoc ad hoc querying of large datasets. However it is also used to implement ETL type processes. Unlike ad hoc queries, the HQL written for ETLs has some distinct attributes:

...

Unlike column level logic, it is much less obvious how best to encapsulate and compose collections of set based logic. Options include: . Consider the following example of a single complex query comprising joins, groupings, and column projections:

Code Block

language	sql
title	Monolithic query

SELECT ... FROM (                  -- Query 1
  SELECT ... FROM (                --  Query 2
    SELECT ... FROM (              --   Query 3
      SELECT ... FROM a WHERE ...  --    Query 4
    ) A LEFT JOIN (                --   Query 3
      SELECT ... FROM b            --    Query 5
    ) B ON (...)                   --   Query 3 
  ) ab FULL OUTER JOIN (           --  Query 2
    SELECT ... FROM c WHERE ...    --   Query 6
  ) C ON (...)                     --  Query 2
) abc LEFT JOIN (                  -- Query 1
  SELECT ... FROM d WHERE ...      --  Query 7
) D ON (...)                       -- Query 1
GROUP BY ...;                      -- Query 1

This query has a very broad set of responsibilities which cannot be easily verified in isolation. On closer inspection it appears that it is in fact formed of at least 7 distinct queries. To effectively unit test the process that this query represents an approach must be applied that separates and encapsulates each of the subqueries so that they can be tested independently. Possibly approaches to this include: VIEWs, sequential execution of components with intermediate (possibly TEMPORARY) tables, and even variable substitution of query fragments.

...

Space shortcuts

Child pages

Versions Compared

Old Version 3

New Version 4

Key