...
Code Block |
---|
join_table: table_reference [INNER] JOIN table_factor [join_condition] | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition | table_reference LEFT SEMI JOIN table_reference join_condition | table_reference CROSS JOIN table_reference [join_condition] (as of Hive 0.10) table_reference: table_factor | join_table table_factor: tbl_name [alias] | table_subquery alias | ( table_references ) join_condition: ON equality_expression ( AND equality_expression )* equality_expression: expression = expression |
Only equality joins, outer joins, and left semi joins are supported in Hive. Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a map/reduce job. Also, more than two tables can be joined in Hive.
See Select Syntax for the context of this join syntax.
See Select Syntax for the context of this join syntax.
Info | ||
---|---|---|
| ||
Info | ||
| ||
Implicit join notation is supported starting with Hive 0.13.0 (see HIVE-5558). This allows the FROM clause to join a comma-separated list of tables, omitting the JOIN keyword. For example:
|
Info | ||
---|---|---|
| ||
Unqualified column references are supported in join conditions, starting with Hive 0.13.0 (see HIVE-6393). Hive attempts to resolve these against the inputs to a Join. If an unqualified column reference resolves to more than one table, Hive will flag it as an ambiguous reference. For example:
|
Info | ||
---|---|---|
| ||
Complex expressions in ON clause are supported, starting with Hive 2.2.0 (see HIVE-15211, HIVE-15251). Prior to that, Hive did not support join conditions that are not equality conditions. In particular, syntax for join conditions was restricted as follows: join_condition: equality_expression: |
Examples
Some salient points to consider when writing join queries are as follows:
Only equality joins Complex join expressions are allowed e.g.
Code Block SELECT a.* FROM a JOIN b ON (a.id = b.id)
Code Block SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department)
are both valid joins, however
Code Block SELECT a.* FROM a LEFT OUTER JOIN b ON (a.id <> b.id)
are valid joinsis NOT allowed.
More than 2 tables can be joined in the same query e.g.
Code Block SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
is a valid join.
...
LEFT SEMI JOIN implements the uncorrelated IN/EXISTS subquery semantics in an efficient way. As of Hive 0.13 the IN/NOT IN/EXISTS/NOT EXISTS operators are supported using subqueries so most of these JOINs don't have to be performed manually anymore. The restrictions of using LEFT SEMI JOIN is are that the right-hand-side table should only be referenced in the join condition (ON-clause), but not in WHERE- or SELECT-clauses etc.
Code Block SELECT a.key, a.value FROM a WHERE a.key in (SELECT b.key FROM B);
can be rewritten to:
Code Block SELECT a.key, a.val FROM a LEFT SEMI JOIN b ON (a.key = b.key)
...