Hive Outer Join Behavior
This document is based on a writeup of DB2 Outer Join Behavior.
Definitions
Preserved Row table |
The table in an Outer Join that must return all rows. |
Null Supplying table |
This is the table that has nulls filled in for its columns in unmatched rows. |
During Join predicate |
A predicate that is in the JOIN ON clause. |
After Join predicate |
A predicate that is in the WHERE clause. |
Predicate Pushdown Rules
The logic can be summarized by these two rules:
- During Join predicates cannot be pushed past Preserved Row tables.
- After Join predicates cannot be pushed past Null Supplying tables.
This captured in the following table:
|
Preserved Row Table |
Null Supplying Table |
---|---|---|
Join |
Case J1: |
Case J2: |
Where |
Case W1: |
Case W2: |
Hive Implementation
Hive enforces the predicate pushdown rules by:
Rule 1: During QBJoinTree construction in Plan Gen, the parse Join Condition logic applies this rule.
Rule 2: During JoinPPD (Join Predicate Pushdown) the get Qualified Alias logic applies this rule.
Examples
Given Src(Key String, Value String) the following Left Outer Join examples show that Hive has the correct behavior.
Case J1: Join Predicate on Preserved Row Table
explain select s1.key, s2.key from src s1 left join src s2 on s1.key > '2'; STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: s1 TableScan alias: s1 Reduce Output Operator sort order: tag: 0 value expressions: expr: key type: string s2 TableScan alias: s2 Reduce Output Operator sort order: tag: 1 value expressions: expr: key type: string Reduce Operator Tree: Join Operator condition map: Left Outer Join0 to 1 condition expressions: 0 {VALUE._col0} 1 {VALUE._col0} filter predicates: 0 {(VALUE._col0 > '2')} 1 handleSkewJoin: false outputColumnNames: _col0, _col4 Select Operator expressions: expr: _col0 type: string expr: _col4 type: string outputColumnNames: _col0, _col1 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1
Case J2: Join Predicate on Null Supplying Table
explain select s1.key, s2.key from src s1 left join src s2 on s2.key > '2'; STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: s1 TableScan alias: s1 Reduce Output Operator sort order: tag: 0 value expressions: expr: key type: string s2 TableScan alias: s2 Filter Operator predicate: expr: (key > '2') type: boolean Reduce Output Operator sort order: tag: 1 value expressions: expr: key type: string Reduce Operator Tree: Join Operator condition map: Left Outer Join0 to 1 condition expressions: 0 {VALUE._col0} 1 {VALUE._col0} handleSkewJoin: false outputColumnNames: _col0, _col4 Select Operator expressions: expr: _col0 type: string expr: _col4 type: string outputColumnNames: _col0, _col1 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1
Case W1: Where Predicate on Preserved Row Table
explain select s1.key, s2.key from src s1 left join src s2 where s1.key > '2'; STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: s1 TableScan alias: s1 Filter Operator predicate: expr: (key > '2') type: boolean Reduce Output Operator sort order: tag: 0 value expressions: expr: key type: string s2 TableScan alias: s2 Reduce Output Operator sort order: tag: 1 value expressions: expr: key type: string Reduce Operator Tree: Join Operator condition map: Left Outer Join0 to 1 condition expressions: 0 {VALUE._col0} 1 {VALUE._col0} handleSkewJoin: false outputColumnNames: _col0, _col4 Select Operator expressions: expr: _col0 type: string expr: _col4 type: string outputColumnNames: _col0, _col1 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1
Case W2: Where Predicate on Null Supplying Table
explain select s1.key, s2.key from src s1 left join src s2 where s2.key > '2'; STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: s1 TableScan alias: s1 Reduce Output Operator sort order: tag: 0 value expressions: expr: key type: string s2 TableScan alias: s2 Reduce Output Operator sort order: tag: 1 value expressions: expr: key type: string Reduce Operator Tree: Join Operator condition map: Left Outer Join0 to 1 condition expressions: 0 {VALUE._col0} 1 {VALUE._col0} handleSkewJoin: false outputColumnNames: _col0, _col4 Filter Operator predicate: expr: (_col4 > '2') type: boolean Select Operator expressions: expr: _col0 type: string expr: _col4 type: string outputColumnNames: _col0, _col1 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1