Hive Outer Join Behavior

1 Definitions:

This based on writeup of DB2 Outer Join behavior.

Preserved Row table
In an Outer Join this is the table that must return all rows. So for left outer joins this is the Left table, for right outer joins, it is the Right table, and for full outer both tables are Preserved Row tables.
Null Supplying tables
this is the table that can that has nulls filled in for its columns. In the non full outer case, this is the other table in the Join. For full outer both tables are also Null Supplying tables.
During Join predicate
these are predicates that are in the Join On clause. For e.g. 'R1 join R2 on R1.x = 5'. The predicate 'R1.x = 5' is a During Join predicate
After Join predicate
these are predicates that are in the Where clause.

2 Predicate Pushdown Rules

The logic can be summarized by these 2 rules:

  1. During Join Predicates cannot be pushed past Preserved Row tables.
  2. After Join Predicates cannot be pushed past Null Supplying tables.

This captured in the following table:

  Preserved Row Table Null Supplying Table
Join Case J1: Case J2:
Predicate Not Pushed Pushed
Where Case W1: Case W2:
Predicate Pushed Not Pushed

2.1 Hive implementation

Hive enforces these Rules by:

Rule 1
During QBJoinTree construction in Plan Gen, the parse Join Condition logic applies this rule.
Rule 2
During JoinPPD (Join Predicate Pushdown) the get Qualified Alias logic applies this rule.

3 Examples

Given Src(Key String, Value String) the following Left outer Join examples show that Hive has the correct behavior.

3.1 Case J1: Join Predicate on Preserved Row table

explain 
select s1.key, s2.key 
from src s1 left join src s2 on s1.key > '2';

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
	s1
	  TableScan
	    alias: s1
	    Reduce Output Operator
	      sort order:
	      tag: 0
	      value expressions:
		    expr: key
		    type: string
	s2
	  TableScan
	    alias: s2
	    Reduce Output Operator
	      sort order:
	      tag: 1
	      value expressions:
		    expr: key
		    type: string
      Reduce Operator Tree:
	Join Operator
	  condition map:
	       Left Outer Join0 to 1
	  condition expressions:
	    0 {VALUE._col0}
	    1 {VALUE._col0}
	  filter predicates:
	    0 {(VALUE._col0 > '2')}
	    1
	  handleSkewJoin: false
	  outputColumnNames: _col0, _col4
	  Select Operator
	    expressions:
		  expr: _col0
		  type: string
		  expr: _col4
		  type: string
	    outputColumnNames: _col0, _col1
	    File Output Operator
	      compressed: false
	      GlobalTableId: 0
	      table:
		  input format: org.apache.hadoop.mapred.TextInputFormat
		  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
		  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1

3.2 Case J2: Join Predicate on Null Supplying table

explain 
select s1.key, s2.key 
from src s1 left join src s2 on s2.key > '2';

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
	s1
	  TableScan
	    alias: s1
	    Reduce Output Operator
	      sort order:
	      tag: 0
	      value expressions:
		    expr: key
		    type: string
	s2
	  TableScan
	    alias: s2
	    Filter Operator
	      predicate:
		  expr: (key > '2')
		  type: boolean
	      Reduce Output Operator
		sort order:
		tag: 1
		value expressions:
		      expr: key
		      type: string
      Reduce Operator Tree:
	Join Operator
	  condition map:
	       Left Outer Join0 to 1
	  condition expressions:
	    0 {VALUE._col0}
	    1 {VALUE._col0}
	  handleSkewJoin: false
	  outputColumnNames: _col0, _col4
	  Select Operator
	    expressions:
		  expr: _col0
		  type: string
		  expr: _col4
		  type: string
	    outputColumnNames: _col0, _col1
	    File Output Operator
	      compressed: false
	      GlobalTableId: 0
	      table:
		  input format: org.apache.hadoop.mapred.TextInputFormat
		  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
		  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1

3.3 Case W1: Where Predicate on Preserved Row table

explain 
select s1.key, s2.key 
from src s1 left join src s2 
where s1.key > '2';

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
	s1
	  TableScan
	    alias: s1
	    Filter Operator
	      predicate:
		  expr: (key > '2')
		  type: boolean
	      Reduce Output Operator
		sort order:
		tag: 0
		value expressions:
		      expr: key
		      type: string
	s2
	  TableScan
	    alias: s2
	    Reduce Output Operator
	      sort order:
	      tag: 1
	      value expressions:
		    expr: key
		    type: string
      Reduce Operator Tree:
	Join Operator
	  condition map:
	       Left Outer Join0 to 1
	  condition expressions:
	    0 {VALUE._col0}
	    1 {VALUE._col0}
	  handleSkewJoin: false
	  outputColumnNames: _col0, _col4
	  Select Operator
	    expressions:
		  expr: _col0
		  type: string
		  expr: _col4
		  type: string
	    outputColumnNames: _col0, _col1
	    File Output Operator
	      compressed: false
	      GlobalTableId: 0
	      table:
		  input format: org.apache.hadoop.mapred.TextInputFormat
		  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
		  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1

3.4 Case W2: Where Predicate on Null Supplying table

explain
select s1.key, s2.key 
from src s1 left join src s2 
where s2.key > '2';

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
	s1
	  TableScan
	    alias: s1
	    Reduce Output Operator
	      sort order:
	      tag: 0
	      value expressions:
		    expr: key
		    type: string
	s2
	  TableScan
	    alias: s2
	    Reduce Output Operator
	      sort order:
	      tag: 1
	      value expressions:
		    expr: key
		    type: string
      Reduce Operator Tree:
	Join Operator
	  condition map:
	       Left Outer Join0 to 1
	  condition expressions:
	    0 {VALUE._col0}
	    1 {VALUE._col0}
	  handleSkewJoin: false
	  outputColumnNames: _col0, _col4
	  Filter Operator
	    predicate:
		expr: (_col4 > '2')
		type: boolean
	    Select Operator
	      expressions:
		    expr: _col0
		    type: string
		    expr: _col4
		    type: string
	      outputColumnNames: _col0, _col1
	      File Output Operator
		compressed: false
		GlobalTableId: 0
		table:
		    input format: org.apache.hadoop.mapred.TextInputFormat
		    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
		    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1

Author: Harish Butani

Created: 2013-11-27 Wed 15:42

Emacs 24.3.1 (Org mode 8.0.3)

Validate XHTML 1.0