Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: fix broken link

...

No Format
set hive.auto.convert.join=true;
select count(*) from
store_sales join time_dim on (ss_sold_time_sk = t_time_sk)

The default value for hive.auto.convert.join was false in Hive 0.10.0.  Hive 0.11.0 changed the default to true (HIVE-3297). Note that hive-default.xml.template incorrectly gives the default as false in Hive 0.11.0 through 0.13.1.

...

  1. Merge M*-MR patterns into a single MR.
  2. Merge MJ->MJ into a single MJ when possible.
  3. Merge MJ* patterns into a single Map stage as a chain of MJ operators. (Not yet implemented.)

If hive.auto.convert.join is set to true the optimizer not only converts joins to mapjoins but also merges MJ* patterns as much as possible.

...

Info
titleVersion 0.13.0 and later

Hive 0.13.0 introduced hive.auto.convert.join.use.nonstaged with a default of false (HIVE-6144).

For conditional joins, if the input stream from a small alias can be directly applied to the join operator without filtering or projection, then it does not need to be pre-staged in the distributed cache via a MapReduce local task. Setting hive.auto.convert.join.use.nonstaged to true will avoid avoids pre-staging in those cases.

...

Current Optimization
  1. Group as many MJ operators as possible into one MJ.

...

No Format
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

There is an option to set the big table selection policy using the following configuration:

...

The names describe their uses. This is especially useful for the fact-fact join (query 82 in the TPC DS benchmark).

SMB Join across Tables with Different Keys

If the tables have differing number of keys, for example Table A has 2 SORT columns and Table B has 1 SORT column, then you might get an index out of bounds exception.

The following query results in an index out of bounds exception because emp_person let us say for example has 1 sort column while emp_pay_history has 2 sort columns.

Code Block
languagesql
titleError Hive 0.11
SELECT p.*, py.*
FROM emp_person p INNER JOIN emp_pay_history py
ON   p.empid = py.empid

This works fine.

Code Block
languagesql
titleWorking query Hive 0.11
SELECT p.*, py.*
FROM emp_pay_history py INNER JOIN emp_person p
ON   p.empid = py.empid

 

Generate Hash Tables on the Task Side

...