...
No Format |
---|
set hive.auto.convert.join=true; select count(*) from store_sales join time_dim on (ss_sold_time_sk = t_time_sk) |
The default value for hive.auto.convert.join was false in Hive 0.10.0. Hive 0.11.0 changed the default to true (HIVE-3297). Note that hive-default.xml.template incorrectly gives the default as false in Hive 0.11.0 through 0.13.1.
MAPJOINs are processed by loading the smaller table into an in-memory hash map and matching keys with the larger table as they are streamed through. The prior implementation has this division of labor:
...
- Merge M*-MR patterns into a single MR.
- Merge MJ->MJ into a single MJ when possible.
- Merge MJ* patterns into a single Map stage as a chain of MJ operators. (Not yet implemented.)
If hive.auto.convert.join
is set to true the optimizer not only converts joins to mapjoins but also merges MJ* patterns as much as possible.
...
No Format |
---|
set hive.auto.convert.join.noconditionaltask = true; set hive.auto.convert.join.noconditionaltask.size = 1000010000000; |
The default for hive.auto.convert.join.noconditionaltask
is true which means auto conversion is enabled. (Originally the default was false – see HIVE-3784 – but it was changed to true by HIVE-4146 before Hive 0.11.0 was released.) false which means auto conversion is disabled.
The size configuration enables the user to control what size table can fit in memory. This value represents the sum of the sizes of tables that can be converted to hashmaps that fit in memory. Currently, n-1 tables of the join have to fit in memory for the map-join optimization to take effect. There is no check to see if the table is a compressed one or not and what the potential size of the table can be. The effect of this assumption on the results is discussed in the next section.
...
Auto join conversion also affects the sort-merge-bucket joins.
Info | ||
---|---|---|
| ||
Hive 0.13.0 introduced For conditional joins, if the input stream from a small alias can be directly applied to the join operator without filtering or projection, then it does not need to be pre-staged in the distributed cache via a MapReduce local task. Setting |
Current Optimization
- Group as many MJ operators as possible into one MJ.
...
No Format |
---|
set hive.auto.convert.sortmerge.join=true; set hive.optimize.bucketmapjoin = true; set hive.optimize.bucketmapjoin.sortedmerge = true; set hive.auto.convert.sortmerge.join.noconditionaltask=true; |
There is an option to set the big table selection policy using the following configuration:
...
The names describe their uses. This is especially useful for the fact-fact join (query 82 in the TPC DS benchmark).
SMB Join across Tables with Different Keys
If the tables have differing number of keys, for example Table A has 2 SORT columns and Table B has 1 SORT column, then you might get an index out of bounds exception.
The following query results in an index out of bounds exception because emp_person let us say for example has 1 sort column while emp_pay_history has 2 sort columns.
Code Block | ||||
---|---|---|---|---|
| ||||
SELECT p.*, py.*
FROM emp_person p INNER JOIN emp_pay_history py
ON p.empid = py.empid |
This works fine.
Code Block | ||||
---|---|---|---|---|
| ||||
SELECT p.*, py.*
FROM emp_pay_history py INNER JOIN emp_person p
ON p.empid = py.empid
|
Generate Hash Tables on the Task Side
...