...
- Only equality joins are allowed e.g.
Code Block |
---|
SELECT a.* FROM a JOIN b ON (a.id = b.id)
|
Code Block |
---|
SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department)
|
are both valid joins, however Code Block |
---|
SELECT a.* FROM a JOIN b ON (a.id <> b.id)
|
is NOT allowed.
- More than 2 tables can be joined in the same query e.g.
Code Block |
---|
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
|
is a valid join.
...
- If all but one of the tables being joined are small, the join can be performed as a map only job. The query
Code Block |
---|
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a join b on a.key = b.key
|
does not need a reducer. For every mapper of A, B is read completely. The restriction is that a FULL/RIGHT OUTER JOIN b cannot be performed.
- If the tables being joined are bucketized on the join columns, and the number of buckets in one table is a multiple of the number of buckets in the other table, the buckets can be joined with each other. If table A has 4 buckets and table B has 4 buckets, the following join
Code Block |
---|
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a join b on a.key = b.key
|
can be done on the mapper only. Instead of fetching B completely for each mapper of A, only the required buckets are fetched. For the query above, the mapper processing bucket 1 for A will only fetch bucket 1 of B. It is not the default behavior, and is governed by the following parameter Code Block |
---|
set hive.optimize.bucketmapjoin = true
|
- If the tables being joined are sorted and bucketized on the join columns, and they have the same number of buckets, a sort-merge join can be performed. The corresponding buckets are joined with each other at the mapper. If both A and B have 4 buckets,
Code Block |
---|
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM A a join B b on a.key = b.key
|
can be done on the mapper only. The mapper for the bucket for A will traverse the corresponding bucket for B. This is not the default behavior, and the following parameters need to be set: Code Block |
---|
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
|
...
MapJoin Restrictions
- If all but one of the tables being joined are small, the join can be performed as a map only job. The query
Code Block |
---|
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a join b on a.key = b.key
|
does not need a reducer. For every mapper of A, B is read completely.
- The following is not supported.
- Union Followed by a MapjoinMapJoin
- Lateral View Followed by a MapjoinMapJoin
- Reduce Sink (Group By/Join/Sort By/Cluster By/Distribute By) Followed by MapJoin
- Mapjoin MapJoin Followed by UnionMapjoin
- MapJoin Followed by joinJoin
- Mapjoin MapJoin Followed by MapjoinMapJoin
- The configuration variable hive.auto.convert.join (if set to true) automatically converts the joins to mapjoins at runtime if possible, and it should be used instead of the mapjoin hint. The mapjoin hint should only be used for the following query.
- If all the inputs are bucketed or sorted, and the join should be converted to a bucketized map-side join or bucketized sort-merge join.
- Consider the possibility of multiple map-joins mapjoins on different keys:
Code Block |
---|
select /*+MAPJOIN(smallTableTwo)*/ idOne, idTwo, value FROM
( select /*+MAPJOIN(smallTableOne)*/ idOne, idTwo, value FROM
bigTable JOIN smallTableOne on (bigTable.idOne = smallTableOne.idOne)
) firstjoin
JOIN
smallTableTwo on (firstjoin.idTwo = smallTableTwo.idTwo)
|
The above query is not supported. Without the mapjoin hint, the above query would be executed as 2 map-only jobs. If the user knows in advance that the inputs are small enough to fit in memory, the following configurable parameters can be used to make sure that the query executes in a single map-reduce job. - hive.auto.convert.join.noconditionaltask - Whether Hive enable the optimization about converting common join into mapjoin based on the input file size. If this paramater is on, and the sum of size for n-1 of the tables/partitions for a n-way join is smaller than the specified size, the join is directly converted to a mapjoin (there is no conditional task).
- hive.auto.convert.join.noconditionaltask.size - If hive.auto.convert.join.noconditionaltask is off, this parameter does not take affect. However, if it is on, and the sum of size for n-1 of the tables/partitions for a n-way join is smaller than this size, the join is directly converted to a mapjoin(there is no conditional task). The default is 10MB.
Join Optimization (Hive Version 0.11)
See Join Optimization for information about enhancements to join optimization introduced in Hive version 0.11.0. The use of hints is de-emphasized in the enhanced optimizations (HIVE-3784 and related JIRAs).