THIS IS A TEST INSTANCE. ALL YOUR CHANGES WILL BE LOST!!!!

Page History

Versions Compared

Key

This line was added.
This line was removed.
Formatting was changed.

...

If the tables being joined are sorted and bucketized on the join columns, and they have the same number of buckets, a sort-merge join can be performed. The corresponding buckets are joined with each other at the mapper. If both A and B have 4 buckets,
Code Block
SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM A a join B b on a.key = b.key
can be done on the mapper only. The mapper for the bucket for A will traverse the corresponding bucket for B. This is not the default behavior, and the following parameters need to be set:
Code Block
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat; set hive.optimize.bucketmapjoin = true; set hive.optimize.bucketmapjoin.sortedmerge = true;

Mapjoin restrictions.

If all but one of the tables being joined are small, the join can be performed as a map only job. The query
Code Block
SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a join b on a.key = b.key
does not need a reducer. For every mapper of A, B is read completely.

The following is not supported.
- Union Followed by a Mapjoin
- Lateral View Followed by a Mapjoin
- Reduce Sink (Group By/Join/Sort By/Cluster By/Distribute By) Followed by MapJoin
- Mapjoin Followed by Union
- Mapjoin Followed by join
- Mapjoin Followed by Mapjoin