THIS IS A TEST INSTANCE. ALL YOUR CHANGES WILL BE LOST!!!!

Page History

Versions Compared

Old Version 4

changes.mady.by.user Unknown User (ruakh)

Saved on Jul 18, 2011

compared with

New Version 5

changes.mady.by.user Steven Wong

Saved on Nov 23, 2011

Key

This line was added.
This line was removed.
Formatting was changed.

Comment: Clarify some join prerequisites.

...

If the tables being joined are bucketized on the join columns, and the buckets are number of buckets in one table is a multiple of each otherof the number of buckets in the other table, the buckets can be joined with each other. If table A has 8 4 buckets are and table B has 4 buckets, the following join
Code Block
SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a join b on a.key = b.key
can be done on the mapper only. Instead of fetching B completely for each mapper of A, only the required buckets are fetched. For the query above, the mapper processing bucket 1 for A will only fetch bucket 1 of B. It is not the default behavior, and is governed by the following parameter
Code Block
set hive.optimize.bucketmapjoin = true

If the tables being joined are sorted and bucketized on the join columns, and they have the same number of buckets are same, a sort-merge join can be performed. The corresponding buckets are joined with each other at the mapper. If both A and B have 4 buckets,
Code Block
SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM A a join B b on a.key = b.key
can be done on the mapper only. The mapper for the bucket for A will traverse the corresponding bucket for B. This is not the default behavior, and the following parameters need to be set:
Code Block
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat; set hive.optimize.bucketmapjoin = true; set hive.optimize.bucketmapjoin.sortedmerge = true;