Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Clarify some join prerequisites.

...

  • If the tables being joined are bucketized on the join columns, and the buckets are number of buckets in one table is a multiple of each otherof the number of buckets in the other table, the buckets can be joined with each other. If table A has 8 4 buckets are and table B has 4 buckets, the following join
    Code Block
      SELECT /*+ MAPJOIN(b) */ a.key, a.value
      FROM a join b on a.key = b.key
    
    can be done on the mapper only. Instead of fetching B completely for each mapper of A, only the required buckets are fetched. For the query above, the mapper processing bucket 1 for A will only fetch bucket 1 of B. It is not the default behavior, and is governed by the following parameter
    Code Block
      set hive.optimize.bucketmapjoin = true
    
  • If the tables being joined are sorted and bucketized on the join columns, and they have the same number of buckets are same, a sort-merge join can be performed. The corresponding buckets are joined with each other at the mapper. If both A and B have 4 buckets,
    Code Block
      SELECT /*+ MAPJOIN(b) */ a.key, a.value
      FROM A a join B b on a.key = b.key
    
    can be done on the mapper only. The mapper for the bucket for A will traverse the corresponding bucket for B. This is not the default behavior, and the following parameters need to be set:
    Code Block
      set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
      set hive.optimize.bucketmapjoin = true;
      set hive.optimize.bucketmapjoin.sortedmerge = true;