Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • If the tables being joined are sorted and bucketized on the join columns, and they have the same number of buckets, a sort-merge join can be performed. The corresponding buckets are joined with each other at the mapper. If both A and B have 4 buckets,
    Code Block
      SELECT /*+ MAPJOIN(b) */ a.key, a.value
      FROM A a join B b on a.key = b.key
    
    can be done on the mapper only. The mapper for the bucket for A will traverse the corresponding bucket for B. This is not the default behavior, and the following parameters need to be set:
    Code Block
      set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
      set hive.optimize.bucketmapjoin = true;
      set hive.optimize.bucketmapjoin.sortedmerge = true;
    

Mapjoin restrictions.

  • If all but one of the tables being joined are small, the join can be performed as a map only job. The query
    Code Block
    
      SELECT /*+ MAPJOIN(b) */ a.key, a.value
      FROM a join b on a.key = b.key
    
    does not need a reducer. For every mapper of A, B is read completely.
  • The following is not supported.
    • Union Followed by a Mapjoin
    • Lateral View Followed by a Mapjoin
    • Reduce Sink (Group By/Join/Sort By/Cluster By/Distribute By) Followed by MapJoin
    • Mapjoin Followed by Union
    • Mapjoin Followed by join
    • Mapjoin Followed by Mapjoin