Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • If the tables being joined are sorted and bucketized on the join columns, and they have the same number of buckets, a sort-merge join can be performed. The corresponding buckets are joined with each other at the mapper. If both A and B have 4 buckets,
    Code Block
      SELECT /*+ MAPJOIN(b) */ a.key, a.value
      FROM A a join B b on a.key = b.key
    
    can be done on the mapper only. The mapper for the bucket for A will traverse the corresponding bucket for B. This is not the default behavior, and the following parameters need to be set:
    Code Block
      set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
      set hive.optimize.bucketmapjoin = true;
      set hive.optimize.bucketmapjoin.sortedmerge = true;
    

Mapjoin restrictions

...

  • If all but one of the tables being joined are small, the join can be performed as a map only job. The query
    Code Block
      SELECT /*+ MAPJOIN(b) */ a.key, a.value
      FROM a join b on a.key = b.key
    
    does not need a reducer. For every mapper of A, B is read completely.
  • The following is not supported.
    • Union Followed by a Mapjoin
    • Lateral View Followed by a Mapjoin
    • Reduce Sink (Group By/Join/Sort By/Cluster By/Distribute By) Followed by MapJoin
    • Mapjoin Followed by Union
    • Mapjoin Followed by join
    • Mapjoin Followed by Mapjoin
  • The configuration variable hive.auto.convert.join (if set to true) automatically converts the joins to mapjoins at runtime if possible, and it should be used instead of the mapjoin hint. The mapjoin hint should only be used for the following query.
    • If all the inputs are bucketed or sorted, and the join should be converted to a bucketized map-side join or bucketized sort-merge join.
  • Consider the possibility of multiple map-joins on different keys:
    Code Block
    
    select /*+MAPJOIN(smallTableTwo)*/ idOne, idTwo, value FROM
      ( select /*+MAPJOIN(smallTableOne)*/ idOne, idTwo, value FROM
        bigTable JOIN smallTableOne on (bigTable.idOne = smallTableOne.idOne)                                                   
      ) firstjoin                                                             
      JOIN                                                                  
      smallTableTwo on (firstjoin.idTwo = smallTableTwo.idTwo)                       
    

Without the mapjoin hint, the above query would be executed as 2 map-only jobs. If the user knows in advance that the inputs are small enough to fit in memory, the following configurable parameters can be used to make sure that the query executes in a single map-reduce job.

    • hive.auto.convert.join.noconditionaltask - Whether Hive enable the optimization about converting common join into mapjoin based on the input file size. If this paramater is on, and the sum of size for n-1 of the tables/partitions for a n-way join is smaller than the specified size, the join is directly converted to a mapjoin (there is no conditional task).
    • hive.auto.convert.join.noconditionaltask.size - If hive.auto.convert.join.noconditionaltask is off, this parameter does not take affect. However, if it is on, and the sum of size for n-1 of the tables/partitions for a n-way join is smaller than this size, the join is directly converted to a mapjoin(there is no conditional task). The default is 10MB