
Table of Contents

Overview

We are proposing an enhanced hash join algorithm called “Hybrid Hybrid GRACE Hash Join”. This feature provides the following benefits:

  • The query will not fail even if the estimated memory requirement is slightly wrong.

  • Expensive garbage collection overhead can be avoided as the hash table grows.

  • Joins can be executed with a MapJoin operator even when the small table doesn't fit in memory, since spilling some data from the build and probe sides is still cheaper than shuffling the large fact table.

The design is based on Hadoop’s parallel processing capability and the significant amount of memory available.

See HIVE-9277 for the current status of this work.

Scope

This new join algorithm will only work with Tez; MapReduce is currently not supported.

Notation and Assumptions

The goal is to compute the equi-join result of two relations, labeled R and S. We assume R is the small table, with the smaller cardinality, and S is the big table, with the larger cardinality.

Brief Review on Hash Join Algorithms

...

Simple Hash Join

Also known as Classic Hash Join, it is used when the hash table for R can fit entirely in memory.

In general, all hash join algorithms described here have two phases: a build phase and a probe phase. During the build phase, a hash table is built in memory from the joining column(s) of the small table R. During the probe phase, the big table S is scanned sequentially, and for each row of S the hash table is probed for matching rows. If a match is found, the pair is output; otherwise the row of S is dropped and the scan of S continues.
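As a minimal illustration of the two phases, here is a sketch of a classic in-memory hash join in Java. The row layout (an int join key plus an int payload) and the class name are made up for this example and do not reflect Hive's operator code.

import java.util.*;

public class SimpleHashJoinSketch {
    // Build phase: hash the small table R on its join key.
    static Map<Integer, List<String>> build(List<int[]> r) {
        Map<Integer, List<String>> hashTable = new HashMap<>();
        for (int[] row : r) {
            hashTable.computeIfAbsent(row[0], k -> new ArrayList<>()).add("R" + row[1]);
        }
        return hashTable;
    }

    // Probe phase: scan the big table S once and look up each key.
    static List<String> probe(Map<Integer, List<String>> hashTable, List<int[]> s) {
        List<String> result = new ArrayList<>();
        for (int[] row : s) {
            List<String> matches = hashTable.get(row[0]);
            if (matches == null) continue;              // no match: drop the S row
            for (String m : matches) {
                result.add(m + "-S" + row[1]);          // output the joined pair
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<int[]> r = Arrays.asList(new int[]{1, 10}, new int[]{2, 20});
        List<int[]> s = Arrays.asList(new int[]{1, 100}, new int[]{3, 300});
        System.out.println(probe(build(r), s));         // prints [R10-S100]
    }
}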

...

  1. Small table R is scanned, and during the scan a hash function is used to distribute the rows into different output buffers (or partitions). The size of each output buffer should be set as close to the memory limit as possible, but not above it. After R has been completely scanned, all output buffers are flushed to disk.

  2. Similarly, big table S is scanned, partitioned and flushed to disk the same way. The hash function used here is the same one as in the previous step.

  3. Load one partition of R into memory and build a hash table for it. This is the build phase.

  4. Hash each row of the corresponding partition of S, and probe for a match in R’s hash table. If a match is found, output the pair to the result, otherwise, proceed with the next row in the current S partition. This is the probe phase.

  5. Repeat loading and building hash table for partitions of R and probing partitions of S, until both are exhausted.

It can be seen that there are extra partitioning steps in this algorithm (steps 1 & 2). The rest of the steps are the same as in Classic Hash Join, i.e., building and probing. One assumption here is that each partition of R can fit completely in memory.
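A rough Java sketch of the GRACE scheme follows, reusing the same simplified row layout as the sketch above. Plain lists stand in for the on-disk partition files, and the hash function is simply the key modulo the number of partitions; none of this reflects the actual Hive code.

import java.util.*;

public class GraceHashJoinSketch {
    static final int NUM_PARTITIONS = 4;

    // Steps 1 & 2: distribute rows into partitions with the same hash function for R and S.
    static List<List<int[]>> partition(List<int[]> rows) {
        List<List<int[]>> parts = new ArrayList<>();
        for (int i = 0; i < NUM_PARTITIONS; i++) parts.add(new ArrayList<>());
        for (int[] row : rows) {
            parts.get(Math.floorMod(row[0], NUM_PARTITIONS)).add(row);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<int[]> r = Arrays.asList(new int[]{1, 10}, new int[]{2, 20}, new int[]{5, 50});
        List<int[]> s = Arrays.asList(new int[]{1, 100}, new int[]{5, 500}, new int[]{7, 700});
        List<List<int[]>> rParts = partition(r), sParts = partition(s);

        for (int p = 0; p < NUM_PARTITIONS; p++) {
            // Steps 3-5: build an in-memory hash table for one R partition at a time
            // (one row per key, for brevity), then probe it with the matching S partition.
            Map<Integer, Integer> hashTable = new HashMap<>();
            for (int[] row : rParts.get(p)) hashTable.put(row[0], row[1]);
            for (int[] row : sParts.get(p)) {
                Integer match = hashTable.get(row[0]);
                if (match != null) System.out.println(match + "-" + row[1]);
            }
        }
    }
}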

...

It’s obvious that GRACE Hash Join uses main memory only as a staging area during the partitioning phase: it unconditionally scans both relations from disk, partitions them and writes the partitions back to disk (Hybrid GRACE Hash Join writes one partition pair less), and finally reads them back to find matches.

...

The key factor that will impact the performance of this algorithm is whether the data can be evenly distributed into different hash partitions. If we have skewed values, which result in a few very big partitions, then an extra partitioning step is needed to break the big partitions down into smaller ones. This can happen recursively. Refer to the Recursive Hashing and Spilling section below for more details.

Algorithm

As in other hash join algorithms, there are a build phase and a probe phase, but there are several new terms to be defined first.

  • Partition. Instead of putting all keys of the small table into a single hash table, they are put into many hash tables based on their hash value. These hash tables are known as partitions. A partition can be either in memory, when initially created, or spilled to disk when the memory is used up.

  • Sidefile. A row container on disk that holds rows from the small table destined for a specific partition that has already been spilled to disk. There can be zero or one sidefile for each partition spilled to disk.

  • Matchfile. A row container on disk that holds rows from the big table that have possible matching keys in partitions spilled to disk. It also has a one-to-one relationship to the partitions. (A sketch of these structures follows this list.)
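The following sketch shows how these three structures could hang together for a single partition. The class and field names are hypothetical and are not the actual Hive classes; plain lists stand in for the on-disk containers.

import java.util.*;

public class HashPartitionSketch {
    Map<Integer, List<int[]>> inMemoryHashTable = new HashMap<>(); // live until the partition spills
    List<int[]> sidefile = new ArrayList<>();  // small-table rows arriving after this partition spilled
    List<int[]> matchfile = new ArrayList<>(); // big-table rows whose keys hash to this spilled partition
    boolean spilled = false;

    void addBuildRow(int[] row) {
        if (!spilled) {
            inMemoryHashTable.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
        } else {
            sidefile.add(row);   // appended "on disk", to be merged back later
        }
    }

    void spill() {
        spilled = true;
        inMemoryHashTable.clear(); // in the real design the hash table is written out to disk first
    }

    public static void main(String[] args) {
        HashPartitionSketch p = new HashPartitionSketch();
        p.addBuildRow(new int[]{1, 10}); // goes into the in-memory hash table
        p.spill();                       // memory pressure: partition moves to disk
        p.addBuildRow(new int[]{1, 11}); // goes into the sidefile instead
        System.out.println("sidefile rows: " + p.sidefile.size()); // prints 1
    }
}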

...

If a possible match is found, i.e., the key hash of the big-table row falls into an on-disk partition, the row is put into an on-disk matchfile. No matching is done at this moment; it will be done later.
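A small runnable sketch of this probe-time decision is shown below, with made-up per-partition structures (a spilled flag, an in-memory table and a matchfile for each partition).

import java.util.*;

public class ProbeSideSketch {
    static final int NUM_PARTITIONS = 4;
    static boolean[] spilled = new boolean[NUM_PARTITIONS];
    static List<Map<Integer, Integer>> inMemory = new ArrayList<>();
    static List<List<int[]>> matchfiles = new ArrayList<>();

    public static void main(String[] args) {
        for (int i = 0; i < NUM_PARTITIONS; i++) {
            inMemory.add(new HashMap<>());
            matchfiles.add(new ArrayList<>());
        }
        inMemory.get(1).put(5, 50); // key 5 lives in in-memory partition 1
        spilled[3] = true;          // partition 3 has been spilled to disk

        probe(new int[]{5, 500});   // matched immediately, output right away
        probe(new int[]{7, 700});   // hashes to spilled partition 3, staged into its matchfile
        System.out.println("matchfile 3 size: " + matchfiles.get(3).size());
    }

    static void probe(int[] sRow) {
        int p = Math.floorMod(sRow[0], NUM_PARTITIONS);
        if (spilled[p]) {
            matchfiles.get(p).add(sRow); // defer: matching happens later, after merge-back
        } else {
            Integer match = inMemory.get(p).get(sRow[0]);
            if (match != null) System.out.println(match + "-" + sRow[1]);
            // no match against an in-memory partition: the S row can simply be dropped
        }
    }
}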

...

After the big table has been exhausted, all the in-memory partitions are no longer needed and are purged. Purging is appropriate because the keys that should match have already been output to the result, and the keys that don’t match are not relevant to the on-disk structures.

...

There are cases where the hash function does not distribute values evenly among the hash partitions, or the values themselves are skewed, which results in very big sidefiles. When the on-disk partition and sidefile are to be merged back into memory, the size of the newly combined hash partition can exceed the memory limit. In that case, another round of hashing and spilling is necessary. The process is illustrated below.

Assume there’s only 1GB of memory available for the join. Hash function h1 distributes keys into two partitions, HT1 and HT2. At some point in time, the memory is full and HT1 is moved to disk. Assume all future keys hash to HT1, so they go to Sidefile1.

...

As usual, during the probe phase, big-table values are either matched and put into the result, or not matched and put into a matchfile.

...

After the big table is exhausted, all of its values have either been output (matched) to the result or staged into matchfiles. Then it is time to merge back each pair of on-disk hash partition and sidefile, build a new in-memory hash table from them, and probe the corresponding matchfile against it.

...

Now we probe HT 3 with Matchfile 1. Matching values go into the result. Non-matching values go to Matchfile 4.


 


This process can continue recursively if the size of HT 4 plus the size of Sidefile 4 is still greater than the memory limit, i.e., HT 4 and Sidefile 4 are hashed using a third hash function h3, and Matchfile 4 is probed using h3. In this example, the size of HT 4 plus the size of Sidefile 4 is smaller than the memory limit, so we are done.
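The merge-back and re-spill loop from this example can be sketched roughly as below. The memory limit (a row count), the stand-in hash functions and the data are purely illustrative and not taken from the Hive implementation.

import java.util.*;
import java.util.function.IntUnaryOperator;

public class RecursiveSpillSketch {
    static final int MEMORY_LIMIT_ROWS = 2; // illustrative budget, in rows

    // Combine an on-disk partition with its sidefile; if the result still does not
    // fit, split it again with the next hash function (h2, h3, ...) and recurse.
    static void mergeBack(List<Integer> partitionKeys, List<Integer> sidefileKeys, int level) {
        List<Integer> combined = new ArrayList<>(partitionKeys);
        combined.addAll(sidefileKeys);
        if (combined.size() <= MEMORY_LIMIT_ROWS) {
            // Build an in-memory hash table here and probe the corresponding matchfile against it.
            System.out.println("level " + level + ": build hash table on " + combined);
            return;
        }
        // Note: if all keys are identical, re-hashing cannot split them
        // (see the Skewed Data Distribution section below).
        IntUnaryOperator h = k -> (k >> level) & 1; // stand-in for the next hash function
        List<Integer> left = new ArrayList<>(), right = new ArrayList<>();
        for (int k : combined) (h.applyAsInt(k) == 0 ? left : right).add(k);
        mergeBack(left, Collections.emptyList(), level + 1);
        mergeBack(right, Collections.emptyList(), level + 1);
    }

    public static void main(String[] args) {
        // A spilled partition with three keys, plus two more keys that landed in its sidefile.
        mergeBack(Arrays.asList(1, 2, 3), Arrays.asList(4, 5), 1);
    }
}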

Skewed Data Distribution

Sometimes it happens that all or most rows in one hash partition share the same key value. In that case, changing the hash function won’t help, since identical values always hash to the same partition no matter which hash function is used.

Several approaches can be considered to handle this problem. If we have reliable statistics about the data, such as detailed histograms, we can process the rows with the same value using a new join algorithm (using or dropping all rows sharing that value in one shot). Alternatively, we can divide the rows into pieces such that each piece fits in memory, and perform the probing in several passes, which will likely have a performance impact. Finally, we can fall back to a regular shuffle join once we determine that MapJoin cannot handle the situation.
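As an illustration of the multi-pass option mentioned above (a sketch only, not what Hive currently implements), build rows sharing one key can be processed in memory-sized chunks, re-scanning the probe rows once per chunk:

import java.util.*;

public class MultiPassProbeSketch {
    static final int CHUNK_ROWS = 2; // illustrative memory budget, in build rows

    public static void main(String[] args) {
        // All rows on both sides share the same join key, so every pair matches.
        List<String> buildRows = Arrays.asList("r1", "r2", "r3", "r4", "r5");
        List<String> probeRows = Arrays.asList("s1", "s2");

        for (int start = 0; start < buildRows.size(); start += CHUNK_ROWS) {
            // Load one memory-sized chunk of the build rows ...
            List<String> chunk = buildRows.subList(start, Math.min(start + CHUNK_ROWS, buildRows.size()));
            // ... and make one extra pass over the probe rows for this chunk.
            for (String s : probeRows) {
                for (String r : chunk) {
                    System.out.println(r + "-" + s);
                }
            }
        }
    }
}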

...

A Bloom filter can be created during the build phase and used during the probe phase, which can greatly reduce the size of an on-disk matchfile. This part will be described in detail in the next version of this design doc.
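As a rough illustration of the idea (the eventual Hive implementation may differ), a small hand-rolled Bloom filter over the small table's keys could be consulted before staging a big-table row, so that rows that definitely have no match are dropped instead of spilled:

import java.util.*;

public class BloomFilterSketch {
    static final int BITS = 1 << 16;
    static BitSet bits = new BitSet(BITS);

    // Two simple hash functions; real implementations use better ones.
    static int h1(int key) { return Math.floorMod(key * 0x9E3779B1, BITS); }
    static int h2(int key) { return Math.floorMod(key * 0x85EBCA6B + 1, BITS); }

    static void add(int key)             { bits.set(h1(key)); bits.set(h2(key)); }
    static boolean mightContain(int key) { return bits.get(h1(key)) && bits.get(h2(key)); }

    public static void main(String[] args) {
        int[] smallTableKeys = {3, 17, 42};
        for (int k : smallTableKeys) add(k); // build phase: populate the filter

        int[] bigTableKeys = {42, 1000};     // probe phase: test before staging
        for (int k : bigTableKeys) {
            if (mightContain(k)) {
                System.out.println(k + ": keep (match in memory or stage into a matchfile)");
            } else {
                System.out.println(k + ": definitely no match, drop without spilling");
            }
        }
    }
}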

...

  • Hybrid Hybrid Grace Hash Join presentation by Mostafa

  • MapJoinOptimization https://cwiki.apache.org/confluence/display/Hive/MapJoinOptimization 

  • HIVE-1641 add map joined table to distributed cache

  • HIVE-1642 Convert join queries to map-join based on size of table/row

  • Ramakrishnan, R. and Gehrke, J.  Database Management Systems, 3rd ed.

  • Kitsuregawa, M.  Application of Hash to Data Base Machine and Its Architecture

  • Shapiro, L. D.  Join Processing in Database Systems with Large Main Memories

  • DeWitt, David J.  Implementation Techniques for Main Memory Database Systems

  • Jimmy Lin and Chris Dyer  Data-Intensive Text Processing with MapReduce