Several approaches can be considered to handle this problem. If we have reliable statistics about the data, like detailed histograms, we can process rows with the same value using a new join algorithm (using or dropping rows sharing the same value all in one shot). Or we can divide the rows into pieces such that each piece can fit into memory, and perform the probing in several passes. This probably will bring performance impact. Further, we can resort to regular shuffle join as a fallback option once we figure out Mapjoin cannot handle this situation.

Bloom Filter

A As of Hive 2.0.0, a cheap Bloom filter is built during the build phase of the Hybrid hashtable, which is consulted against before spilling a row into the matchfile. The goal is to minimize the number of records which end up being spilled to disk, which may not have any matches in the spilled hashtables. The optimization also benefits left outer joins since the row which entered the hybrid join can be immediately generated as output with appropriate nulls indicating a lack of match, while without the filter it would have to be serialized onto disk only to be reloaded without a match at the end of the probe.

...

Space shortcuts

Child pages

Versions Compared

Old Version 6

New Version Current

Key

Bloom Filter

Space shortcuts

Child pages

Page History

Versions Compared

Old Version 6

New Version Current

Key

Bloom Filter