You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 2 Next »

Optimizing Skewed Joins

A join of 2 large data tables is done by a set of MapReduce jobs which first sorts the tables based on the join key and then joins them. The Mapper gives all rows with a particular key to the same Reducer. If the key is highly skewed, this can cause certain Reducers to become the bottleneck. If we have the skew information, we can remove this bottleneck as follows:

For keys which are not skewed, the join can be done as usual. For the skewed keys, we can do a join with only those rows in opposite table which have this key.

For e.g.,
Suppose table A has keys 1,2,3 and 4, of which 1 is skewed. We want to join this with table B which has keys 1, 2 and 3. Currently, it will be done as follows: A set of mappers read both the tables and pass on each key to a separate reducer. Since keys 2 and 3 are small, they will complete their job quickly. But since key 1 is skewed, it will continue running even after the others have completed.

To improve this, we will first do the usual join of A without 1 and B, and store it in tmp1. Then we will join the key 1 of A with key 1 of B and store it in tmp2. The partial results, tmp1 and tmp2 are then merged to get the final answer. Here, we do not have to restrict key 1 to a single reducer. So it doesn't become the bottleneck.

We can improve this further by trying to reduce the processing of skewed keys. First read B and store the rows with key 1 in an in-memory hash table. Now run a set of mappers to read A and do the following:

  • If it has key 1, then use the hashed version of B to compute the result.
  • For all other keys, send it to a reducer which does the join. This reducer will get rows of B also from a mapper.

This way, we end up reading only B twice. The skewed keys in A are only read and processed by the Mapper, and not sent to the reducer. The rest of the keys in A go through only a single Map/Reduce.

The assumption is that B has few rows with keys which are skewed in A. So these rows can be loaded into the memory.

  • No labels