Skewed Join Optimization

Optimizing Skewed Joins

A join of 2 large data tables is done by a set of MapReduce jobs which first sorts the tables based on the join key and then joins them. The Mapper gives all rows with a particular key to the same Reducer. If the key is highly skewed, this can cause certain Reducers to become the bottleneck. If we have the skew information, we can remove this bottleneck as follows:

For keys which are not skewed, the join can be done as usual. For the skewed keys, we can do a join with only those rows in opposite table which have this key.

For e.g.,
Suppose table A has keys 1,2,3 and 4, of which 1 is skewed. We want to join this with table B which has keys 1, 2 and 3. We will first do the usual join of A without 1 and B and store it in tmp1. Then we will join the key 1 of A with key 1 of B and store it in tmp2. The partial results tmp1 and tmp2 can be merged to get the final answer.

The assumption is that B has few rows with key 1. So these rows can be loaded into the memory.

Writing and reading the partial results is redundant. We can avoid this by proceeding as follows: First read B and store the rows with key 1 in an in-memory hash table. Now run a set of mappers to read A and do the following:

If it has key 1, then use the hashed version of B to compute the result.
For all other keys, send it to a reducer which does the join. This reducer will get rows of B also from a mapper.

This way, we only end up reading B twice.

Space shortcuts

Child pages

Optimizing Skewed Joins