Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

For e.g.,
Suppose table A has keys 1,2,3 and 4, of which 1 is skewed. We want to join this with table B which has keys 1, 2 and 3. Currently, it will be done as follows: A set of mappers read both the tables and pass on each key to a separate reducer. Since keys 2 and 3 are small, they will complete their job quickly. But since key 1 is skewed, it will continue running even after the others have completed.

To improve this, we We will first do the usual join of A without 1 and B, and store it in tmp1. Then we will join the key 1 of A with key 1 of B and store it in tmp2. The partial results, tmp1 and tmp2 can be are then merged to get the final answer.

The assumption is that B has few rows with key 1. So these rows can be loaded into the memory.

Here, we do not have to restrict key 1 to a single reducer. So it doesn't become the bottleneck.

We can improve this further by trying to reduce the processing of skewed keys. Writing and reading the partial results is redundant. We can avoid this by proceeding as follows: First read B and store the rows with key 1 in an in-memory hash table. Now run a set of mappers to read A and do the following:

  • If it has key 1, then use the hashed version of B to compute the result.
  • For all other keys, send it to a reducer which does the join. This reducer will get rows of B also from a mapper.

This way, we only end up reading only B twice. The skewed keys in A are only read and processed by the Mapper, and not sent to the reducer. The rest of the keys in A go through only a single Map/Reduce.

The assumption is that B has few rows with keys which are skewed in A. So these rows can be loaded into the memory.