Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Optimizing Skewed Joins

The Problem

A join of 2 large data tables is done by a set of MapReduce jobs which first sorts the tables based on the join key and then joins them. The Mapper gives all rows with a particular key to the same Reducer. If the key is highly skewed, this can cause certain Reducers to become the bottleneck. If we have the skew information, we can remove this bottleneck as follows:For keys which are not skewed, the join can be done as usual. For the skewed keys, we can do a join with only those rows in opposite table which have this key.

For e.g.,
Suppose we have table A has keys with a key column, "id" which has values 1, 2, 3 and 4, of which 1 is skewed. We want to join this with table B which has keys 1, 2 and 3. Currently, it will be done as follows: A set of mappers read both the tables and pass on each key to a separate reducer. Since keys 2 and 3 are small, they will complete their job quickly. But since key 1 is skewed, it will continue running even after the others have completed, and table B with a similar column, which has values 1, 2 and 3.
We want to do a join corresponding to the following query

  • select A.id from A join B on A.id = B.id

A set of Mappers read the tables and gives them to Reducers based on the keys. e.g., rows with key 1 go to Reducer R1, rows with key 2 go to Reducer R2 and so on. These Reducers do a cross product of the values from A and B, and write the output. The Reducer R4 gets rows from A, but will not produce any results.

Now let's assume that A was highly skewed in favor of id = 1. Reducers R2 and R3 will complete quickly but R1 will continue for a long time, thus becoming the bottleneck. If the user has information about the skew, the bottleneck can be avoided manually as follows:

Do two separate queries

  • select A.id from A join B on A.id = B.id where A.id <> 1;
  • select A.id from A join B on A.id = B.id where A.id = 1 and B.id = 1;

The first query will not have any skew, so all the Reducers will finish at roughly the same time. If we assume that B has only few rows with B.id = 1, then it will fit into memory. So the join can be done efficiently by storing the B values in an in-memory hash table. This way, the join can be done by the Mapper itself and the data does not have to go to a Reducer. The partial results of the two queries can then be merged to get the final results.

Advantages

  • If a small number of skewed keys make up for a significant percentage of the data, they will not become bottlenecks.

Disadvantages

  • The tables A and B have to be read and processed twice.
  • The user needs to be aware of the skew in the data and manually do the above process.

To improve this, we will first do the usual join of A without 1 and B, and store it in tmp1. Then we will join the key 1 of A with key 1 of B and store it in tmp2. The partial results, tmp1 and tmp2 are then merged to get the final answer. Here, we do not have to restrict key 1 to a single reducer. So it doesn't become the bottleneck.

...