Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

The first query will not have any skew, so all the Reducers will finish at roughly the same time. If we assume that B has only few rows with B.id = 1, then it will fit into memory. So the join can be done efficiently by storing the B values in an in-memory hash table. This way, the join can be done by the Mapper itself and the data does do not have to go to a Reducer. The partial results of the two queries can then be merged to get the final results.

...

  • The tables A and B have to be read and processed twice.
  • Because of the partial results, the results also have to be read and written twice.
  • The user needs to be aware of the skew in the data and manually do the above process.

...

  • .

We can improve this further by trying to reduce the processing of skewed keys. First read B and store the rows with key 1 in an in-memory hash table. Now run a set of mappers to read A and do the following:

...