...
The first query will not have any skew, so all the Reducers will finish at roughly the same time. If we assume that B has only few rows with B.id = 1, then it will fit into memory. So the join can be done efficiently by storing the B values in an in-memory hash table. This way, the join can be done by the Mapper itself and the data does do not have to go to a Reducer. The partial results of the two queries can then be merged to get the final results.
...
- The tables A and B have to be read and processed twice.
- Because of the partial results, the results also have to be read and written twice.
- The user needs to be aware of the skew in the data and manually do the above process.
...
- .
We can improve this further by trying to reduce the processing of skewed keys. First read B and store the rows with key 1 in an in-memory hash table. Now run a set of mappers to read A and do the following:
...