Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

The first query will not have any skew, so all the Reducers will finish at roughly the same time. If we assume that B has only few rows with B.id = 1, then it will fit into memory. So the join can be done efficiently by storing the B values in an in-memory hash table. This way, the join can be done by the Mapper itself and the data do not have to go to a Reducer. The partial results of the two queries can then be merged to get the final results.

  • Advantages
    • If a small number of skewed keys make up for a significant percentage of the data, they will not become bottlenecks.
  • Disadvantages
    • The tables A and B have to be read and processed twice.
    • Because of the partial results, the results also have to be read and written twice.
    • The user needs to be aware of the skew in the data and manually do the above process.

We can improve this further by trying to reduce the processing of skewed keys. First read B and store the rows with key 1 in an in-memory hash table. Now run a set of mappers to read A and do the following:

...

The assumption is that B has few rows with keys which are skewed in A. So these rows can be loaded into the memory.

Hive Enhancements

Hive needs to be extended to support the following:

    • create table <T> (schema) skewed by (keys) with skew (values);
    • alter table <T> (schema) skewed by (keys) with skew (values);

e.g.,

    • create table T (c1 string, c2 string) skewed by (c1) with skew ('x1');
    • alter table T (c1 string, c2 string, c3 string) skewed by (c1, c2) with skew (('x1', 'y1'), ('x2', 'y2));