Goal
The top-level problem is as follows:
There are many tables of the following format:
- create table T(a, b, c, ....., x) partitioned by (ds);
and the following queries need to be performed efficiently:
- select ... from T where x = 10;
The cardinality of 'x' is in the thousands per partition of T. Moreover, the values of 'x' are skewed: in general, there are ~10 values of 'x' with a very large skew, and the remaining
values of 'x' have a small cardinality. Also note that this mapping can change daily (the values of 'x' with a high skew today may be different tomorrow).
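For concreteness, a minimal instance of such a table and query might look like the following (the column names and types other than 'x' and 'ds' are purely illustrative; the real tables have many more columns):
- create table T (a string, b string, c string, x int) partitioned by (ds string);
- select a, b, c from T where ds = '2012-04-15' and x = 10;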
The above requirement can be solved in the following ways:
Basic Partitioning
Create a partition per value of 'x'.
- create table T(a,b,c, .......) partitioned by (ds, x)
- Advantages
- Existing Hive is good enough.
- Disadvantages
- HDFS scalability: The number of files in HDFS increases.
- HDFS scalability: The number of intermediate files in HDFS increases. For example, if there are 1000 mappers and 1000 partitions, and each mapper gets at least 1 row for each key, we will end up creating 1 million intermediate files.
- Metastore scalability: Will the metastore scale with the number of partitions?
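As a rough sketch of how such a table could be populated (the staging table T_staging and the settings shown are illustrative, not part of the proposal), a dynamic-partition insert creates one partition per value of 'x':
- set hive.exec.dynamic.partition=true;
- set hive.exec.dynamic.partition.mode=nonstrict;
- insert overwrite table T partition (ds='2012-04-15', x) select a, b, c, x from T_staging where ds = '2012-04-15';
Every distinct value of 'x' in the input then becomes its own sub-directory under ds='2012-04-15', which is what drives the HDFS file-count and metastore concerns listed above.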
List Bucketing
The basic idea here is as follows: identify the keys with a high skew. Have one file per skewed key, and the remaining keys go into a separate file. This mapping is maintained in the metastore at a partition level, and is used by the
Hive compiler to do input pruning. The list of skewed keys is stored at the table level (note that this list can initially be supplied by the client and refreshed periodically, and can eventually be updated automatically when a new partition is being loaded).
For example, suppose the table maintains the list of skewed keys for 'x': 6, 20, 30, 40. When a new partition is loaded, it will create 5 files (4 files for the skewed keys + 1 file for all the remaining keys). The partition that got loaded will have the
following mapping: 6, 20, 30, 40, others. This is similar to hash bucketing today, where the bucket number determines the file number. Since the skewed keys need not be consecutive, the entire list of skewed keys needs to be stored
with each partition.
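As an illustration of how the client might derive the initial skewed-key list (this query is only a sketch; the proposal does not mandate any particular mechanism), a simple aggregation over a recent partition surfaces the heavy values of 'x':
- select x, count(1) as cnt from T where ds = '2012-04-14' group by x order by cnt desc limit 10;
The top values returned (6, 20, 30, 40 in the example above) would be the candidates for the table-level skewed-key list.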
When a query of the form
- select .. from T where ds = '2012-04-15' and x = 30;
is issued, the Hive compiler will only use the file corresponding to x=30 for the map-reduce job.
For a query of the form
- select .. from T where ds = '2012-04-15' and x = 50;
the Hive compiler will only use the file corresponding to x=others for the map-reduce job.
This approach is good under the following assumptions:
- There are a few skewed keys per partition, which account for a significant percentage of the total data. In the above example, if the skewed keys (6,20,30 and 40) only occupy a small percentage of the data (say 20%), the queries of the form x=50 will still need to scan the remaining data (~80%).
- The number of skewed keys is small. This list is stored in the metastore, so it does not make sense to store 1 million skewed keys per partition.
This approach can be extended to the scenario where there is more than one clustering key. Say we want to optimize queries of the form
- select ... from T where x = 10 and y = 'b';
- Extend the above approach: for each skewed value of (x, y), store the file it maps to. So the metastore will have a mapping like: (10, 'a') -> 1, (10, 'b') -> 2, (20, 'c') -> 3, (others) -> 4.
A query with all the clustering keys specified can be optimized easily. However, queries that specify only some of the clustering keys:
- select ... from T where x = 10;
- select ... from T where y = 'b';
allow only a few files to be pruned. It does not matter much whether a prefix of the clustering keys is specified or not. For example, for x=10, the Hive compiler can only prune the file corresponding to (20, 'c').
And for y='b', the files corresponding to (10, 'a') and (20, 'c') can be pruned. Hashing within 'others' does not really help when the complete key is not specified.
This approach does not scale in the following scenarios:
- The number of skewed keys is very large. This creates a problem for metastore scalability.
- In most cases, there is more than one clustering key, and the query does not specify all of the clustering keys.
Hive Enhancements
Hive needs to be extended to support the following:
- create table <T> (schema) list bucketed by (keys) with skew ('c1', 'c2');
- alter table <T> (schema) list bucketed by (keys) with skew ('c1', 'c2');
For example:
- create table T (c1 string, c2 string) list bucketed by (c1) with skew ('x1');
- create table T (c1 string, c2 string, c3 string) list bucketed by (c1, c2) with skew (('x1', 'x2'), ('y1', 'y2'));
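Following the proposed syntax, updating the skewed-key list of an existing table might then look like this (the values are illustrative; the exact alter-table grammar is still to be finalized):
- alter table T list bucketed by (c1) with skew ('x1', 'x2');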
Design
When such a table is being loaded, it would be good to create a sub-directory per skewed key. Infrastructure similar to dynamic partitions can be used.
Alter table <T> partition <P> concatenate; needs to be changed to merge files within each directory.
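As a sketch of the resulting layout for the single-key example above (directory and file names are illustrative; the naming of the per-key sub-directories and of the catch-all directory is an implementation detail), a loaded partition might look like:
- .../T/ds=2012-04-15/x=6/file_0
- .../T/ds=2012-04-15/x=20/file_0
- .../T/ds=2012-04-15/x=30/file_0
- .../T/ds=2012-04-15/x=40/file_0
- .../T/ds=2012-04-15/others/file_0
The merge would then run per sub-directory, triggered by the existing statement:
- alter table T partition (ds='2012-04-15') concatenate;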