This page describes investigations into improving the compression techniques used in columnar storage, for instance with RCFile.
Hadoop already provides BZip2 and Gzip codecs that can be used for columnar compression in Hive/HCatalog.
Other compressors are available that are lighter on resources (Snappy, ...) or achieve better compression ratios (PAQ, ...), but we should be able to exploit certain aspects of the usage scenario to produce a compression scheme that is both better and faster than the existing options:
- the data in a column is all of the same type
- in some cases, the data may be ordered
- block compression is always possible
- record order may not matter in most situations
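To make the ordering point concrete, here is a minimal sketch (the function names are hypothetical, not an existing API) of how sortedness can be exploited: sorted integers can be delta-encoded, turning large absolute values into small gaps that variable-length codes compress far better.

```python
def delta_encode(values):
    """Replace each value with its gap from the previous one.

    For sorted data the gaps are small non-negative integers,
    which variable-length codes compress far better than the
    raw 32-bit values.
    """
    prev = 0
    gaps = []
    for v in values:
        gaps.append(v - prev)
        prev = v
    return gaps

def delta_decode(gaps):
    """Invert delta_encode by running a prefix sum."""
    values = []
    total = 0
    for g in gaps:
        total += g
        values.append(total)
    return values
```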
Initially, I am going to look into integer compression, but the lessons learned here should translate to other types as well.
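As a baseline for the integer codes evaluated below, the continuation-bit ("vint") scheme works roughly like this sketch: each byte carries 7 payload bits, and the high bit flags whether more bytes follow. (This is a generic little-endian varint for illustration; it is not necessarily the exact byte layout of Hadoop's `WritableUtils` vint.)

```python
def vint_encode(n):
    """Encode a non-negative integer with 7 payload bits per byte;
    the high bit of each byte signals that more bytes follow."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit set
        else:
            out.append(byte)         # final byte
            return bytes(out)

def vint_decode(data):
    """Decode a single vint from the front of `data`."""
    n, shift = 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return n
    raise ValueError("truncated vint")
```

Small values cost one byte, so a column of small integers falls well below the fixed 32 bits per symbol of the no-op encoding.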
Here are the compression metrics for some simulated data; the table shows the number of bits per symbol, where each symbol is a 32-bit integer. In general, bzip2 does well in terms of compression efficiency, but its bits-per-symbol is still higher than the first-order entropy, so we could exploit this to produce a better compressor using a one-level PPM model.
| data set | entropy | 1st-order entropy | arithmetic | continuation bit | no-op | elias delta | elias gamma | elias omega | golomb | bzip2 | default | gzip | huffman | 1st order huffman | vint |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| col1 | 4.038 | 3.962 | 4.053 | 13.184 | 32 | 12.131 | 13.247 | 13.357 | 9.069 | 4.75 | 6.438 | 6.439 | 4.088 | 4.552 | 10.289 |
| col2 | 2.497 | 2.483 | 2.503 | 8.521 | 32 | 10.877 | 11.761 | 12.699 | 7.943 | 3.013 | 4.353 | 4.353 | 2.565 | 2.614 | 8 |
| col3 | 3.559 | 1.932 | 3.573 | 39.489 | 32 | 39.347 | 60.694 | 42.347 | 32.63 | 2.57 | 3.446 | 3.447 | 3.625 | 2.348 | 39.997 |
| col4d | 8.417 | 0.027 | 8.514 | 40 | 32 | 40 | 62 | 43 | 33 | 0.32 | 0.11 | 0.111 | 8.675 | 1.335 | 40 |
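For reference, the two entropy columns can be estimated with a sketch like the following: zeroth-order entropy treats symbols as independent, while first-order entropy conditions each symbol on its predecessor, which is the best an order-1 (one-level PPM) model can achieve. The gap between bzip2's bits-per-symbol and the first-order entropy is the headroom such a model could target.

```python
import math
from collections import Counter, defaultdict

def entropy(symbols):
    """Zeroth-order entropy in bits per symbol: -sum(p * log2(p))."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def first_order_entropy(symbols):
    """Entropy of each symbol conditioned on its predecessor,
    i.e. the best an order-1 (one-level PPM) model can do."""
    contexts = defaultdict(list)
    for prev, cur in zip(symbols, symbols[1:]):
        contexts[prev].append(cur)
    total = len(symbols) - 1
    return sum(len(following) / total * entropy(following)
               for following in contexts.values())
```

Note how this explains col4d: its zeroth-order entropy is high (8.417) but its first-order entropy is nearly zero (0.027), so each value is almost fully determined by the previous one, which is why bzip2 and gzip do so well on it.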
To be done: analysis of CPU and memory usage.
Issues to be sorted out:
The split between SerDes and compression codecs, where SerDes convert cells to byte arrays and compression codecs compress those byte arrays later, means that we need to either (a) deserialize before compression or (b) allow cell values to fall on bit boundaries.
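Option (b) implies a bit-level output stream. A minimal sketch of such a writer (a hypothetical class, not an existing Hadoop API) shows what "bit boundaries for cell values" entails: values of arbitrary width are packed without byte alignment, and only the final output is padded to whole bytes.

```python
class BitWriter:
    """Accumulates values of arbitrary bit width and packs them
    into bytes, so cell values need not start on byte boundaries."""

    def __init__(self):
        self._bits = []

    def write(self, value, width):
        """Append `value` using exactly `width` bits, MSB first."""
        for i in reversed(range(width)):
            self._bits.append((value >> i) & 1)

    def to_bytes(self):
        """Pad with zero bits to a byte boundary and pack."""
        bits = self._bits + [0] * (-len(self._bits) % 8)
        out = bytearray()
        for i in range(0, len(bits), 8):
            byte = 0
            for b in bits[i:i + 8]:
                byte = (byte << 1) | b
            out.append(byte)
        return bytes(out)
```

A codec built this way can emit, say, a 3-bit Golomb remainder followed by a 2-bit flag without wasting padding between cells, which is exactly what the byte-array interface between SerDes and codecs currently prevents.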