You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 5 Next »

This page describes the investigations into improving the compression techniques used in columnar storage, for instance, with RCFile.

Hadoop already provides BZip2 and Gzip codecs which can be used for columnar compression in hive/hcatalog.

There are other compressors available which are better at resource usage (snappy,...), or provide better compression (paq,...) but we should be able to exploit certain aspects of the usage scenario here to provide a compression which is both better and faster than the options available.

  • the fact that the data is of the same type
  • in some cases, that data may be ordered
  • block compression is always possible
  • record order may not matter in most situations

Initially, I am going to look into integer compression but the learnings here should translate to other types as well.

Here is the compression metrics with some simulated data. The number of bits per symbol, where the symbol is a 32-bit integer is shown here. In general, bzip2 does well in terms of compression efficiency, but still has a higher bits-per-symbol than 1st order entropy so we could exploit this to produce a better compressor using a 1-level ppm compression.

data set

entropy

1st-order entropy

arithmetic

continuation bit

no-op

elias delta

elias gamma

elias omega

golomb

bzip2

default

gzip

huffman

1st order huffman

vint

col1

4.038

3.962

4.053

13.184

32

12.131

13.247

13.357

9.069

4.75

6.438

6.439

4.088

4.552

10.289

col2

2.497

2.483

2.503

8.521

32

10.877

11.761

12.699

7.943

3.013

4.353

4.353

2.565

2.614

8

col3

3.559

1.932

3.573

39.489

32

39.347

60.694

42.347

32.63

2.57

3.446

3.447

3.625

2.348

39.997

col4d

8.417

0.027

8.514

40

32

40

62

43

33

0.32

0.11

0.111

8.675

1.335

40

To be done: Analysis of cpu/memory

Issues to be sorted out:

The SerDe/Compression codecs split - where serdes convert from cells to byte arrays and compression codecs compress these byte arrays later, means that we need to (a) either deserialize before compression or (b) allow bit boundaries for cell values.

  • No labels