...

This section describes general compression approaches and their pros and cons. The following compression mechanisms are implemented in practice:

  1. Data format
  2. Index prefix compression
  3. Page-level compression
  4. WAL compression
  5. Per-column compression
  6. Compression on file system level
  7. Column store

Data Format

...

Efficient disk usage starts with proper data layout. Vendors strive to place data in pages in such a way that total overhead is kept as low as possible while still maintaining high read speed. Typically this is achieved as follows:

...

[3] https://docs.mongodb.com/manual/core/wiredtiger/#storage-wiredtiger-compression

Page-level Compression

Whole pages can be compressed. This gives a 2x-4x reduction in size on average. Two approaches are used in practice: without in-memory compression and with in-memory compression.

Without in-memory compression

Data is stored in memory as is, in uncompressed form. Compression is applied when the data is flushed to disk. If the data size is reduced significantly, the data is stored in compressed form. Otherwise it is stored in plain form (compression failure). Big block sizes (e.g. 32Kb) are typically used in this case to achieve higher compression rates. Data is still written to disk in smaller blocks. E.g. one may have a 32Kb block in memory, which is compressed to 7Kb and then written to disk as two 4Kb blocks. Vendors allow the compression algorithm to be selected (Snappy, zlib, lz4, etc.).
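The flush path described above can be sketched roughly as follows. This is a minimal illustration, not any vendor's implementation: the function name flush_page, the helper blocks_needed, and the MIN_SAVING threshold are hypothetical, and zlib stands in for whichever codec is configured. The 32Kb page and 4Kb disk block sizes are taken from the example in the text.

```python
import zlib
from typing import Tuple

PAGE_SIZE = 32 * 1024   # in-memory page size, from the example in the text
DISK_BLOCK = 4 * 1024   # on-disk block size, from the example in the text
MIN_SAVING = 0.25       # hypothetical: require at least 25% saving on disk


def blocks_needed(size: int) -> int:
    """Number of whole disk blocks required to hold `size` bytes."""
    return -(-size // DISK_BLOCK)  # ceiling division


def flush_page(page: bytes) -> Tuple[bool, bytes]:
    """Return (is_compressed, bytes_to_write) for one in-memory page.

    The compressed payload is padded with zeros to a whole number of
    disk blocks. If the padded size does not save enough space, the
    page is written in plain form (a "compression failure").
    """
    if len(page) != PAGE_SIZE:
        raise ValueError("page must be exactly PAGE_SIZE bytes")
    packed = zlib.compress(page)
    padded_len = blocks_needed(len(packed)) * DISK_BLOCK
    if padded_len > PAGE_SIZE * (1 - MIN_SAVING):
        return False, page
    return True, packed.ljust(padded_len, b"\0")
```

With these assumptions, a page that compresses to 7Kb yields blocks_needed(7 * 1024) == 2, i.e. two 4Kb disk blocks, matching the example in the text.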

Hole punching with fallocate [1] might be added if the underlying file system supports it. In this case the compressed block is written as is, and the empty space is then trimmed with a separate system call. E.g. if a 32Kb block is compressed to 6.5Kb, then 32Kb is written as is, the 6.5Kb payload is rounded up to 7Kb (hole punching works on whole file system blocks), and the remaining 32 - 7 = 25Kb are released.
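The arithmetic of the hole-punching example can be checked with a small sketch. It only computes sizes and does not call fallocate. The function name released_bytes and the fs_block parameter are hypothetical; the example in the text corresponds to a 1Kb file system block, while a 4Kb file system block would release less.

```python
KB = 1024


def released_bytes(page_size: int, compressed_size: int, fs_block: int) -> int:
    """Bytes that hole punching can release for one page.

    The compressed payload keeps a whole number of file system blocks;
    everything after that, up to the end of the page, can be released.
    """
    kept = -(-compressed_size // fs_block) * fs_block  # round up to fs_block
    return max(page_size - kept, 0)


# 32Kb page compressed to 6.5Kb, with two possible file system block sizes
with_1k_blocks = released_bytes(32 * KB, 6656, 1 * KB)  # keeps 7Kb
with_4k_blocks = released_bytes(32 * KB, 6656, 4 * KB)  # keeps 8Kb
```

Under these assumptions with_1k_blocks is 25Kb, as in the example in the text, and with_4k_blocks is 24Kb.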

Advantages:

  • High compression rates
  • No overhead when reading data from memory
  • Ability to choose compression algorithm

Disadvantages:

  • High RAM usage
  • Need to re-compress data frequently
  • Hole-punching is supported by very few file systems (XFS, ext4, Btrfs), and may lead to heavy file maintenance [2]

Examples:

  1. MySQL Table Compression - uses different in-memory and disk block sizes, block data is fully re-compressed on every access [3]
  2. MySQL Page Compression - uses hole-punching instead [4]
  3. MongoDB with Snappy codec - gathers up to 32Kb of data and then tries to compress it [5]

[1] http://man7.org/linux/man-pages/man2/fallocate.2.html

[2] https://mariadb.org/innodb-holepunch-compression-vs-the-filesystem-in-mariadb-10-1/

[3] https://dev.mysql.com/doc/refman/5.7/en/innodb-compression-background.html

[4] https://mysqlserverteam.com/innodb-transparent-page-compression/

[5] https://www.objectrocket.com/blog/company/mongodb-3-0-wiredtiger-compression-and-performance/

 TODO

Draft materials

Compression options:

...