...

  1. Data Format
  2. Index Prefix Compression
  3. Page Compression
  4. WAL Compression
  5. Column Compression
  6. Columnar Store
  7. Row Compression with Dictionary
  8. Misc Approaches

Data size could be reduced with a more efficient data format: common row metadata could be skipped, numeric values could be encoded to occupy less space, fixed-length strings could be trimmed, and NULL and zero values could be skipped with the help of small per-row bitmaps.
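
A minimal sketch of two of these ideas, a variable-length (LEB128-style) integer encoding and a per-row NULL bitmap, is shown below; the class and method names are illustrative, not part of any existing API:

    import java.io.ByteArrayOutputStream;

    public class CompactRow {
        /** LEB128-style varint: values below 128 take one byte instead of a fixed four. */
        static void writeVarInt(ByteArrayOutputStream out, int v) {
            while ((v & ~0x7F) != 0) {
                out.write((v & 0x7F) | 0x80); // low 7 bits plus a continuation flag
                v >>>= 7;
            }
            out.write(v); // final byte, continuation flag clear
        }

        /** Encode a row of nullable ints: a leading bitmap marks NULLs, which are skipped entirely. */
        static byte[] encode(Integer[] fields) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int bitmap = 0;
            for (int i = 0; i < fields.length; i++)
                if (fields[i] == null)
                    bitmap |= 1 << i; // one bit per field; this sketch assumes at most 8 fields
            out.write(bitmap);
            for (Integer f : fields)
                if (f != null)
                    writeVarInt(out, f);
            return out.toByteArray();
        }
    }

With this layout the row {1, null, 300} encodes to 4 bytes (bitmap 0x02, then varints 0x01 and 0xAC 0x02) instead of 12 bytes of fixed-width integers.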

...

[1] http://www.oracle.com/technetwork/database/features/availability/311358-132337.pdf
[2] https://msdn.microsoft.com/en-us/library/gg492088(v=sql.120).aspx
[3] https://docs.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-overview?view=sql-server-2017
[4] https://docs.microsoft.com/en-us/sql/relational-databases/data-compression/data-compression?view=sql-server-2017#using-columnstore-and-columnstore-archive-compression

Row Compression with Dictionary

Usually data is stored in row format, and there is a lot of overlap between values in different rows: flags, enum values, dates, and strings tend to repeat the same byte sequences from row to row. It is possible to harvest a set of typical rows for a table, build an external dictionary from them, and then reuse this dictionary when writing each subsequent row. This offers only limited benefit for a classical RDBMS, whose row format is low-overhead and relies on fixed field-offset lookups that compression defeats. However, the BinaryObjects used in Ignite are high-overhead, with field and type information repeating in every record, and offset lookups are not used, so row compression can provide a high yield at low cost. In theory the dictionary could be shared between nodes, but keeping a separate dictionary per node looks more practical.
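
The JDK's Deflater/Inflater pair supports exactly this kind of preset dictionary, so a minimal sketch of the idea can stay very small. This illustrates the technique only, it is not the prototype from [2], and the class name is hypothetical:

    import java.util.Arrays;
    import java.util.zip.DataFormatException;
    import java.util.zip.Deflater;
    import java.util.zip.Inflater;

    /** Per-node row compressor seeded with a dictionary harvested from typical rows. */
    public class DictRowCompressor {
        private final byte[] dictionary;

        public DictRowCompressor(byte[] dictionary) {
            this.dictionary = dictionary;
        }

        /** Compress one serialized row against the shared dictionary. */
        public byte[] compress(byte[] row) {
            Deflater def = new Deflater(Deflater.BEST_SPEED);
            def.setDictionary(dictionary); // seed the LZ77 window with common byte sequences
            def.setInput(row);
            def.finish();
            byte[] buf = new byte[row.length + row.length / 1000 + 64]; // headroom for incompressible rows
            int len = def.deflate(buf);
            def.end();
            return Arrays.copyOf(buf, len);
        }

        /** Decompress one row; the stream itself signals that the preset dictionary is needed. */
        public byte[] decompress(byte[] packed, int originalLen) throws DataFormatException {
            Inflater inf = new Inflater();
            inf.setInput(packed);
            byte[] row = new byte[originalLen];
            int n = inf.inflate(row);
            if (n == 0 && inf.needsDictionary()) {
                inf.setDictionary(dictionary);
                inf.inflate(row);
            }
            inf.end();
            return row;
        }
    }

Storing originalLen in the row header keeps decompression allocation-exact; a real implementation would also have to version dictionaries so that old pages remain readable after a dictionary update.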

Advantages:

  • Easy to implement - no architectural changes
  • Reasonably fast in both writing and reading
  • 2.5x compression on mock data with a naive prototype
  • More data per page - more data fits in RAM - lower latency even if throughput is lower

Disadvantages:

  • Need to keep the dictionary alongside the data
  • Occasionally need to update the dictionary as data evolves
  • Need to keep track of multiple dictionaries per node
  • Pages from different nodes use different dictionaries, which might interfere with checkpointing

Examples:

  1. IBM DB2 supports this approach on a per-table basis [1]
  2. A prototype of dictionary-based row compression for Apache Ignite [2]

[1] https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.admin.dbobj.doc/doc/c0052331.html

[2] https://github.com/apache/ignite/pull/4295

Misc Approaches

Compressing Large Values

Large values, such as LOBs and long varchars, cannot be stored in the original data block. Some vendors compress these values and then split them into pieces.
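
A minimal sketch of this compress-then-split order, assuming GZIP and a page-sized piece length (the helper name is illustrative):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.zip.GZIPOutputStream;

    public class LobUtils {
        /** Compress a large value first, then split the compressed bytes into page-sized pieces. */
        static List<byte[]> compressAndSplit(byte[] largeValue, int pieceSize) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(largeValue); // compress the whole value in one pass
            }
            byte[] packed = bos.toByteArray();
            List<byte[]> pieces = new ArrayList<>();
            for (int off = 0; off < packed.length; off += pieceSize)
                pieces.add(Arrays.copyOfRange(packed, off, Math.min(off + pieceSize, packed.length)));
            return pieces;
        }
    }

Compressing before splitting usually yields fewer pieces than the reverse order, since the compressor sees the whole value and no cross-piece redundancy is lost.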

Examples:

...

  • Page Compression - efficient, but the implementation would be complex, with lots of changes to the storage engine
  • or Row Compression with Dictionary - no changes to the storage engine, but adds dictionary management

Phase 3: Excellence

  • Data Format improvements - moderate value for the system, complex to implement; the benefit may be cancelled out once actual compression is in place
  • Column Compression - depends on the new data format

...