Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Performance numbers of other vendors demonstrate that we could expect 2x-4x decrease in required disk space and >1.5x increase in throughput (ops/sec) on typical workloads.

Competitive

...

Analysis

This section describes general compression approaches and their pros and cons. The following compression mechanisms are implemented in practice:

...

Efficient disk usage starts with proper data layout. Vendors strive to place data in pages in such a way that total overhead is kept as low as possible while still maintaining high read speed. Typically this is achieved as follows:

  • Common metadata

...

  • is stored outside of data page
  • Numeric types are written using varlen encoding (e.g. int data type may take 1-5 bytes instead of 4)
  • Fixed-length string data types (CHAR, NCHAR) are trimmed
  • NULL and zero values are optimized to consume no space, typically with special bitmap (e.g. if there are 

Examples:

  1. SQL Server row format [1] - varlen, CHAR trimming, NULL/zero optimization
  2. MySQL row format [2] - no varlen, no CHAR trimming, NULL/zero optimization

...

Prefix compression could be applied to the following index types:

  • Non-unique single column secondary index
  • Non-unique and unique multi-column secondary index

Prefix compression could be implemented as follows:

  • Static - compression is applied to all index rows irrespective of whether it is beneficial or not. Attributes with low cardinality are compressed well. Contrary, attributes with high cardinality may have negative compression rates. Decision whether to compress or not is delegated to user (typically DBA)
  • Dynamic - compression is either applied or not applied to indexed values on page-by-page basis based on some internal heuristics. Negative compression rates are avoided automatically, but implementation is more complex.

Examples:

  1. Oracle index compression (static) [1]
  2. Oracle advanced index compression (dynamic) [2]
  3. MongoDB index compression [3]

...

[1] http://www.oracle.com/technetwork/database/features/availability/311358-132337.pdf
[2] https://msdn.microsoft.com/en-us/library/gg492088(v=sql.120).aspx
[3] https://docs.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-overview?view=sql-server-2017
[4] https://docs.microsoft.com/en-us/sql/relational-databases/data-compression/data-compression?view=sql-server-2017#using-columnstore-and-columnstore-archive-compression

Misc Approaches

Compressing

...

Large Values

Large values, such as LOBs and long varchars, cannot be stored in original data block. Some vendors compress these values and then split into pieces. 

...

[1] https://wiki.postgresql.org/wiki/TOAST
[2] https://docs.oracle.com/database/121/ADLOB/adlob_smart.htm#ADLOB45944
[3] https://docs.microfocus.com/itom/Network_Automation:10.50/Administer/DB_Compression/ConfiguringLOB_Oracle

Compression

...

During Data Load

Oracle attempts to compress values during data load (Direct Path, CTAS, INSERT ... APPEND) [1]. Compression is applied on per-block basis using dictionary approach. Oracle may decide to skip compression if there are no benefits. Alternatively, it may reorder attrbutes in rows to get longer common prefixes and improve compression ratio.

[1] https://www.red-gate.com/simple-talk/sql/oracle/compression-oracle-basic-table-compression/

Best practices

Proposed changes

Proposed Changes

Some approaches adds more value than others. Some approaches are hard to implement, some are easy. For this reason compression should be implemented in phases, with the most efficient and simple techniques being developed first. Proposed plan:

Phase 1: Low Hanging Fruits

  • Index Prefix Compression - efficient and relatively easy to implement
  • WAL Compression - could increase throughput and easy to evaluate

Phase 2: The Battle

  • Page Compression - efficient, but implementation would be complex. with lots of changes to storage engine

Phase 3: Excellence

  • Data Format improvements - moderate value for the system, complex to implement
  • Column Compression - depends on new data format 

Out of Scope

The following changes are not likely to be implemented in the nearest time due to their complexity and/or limited impact on general use cases:

  • Columnar store - require large changes in storage engine

 

 

 TBD

 

Tickets

Jira
serverASF JIRA
columnskey,summary,type,created,updated,due,assignee,reporter,priority,status,resolution
maximumIssues20
jqlQueryproject = Ignite AND labels IN (iep-20) ORDER BY status
serverId5aa69414-a9e9-3523-82ec-879b028fb15b