...

In September 2016, Facebook announced a new compression implementation named ZStandard, which is designed to scale with modern data processing environments. Because of its strong performance in both speed and compression rate, Hadoop and HBase will support ZStandard in the near future.

I propose adding support for ZStandard compression to Kafka, along with new configuration options and a binary log format update.

Before going further, it helps to look at benchmark results for ZStandard. I compared the compressed size and compression time of three 1 KB messages (3,102 bytes in total) using the draft implementation of the ZStandard compression codec and all currently available CompressionCodec implementations. The benchmark code is available in this branch. All elapsed times are averages over 100 iterations, preceded by 5 warm-up iterations. (To run the benchmark in your environment, move to jmh-benchmarks and run the following command: ./jmh.sh -wi 5 -i 100 -f 1)

 

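| Codec     | Level | Size (byte) | Time (ms)        | Description                             |
|-----------|-------|-------------|------------------|-----------------------------------------|
| Gzip      | -     | 396         | 0.083 ± 0.008    |                                         |
| Snappy    | -     | 1,063       | 0.030 ± 0.001    |                                         |
| LZ4       | -     | 387         | 0.012 ± 0.001    |                                         |
| Zstandard | 1     | 374         | 0.045 ± 0.003    | Speed-first setting.                    |
| Zstandard | 2     | 374         | 0.039 ± 0.001    |                                         |
| Zstandard | 3     | 379         | 0.057 ± 0.003    | Facebook's recommended default setting. |
| Zstandard | 4     | 379         | 0.121 ± 0.013    |                                         |
| Zstandard | 5     | 373         | 0.081 ± 0.004    |                                         |
| Zstandard | 6     | 373         | 0.135 ± 0.016    |                                         |
| Zstandard | 7     | 373         | 0.688 ± 0.060    |                                         |
| Zstandard | 8     | 373         | 0.805 ± 0.072    |                                         |
| Zstandard | 9     | 373         | 1.038 ± 0.060    |                                         |
| Zstandard | 10    | 373         | 1.400 ± 0.099    |                                         |
| Zstandard | 11    | 373         | 2.515 ± 0.188    |                                         |
| Zstandard | 12    | 373         | 2.413 ± 0.195    |                                         |
| Zstandard | 13    | 373         | 2.889 ± 0.219    |                                         |
| Zstandard | 14    | 373         | 2.340 ± 0.030    |                                         |
| Zstandard | 15    | 374         | 1.943 ± 0.118    |                                         |
| Zstandard | 16    | 374         | 6.759 ± 0.625    |                                         |
| Zstandard | 17    | 371         | 3.045 ± 0.198    |                                         |
| Zstandard | 18    | 371         | 8.508 ± 0.787    |                                         |
| Zstandard | 19    | 368         | 8.721 ± 0.499    |                                         |
| Zstandard | 20    | 368         | 29.475 ± 2.456   |                                         |
| Zstandard | 21    | 368         | 54.713 ± 5.023   |                                         |
| Zstandard | 22    | 368         | 227.643 ± 18.390 | Size-first setting.                     |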
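The measurement procedure described above (a few warm-up iterations followed by averaged measured iterations) can be sketched roughly as follows. This is a hand-rolled simplification, not the JMH benchmark from the branch: it uses the JDK's GZIPOutputStream as a stand-in codec because ZStandard bindings are not part of the JDK, and the class name, method names, and payload are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

final class CompressionTimingSketch {

    /** Compresses the payload with GZIP, used here only as a stand-in codec. */
    static byte[] gzip(byte[] payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Discards the warm-up runs, then returns the mean time per compression in milliseconds. */
    static double averageMillis(byte[] payload, int warmups, int iterations) {
        for (int i = 0; i < warmups; i++) {
            gzip(payload);
        }
        long totalNanos = 0L;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            gzip(payload);
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1_000_000.0 / iterations;
    }

    public static void main(String[] args) {
        // Hypothetical payload: repetitive JSON-like text, not the data used in the table above.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 50; i++) {
            sb.append("{\"id\":").append(i).append(",\"status\":\"ok\"}");
        }
        byte[] payload = sb.toString().getBytes(StandardCharsets.UTF_8);

        // Same iteration counts as the command above: 5 warm-ups, 100 measured runs.
        System.out.printf("original=%d bytes, compressed=%d bytes, mean=%.3f ms%n",
                payload.length, gzip(payload).length, averageMillis(payload, 5, 100));
    }
}
```

A real comparison should rely on JMH as the proposal does, since a plain timing loop like this one is sensitive to JIT compilation and garbage collection noise.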

Many popular big data processing frameworks already support ZStandard.

  • Hadoop (3.0.0) - HADOOP-13578
  • HBase (2.0.0) - HBASE-16710
  • Spark (2.3.0) - SPARK-19112

ZStandard also works well with Apache Kafka. Benchmarks with the draft version (with ZStandard 1.3.3, Java Binding 1.3.3-4) showed a significant performance improvement. The following benchmark is based on Shopify's production environment (thanks to @bobrik).

[Image] [Image]

(Above: Drop around 22:00 is zstd level 1, then at 23:30 zstd level 6.)

As you can see, ZStandard outperforms the other codecs with a compression ratio of 4.28x; Snappy reaches just 2.5x, and Gzip is not even close in terms of either ratio or speed.

It is worth noting that this outcome is based on ZStandard 1.3. According to Facebook, ZStandard 1.3.4 improves throughput by 20-30%, depending on compression level and underlying I/O performance.

[Image]

(Above: Comparison between ZStandard 1.3.3 and ZStandard 1.3.4.)

[Image]

(Above: Comparison with the other compression codecs supported by Kafka.)

As of May 2018, the Java binding for ZStandard 1.3.4 is still in progress; it will be updated before merging if this proposal is approved.

Accompanying Issues

However, supporting ZStandard is not just a matter of adding a new compression codec; it introduces several related issues that we need to address first:

Backward Compatibility

Since the producer chooses the compression codec by default, there are potential problems:

A. An old consumer that does not support ZStandard receives ZStandard compressed data from the brokers.
B. An old broker that does not support ZStandard receives ZStandard compressed data from the producer.

To address the problems above, we have the following options:

a. Bump the produce and fetch protocol versions in ApiKeys.

Advantages:

  • Can guide the users to upgrade their client.
  • Can support advanced features:
      • Broker transcoding: currently, the broker throws UNKNOWN_SERVER_ERROR for an unknown compression codec; this feature would address that.
      • Per-topic configuration: we can force the clients to use only predefined compression codecs by configuring the available codecs for each topic. This feature should be handled in a separate KIP, but this approach can prepare for it.

Disadvantages:

  • The older brokers can't make use of ZStandard.
  • Short of a bump to the message format version.
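As a rough illustration of what a protocol version bump makes possible, the sketch below shows a receiver-side check that only lets ZStandard data through when the peer speaks a new enough version. Everything here is hypothetical: the class name, the method name, and the value of MIN_ZSTD_VERSION are placeholders, and the proposal does not specify which versions would be involved. The codec id 4 is taken from the error message quoted under option b.

```java
final class VersionGateSketch {

    // Id that ZStandard would take, as implied by "Unknown compression type id: 4".
    static final int ZSTD_ID = 4;

    // Placeholder for the first protocol version that permits ZStandard (an assumption).
    static final short MIN_ZSTD_VERSION = 10;

    /** Returns true if data using the given codec may be exchanged at the given protocol version. */
    static boolean isAllowed(int codecId, short protocolVersion) {
        return codecId != ZSTD_ID || protocolVersion >= MIN_ZSTD_VERSION;
    }

    public static void main(String[] args) {
        System.out.println(isAllowed(ZSTD_ID, (short) 9));   // old peer: rejected
        System.out.println(isAllowed(ZSTD_ID, (short) 10));  // new peer: accepted
        System.out.println(isAllowed(1, (short) 0));         // other codecs are unaffected
    }
}
```

With such a gate in place, the side that detects the mismatch can return a meaningful error instead of failing while decoding.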

b. Leave unchanged - let the old clients fail.

Previously added codecs, Snappy (commit c51b940) and LZ4 (commit 547cced), follow this approach. With this approach, the problems listed above end with the following error message:

Code Block
java.lang.IllegalArgumentException: Unknown compression type id: 4
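The error above arises when a client looks up a codec id it does not know. A minimal sketch of that kind of lookup, assuming a simplified enum rather than Kafka's actual classes (the class and enum names here are illustrative), might look like this:

```java
final class CodecLookupSketch {

    /** Codecs known to a client that predates ZStandard support. */
    enum Codec {
        NONE(0), GZIP(1), SNAPPY(2), LZ4(3);

        final int id;

        Codec(int id) {
            this.id = id;
        }
    }

    /** Resolves a codec by its id and fails for any id this client does not recognize. */
    static Codec forId(int id) {
        for (Codec codec : Codec.values()) {
            if (codec.id == id) {
                return codec;
            }
        }
        throw new IllegalArgumentException("Unknown compression type id: " + id);
    }

    public static void main(String[] args) {
        System.out.println(forId(3)); // a known codec resolves normally
        try {
            forId(4); // the id that ZStandard would use
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The message carries only the numeric id, which is why a user has no hint that upgrading the client would resolve it.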

Advantages:

  • Easy to implement: nothing special is needed.

Disadvantages:

  • The error message is terse. Some users with old clients may not know how to cope with this error.

c. Improve the error messages

This approach is a compromise between a and b. We can provide the supported API version for each compression codec within the error message by defining a mapping between CompressionCodec and ApiKeys:

Code Block
NoCompressionCodec => ApiKeys.OFFSET_FETCH        // 0.7.0
GZIPCompressionCodec => ApiKeys.OFFSET_FETCH      // 0.7.0
SnappyCompressionCodec => ApiKeys.OFFSET_FETCH    // 0.7.0
LZ4CompressionCodec => ApiKeys.OFFSET_FETCH       // 0.7.0
ZStdCompressionCodec => ApiKeys.DELETE_GROUPS     // 2.0.0
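One way to picture how such a mapping could feed the error message is sketched below. It is a simplification under stated assumptions: it maps each codec directly to the version string from the comments above instead of going through ApiKeys, and the class name, method name, and message wording are hypothetical.

```java
final class DescriptiveCodecErrorSketch {

    /** Codecs paired with the first supporting version, copied from the comments in the mapping above. */
    enum Codec {
        NONE(0, "0.7.0"),
        GZIP(1, "0.7.0"),
        SNAPPY(2, "0.7.0"),
        LZ4(3, "0.7.0"),
        ZSTD(4, "2.0.0");

        final int id;
        final String minVersion;

        Codec(int id, String minVersion) {
            this.id = id;
            this.minVersion = minVersion;
        }
    }

    /** Builds a hypothetical error message that tells the user which version to upgrade to. */
    static String unsupportedMessage(Codec codec) {
        return "Unsupported compression type " + codec.name() + " (id: " + codec.id
                + "); upgrade to version " + codec.minVersion + " or later.";
    }

    public static void main(String[] args) {
        System.out.println(unsupportedMessage(Codec.ZSTD));
    }
}
```

Only a component that already knows the mapping can produce this message, so it would have to originate from an upgraded broker or client.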

Advantages:

  • Not so much work to do.
  • Can guide the users to upgrade their client.
  • Leaves some room for advanced features in the future, like Per Topic Configuration.

Disadvantages:

  • The error message may still be short.

Support Dictionary

Another issue worth bringing into play is the dictionary feature. ZStandard offers a training mode, which yields a dictionary for compression and decompression. It dramatically improves the compression ratio of small and repetitive input (e.g., semi-structured JSON), which fits Kafka's case perfectly. (For a real-world benchmark, see here.) Although the details of how to adapt this feature to Kafka (example) should be discussed in a separate KIP, we need to leave room for it.

As you can see above, ZStandard shows outstanding performance in both compression rate and speed, especially with the speed-first setting (level 1), to the extent that only LZ4 is comparable to ZStandard.
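To give a feel for why a shared dictionary helps with small, repetitive messages, the sketch below uses the JDK's java.util.zip.Deflater with a preset dictionary. This is a stand-in, not ZStandard: ZStandard's trained dictionaries are a different and more capable mechanism, and its Java bindings are not part of the JDK. The sample message, the dictionary contents, and all names are assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class PresetDictionarySketch {

    /** Compresses the input with DEFLATE, optionally primed with a preset dictionary. */
    static byte[] deflate(byte[] input, byte[] dictionary) {
        Deflater deflater = new Deflater();
        try {
            if (dictionary != null) {
                deflater.setDictionary(dictionary);
            }
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[256];
            while (!deflater.finished()) {
                out.write(buffer, 0, deflater.deflate(buffer));
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    /** Decompresses the data, supplying the dictionary only when the stream asks for it. */
    static byte[] inflate(byte[] compressed, byte[] dictionary, int originalLength) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            byte[] out = new byte[originalLength];
            int written = 0;
            while (!inflater.finished() && written < out.length) {
                int n = inflater.inflate(out, written, out.length - written);
                if (n == 0) {
                    if (inflater.needsDictionary()) {
                        inflater.setDictionary(dictionary);
                    } else if (inflater.needsInput()) {
                        break; // truncated input; return what was decoded so far
                    }
                }
                written += n;
            }
            return Arrays.copyOf(out, written);
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        // Hypothetical message and dictionary: the dictionary holds the keys every message shares.
        byte[] message = "{\"user_id\":12345,\"event\":\"page_view\",\"timestamp\":1526000000}"
                .getBytes(StandardCharsets.UTF_8);
        byte[] dictionary = "{\"user_id\":,\"event\":\"page_view\",\"timestamp\":"
                .getBytes(StandardCharsets.UTF_8);

        System.out.println("plain:           " + deflate(message, null).length + " bytes");
        System.out.println("with dictionary: " + deflate(message, dictionary).length + " bytes");
    }
}
```

Both sides must hold the same dictionary, which is the reason the message format needs room to identify or distribute it.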

Public Interfaces

This feature requires modifications to both the configuration options and the binary log format.

...

  1. Add a new dependency on the Java bindings of ZStandard compression.
  2. Add a new value to the CompressionType enum type and define ZStdCompressionCodec in the kafka.message package.
  3. Add an appropriate routine for the backward compatibility problem discussed above.

You can check the proof-of-concept implementation of this feature in this Pull Request.

Compatibility, Deprecation, and Migration Plan

It is entirely up to the community to decide how to handle the backward compatibility problem.

Rejected Alternatives

None yet.

...