Status

Current state: Under Discussion

Discussion thread: here

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

In September 2016, Facebook announced a new compression implementation named Zstandard, which is designed to scale with modern data processing environments. Given its strong performance in both speed and compression ratio, Hadoop and HBase will support Zstandard in the near future.

I propose that Kafka add support for Zstandard compression, along with new configuration options and an update to the binary log format.

Before going further, it helps to look at benchmark results for Zstandard. I compared the compressed size and compression time of three 1 KB messages (3102 bytes in total), using the draft implementation of the Zstandard compression codec and all currently available compression codecs. The benchmark code is available in this commit. All elapsed times are the average of 100 iterations, preceded by 5 warm-up iterations. (To run the benchmark in your environment, change to the jmh-benchmarks directory and run the following command: ./jmh.sh -wi 5 -i 100 -f 1)

Codec	Level	Size (bytes)	Time (ms)	Description
Gzip	-	396	0.083 ± 0.008
Snappy	-	1,063	0.030 ± 0.001
LZ4	-	387	0.012 ± 0.001
Zstandard	1	374	0.045 ± 0.003	Speed-first setting.
	2	374	0.039 ± 0.001
	3	379	0.057 ± 0.003	Facebook's recommended default setting.
	4	379	0.121 ± 0.013
	5	373	0.081 ± 0.004
	6	373	0.135 ± 0.016
	7	373	0.688 ± 0.060
	8	373	0.805 ± 0.072
	9	373	1.038 ± 0.060
	10	373	1.400 ± 0.099
	11	373	2.515 ± 0.188
	12	373	2.413 ± 0.195
	13	373	2.889 ± 0.219
	14	373	2.340 ± 0.030
	15	374	1.943 ± 0.118
	16	374	6.759 ± 0.625
	17	371	3.045 ± 0.198
	18	371	8.508 ± 0.787
	19	368	8.721 ± 0.499
	20	368	29.475 ± 2.456
	21	368	54.713 ± 5.023
	22	368	227.643 ± 18.390	Size-first setting.

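As the table shows, Zstandard performs well in both compression ratio and speed, especially at the speed-first setting (level 1), where it produces smaller output than every other codec (374 bytes, versus 387 for LZ4, 396 for Gzip, and 1,063 for Snappy) in 0.045 ms. Only LZ4, which is faster at 0.012 ms but compresses slightly less, is comparable.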
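
To put the sizes in the table on a common scale, the following minimal sketch converts them into compression ratios (original size divided by compressed size). The class and method names are hypothetical and exist only for illustration; the byte counts are taken from the table above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: turns the sizes reported in the benchmark table into ratios.
final class CompressionRatioSketch {
    // Three 1 KB messages, 3102 bytes in total (from the benchmark description).
    static final int ORIGINAL_BYTES = 3102;

    // Compression ratio = original size / compressed size.
    static double ratio(int compressedBytes) {
        return (double) ORIGINAL_BYTES / compressedBytes;
    }

    public static void main(String[] args) {
        Map<String, Integer> sizes = new LinkedHashMap<>();
        sizes.put("Gzip", 396);
        sizes.put("Snappy", 1063);
        sizes.put("LZ4", 387);
        sizes.put("Zstandard (level 1)", 374);
        sizes.put("Zstandard (level 22)", 368);
        for (Map.Entry<String, Integer> e : sizes.entrySet()) {
            System.out.printf("%-22s %.2f%n", e.getKey(), ratio(e.getValue()));
        }
    }
}
```

With these numbers, Zstandard at level 1 comes to roughly 8.3:1, against about 8.0:1 for LZ4 and 2.9:1 for Snappy.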

Public Interfaces

This feature requires changes to both the configuration options and the binary log format.

Configuration

A new value, 'zstd', will be added to the compression.type property, which is used to configure the producer, topic, and broker.
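
As an illustration, a producer configuration using the proposed value might look like the sketch below. It only builds a java.util.Properties object, so it runs without a Kafka dependency; the class name and the broker address are placeholders, and 'zstd' is accepted only once this proposal is implemented.

```java
import java.util.Properties;

// Illustrative only: builds producer settings with the proposed compression type.
final class ZstdProducerConfigSketch {
    static Properties producerProps() {
        Properties props = new Properties();
        // Placeholder broker address; replace with a real bootstrap server.
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // The new value proposed by this KIP.
        props.setProperty("compression.type", "zstd");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(producerProps().getProperty("compression.type"));
    }
}
```

The same compression.type key would carry the new value at the topic and broker level.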

Binary Log Format

Bit 2 of the 1-byte "attributes" field in Message will be used to denote Zstandard compression. Currently, the first 3 bits (bit 0 through bit 2) of the attributes field are reserved for the compression codec. Since only 4 compression codecs (NoCompression, GZipCompression, SnappyCompression and LZ4Compression) are currently supported, their identifiers fit in bits 0 and 1, and bit 2 has not been used until now. In other words, adopting Zstandard will introduce a new bit flag in the binary log format.
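
The bit layout described above can be sketched as follows. This is a minimal illustration, not the draft implementation: the class and method names are hypothetical, the identifiers 0 to 3 for the existing codecs are assumed from their order in the list above, and the value 4 for Zstandard is inferred from the statement that bit 2 denotes it.

```java
// Illustrative only: reading and writing the codec bits of the attributes byte.
final class AttributesCodecSketch {
    // Bits 0..2 of the attributes byte are reserved for the compression codec.
    static final int CODEC_MASK = 0x07;

    // Assumed identifiers of the four existing codecs (they fit in bits 0 and 1).
    static final int NONE = 0;
    static final int GZIP = 1;
    static final int SNAPPY = 2;
    static final int LZ4 = 3;
    // Assumed identifier for Zstandard: the first value that sets bit 2.
    static final int ZSTD = 1 << 2;

    // Extracts the codec identifier from an attributes byte.
    static int codecId(byte attributes) {
        return attributes & CODEC_MASK;
    }

    // Returns a copy of the attributes byte with the codec bits replaced.
    static byte withCodec(byte attributes, int codecId) {
        return (byte) ((attributes & ~CODEC_MASK) | (codecId & CODEC_MASK));
    }

    // True when bit 2 is set, which none of the four existing codecs do.
    static boolean usesBit2(byte attributes) {
        return (attributes & 0x04) != 0;
    }

    public static void main(String[] args) {
        byte attrs = withCodec((byte) 0, ZSTD);
        System.out.println("codec id = " + codecId(attrs) + ", bit 2 set = " + usesBit2(attrs));
    }
}
```

Because the existing codecs leave bit 2 clear, a reader that masks only bits 0 and 1 would misread a Zstandard message as uncompressed, which is why the change is visible in the binary log format.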

Proposed Changes

Add a new dependency on the Java bindings for Zstandard compression.
Add a new value to the CompressionType enum and define ZStdCompressionCodec in the kafka.message package.
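
The enum change listed above could look roughly like the following sketch. It is a simplified stand-in rather than Kafka's actual CompressionType: the name CompressionTypeSketch, its fields, and the lookup helpers are hypothetical, and the numeric identifiers are assumptions.

```java
import java.util.Locale;

// Illustrative only: a simplified compression-type enum with a new ZSTD entry.
enum CompressionTypeSketch {
    NONE(0, "none"),
    GZIP(1, "gzip"),
    SNAPPY(2, "snappy"),
    LZ4(3, "lz4"),
    ZSTD(4, "zstd"); // proposed addition; the id 4 is an assumption

    final int id;            // value stored in the message attributes
    final String configName; // value accepted by compression.type

    CompressionTypeSketch(int id, String configName) {
        this.id = id;
        this.configName = configName;
    }

    // Resolves a compression.type setting to an enum value.
    static CompressionTypeSketch forName(String name) {
        String wanted = name.toLowerCase(Locale.ROOT);
        for (CompressionTypeSketch type : values()) {
            if (type.configName.equals(wanted)) {
                return type;
            }
        }
        throw new IllegalArgumentException("Unknown compression name: " + name);
    }

    // Resolves an identifier read from a message to an enum value.
    static CompressionTypeSketch forId(int id) {
        for (CompressionTypeSketch type : values()) {
            if (type.id == id) {
                return type;
            }
        }
        throw new IllegalArgumentException("Unknown compression type id: " + id);
    }
}
```

For example, CompressionTypeSketch.forName("zstd") returns ZSTD, and an unrecognized name fails fast with an IllegalArgumentException.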

A proof-of-concept implementation of this feature is available in this Pull Request.

Compatibility, Deprecation, and Migration Plan

None.

Rejected Alternatives

None yet.

Related issues

This change raises some related issues for Kafka.

Whether to use existing library or not

There are two ways to integrate Zstandard into Kafka, each with its own pros and cons.

1. Use existing bindings.
- Pros
  - Quick to implement.
  - The build does not require Zstandard to be pre-installed in the environment.
- Cons
  - Someone has to keep track of updates to both the binding library and Zstandard itself and, if needed, update the binding library so that Kafka can pick up the changes.
2. Add existing JNI bindings directly.
- Pros
  - Only updates to Zstandard itself need to be tracked.
- Cons
  - Zstandard has to be pre-installed before building Kafka.
  - Somewhat more cumbersome to implement.

The draft implementation takes the first approach, following Kafka's existing Snappy support. (In contrast, Hadoop follows the latter approach.) The JNI binding library used is available here. However, since I am new to Kafka, I thought it would be better to discuss the alternatives.

Whether to support dictionary feature or not

Zstandard supports a dictionary feature, which improves compression efficiency by sharing a pre-trained dictionary between the compressor and the decompressor. Since Kafka log messages often contain repeated patterns, supporting this feature could improve efficiency further. However, it requires a new configuration option that points to the location of the dictionary.
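
To illustrate why a shared dictionary helps with small messages, the sketch below uses the preset-dictionary support of java.util.zip.Deflater as a stand-in. It is not Zstandard and does not use the Zstandard bindings; it only demonstrates the general idea with the JDK. The class name, the dictionary contents, and the sample message are all hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative only: DEFLATE with a preset dictionary, standing in for Zstandard.
final class SharedDictionarySketch {
    // Hypothetical dictionary made of fragments expected to recur across messages.
    static final byte[] DICTIONARY = ("{\"user_id\":\"\",\"event\":\"page_view\","
            + "\"timestamp\":\"2016-\",\"url\":\"https://example.com/\"}")
            .getBytes(StandardCharsets.UTF_8);

    // Hypothetical small message that shares structure with the dictionary.
    static final byte[] SAMPLE_MESSAGE = ("{\"user_id\":\"42\",\"event\":\"page_view\","
            + "\"timestamp\":\"2016-09-01T00:00:00Z\",\"url\":\"https://example.com/index\"}")
            .getBytes(StandardCharsets.UTF_8);

    // Compresses input; pass null to compress without a dictionary.
    static byte[] compress(byte[] input, byte[] dictionary) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        if (dictionary != null) {
            deflater.setDictionary(dictionary); // must happen before compression starts
        }
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompresses data, supplying the same dictionary when the stream asks for it.
    static byte[] decompress(byte[] compressed, byte[] dictionary, int originalLength) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] out = new byte[originalLength];
        int offset = 0;
        try {
            while (!inflater.finished() && offset < out.length) {
                int n = inflater.inflate(out, offset, out.length - offset);
                if (n == 0) {
                    if (inflater.needsDictionary()) {
                        inflater.setDictionary(dictionary);
                    } else {
                        break; // no progress possible (for example, truncated input)
                    }
                }
                offset += n;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("Corrupt compressed data", e);
        } finally {
            inflater.end();
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("original:           " + SAMPLE_MESSAGE.length + " bytes");
        System.out.println("without dictionary: " + compress(SAMPLE_MESSAGE, null).length + " bytes");
        System.out.println("with dictionary:    " + compress(SAMPLE_MESSAGE, DICTIONARY).length + " bytes");
    }
}
```

Note that the decompressor needs the same dictionary as the compressor, which is why a dictionary-aware codec in Kafka would need a way to locate it on both the producing and consuming side.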

KIP-110: Add Codec for ZStandard Compression
