Status

Current state: "Under Discussion"

Discussion thread: here

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

This proposal suggests adding compression level and compression buffer size options to the producer, topic, and broker config.

Compression trades CPU (running time) against I/O (compressed size). Because the best balance depends on the use case, many compression algorithms let users control the compression level and ship a reasonable default that performs well in general. The compression ratio is also affected by the buffer size, so some users may want to trade compression ratio for lower or higher memory use.

However, Kafka does not provide a way to configure these options. Although the default settings perform well, there are cases they don't fit. For example:

zstd supports a wide range of compression levels that reach the Pareto frontier, meaning that it decompresses faster than other algorithms with a similar or better compression ratio. Preventing users from adjusting the compression level therefore gives up much of zstd's potential. In fact, Sarama, a Go client for Kafka, already supports a compression level setting for gzip and zstd.
The default buffer size for lz4 is fairly small (64 KB). Increasing this value can improve the compression ratio.

Public Interfaces

This feature introduces new options for compression level and buffer size to the producer, topic, and broker configuration. There are two alternatives for the public interface:

Type A: Traditional Style

This approach introduces two config entities: compression.level and compression.buffer.size.

compression.type=gzip
compression.level=4				# NEW: Compression level to be used.
compression.buffer.size=8192	# NEW: The size of compression buffer to be used.

For the values available for each compression type, please refer to the 'Compression level' and 'Compression buffer size' subsections under 'Proposed Changes'.
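
To illustrate what the two Type A options would control for gzip, here is a minimal sketch that uses only the JDK. The class and method names (`TunableGzip`, `LevelGzipOutputStream`, `compress`, `decompress`) are hypothetical and are not taken from Kafka; the sketch only assumes that `compression.level` maps to the Deflater level and `compression.buffer.size` to the stream's internal buffer size.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch, not Kafka's implementation. */
final class TunableGzip {

    /** GZIPOutputStream subclass that sets the level on the inherited Deflater. */
    static final class LevelGzipOutputStream extends GZIPOutputStream {
        LevelGzipOutputStream(OutputStream out, int bufferSize, int level) throws IOException {
            super(out, bufferSize); // bufferSize: internal output buffer, in bytes
            def.setLevel(level);    // 'def' is the protected Deflater field; rejects invalid levels
        }
    }

    /** Compresses input with the given level (1..9) and buffer size (bytes). */
    static byte[] compress(byte[] input, int level, int bufferSize) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream gzip = new LevelGzipOutputStream(sink, bufferSize, level)) {
            gzip.write(input);
        }
        return sink.toByteArray();
    }

    /** Decompresses gzip data; used here only to check the round trip. */
    static byte[] decompress(byte[] compressed) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (InputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gzip.read(chunk)) != -1) {
                sink.write(chunk, 0, n);
            }
        }
        return sink.toByteArray();
    }
}
```

With the example configuration above, the call would be `TunableGzip.compress(payload, 4, 8192)`. The level only changes how hard the compressor works; the output remains a standard gzip stream that any reader can decompress.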

Pros:

Easy to adopt.

Cons:

It adds two configuration entities, and may require more as compression codecs evolve.
Every time users change the compression type, they must update `compression.level` and `compression.buffer.size` accordingly, which is error-prone.

Type B: Key-Value Style (proposed by Becket Qin)

This approach introduces only one config entity, compression.config. It contains a comma-separated list of KEY:VALUE pairs, like listener.security.protocol.map or max.connections.per.ip.overrides.

compression.type=gzip
compression.config=gzip.level:4,gzip.buffer.size:8192,snappy.buffer.size:32768,lz4.level:9,lz4.buffer.size:4,zstd.level:3	# NEW

The available KEYs are: gzip.level, gzip.buffer.size, snappy.buffer.size, lz4.level, lz4.buffer.size, zstd.level. For the values available for each key, please refer to the 'Compression level' and 'Compression buffer size' subsections under 'Proposed Changes'.
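
The KEY:VALUE format described above could be parsed along the following lines. This is a minimal sketch under stated assumptions: `CompressionConfigParser`, `parse`, and `forCodec` are hypothetical names, not Kafka APIs, and the sketch assumes every value is an integer, as in the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of Kafka. Parses a compression.config value. */
final class CompressionConfigParser {

    /** Parses "key:value,key:value" into an ordered map of integer options. */
    static Map<String, Integer> parse(String value) {
        Map<String, Integer> options = new LinkedHashMap<>();
        if (value == null || value.trim().isEmpty()) {
            return options; // nothing configured: every codec keeps its defaults
        }
        for (String rawEntry : value.split(",")) {
            String entry = rawEntry.trim();
            int sep = entry.lastIndexOf(':');
            if (sep <= 0 || sep == entry.length() - 1) {
                throw new IllegalArgumentException("Malformed entry: '" + entry + "'");
            }
            String key = entry.substring(0, sep).trim();
            String raw = entry.substring(sep + 1).trim();
            try {
                options.put(key, Integer.parseInt(raw));
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException(
                        "Non-integer value for " + key + ": '" + raw + "'", e);
            }
        }
        return options;
    }

    /** Keeps only the options of one codec, with the "codec." prefix stripped. */
    static Map<String, Integer> forCodec(Map<String, Integer> options, String codec) {
        Map<String, Integer> selected = new LinkedHashMap<>();
        String prefix = codec + ".";
        for (Map.Entry<String, Integer> e : options.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                selected.put(e.getKey().substring(prefix.length()), e.getValue());
            }
        }
        return selected;
    }
}
```

Selecting by codec prefix is what makes switching `compression.type` cheap in this style: the options for the other codecs stay in the string and are simply not selected.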

Pros:

1. Adds only one configuration entity and makes it easy to follow compression codecs' updates: if a codec gains a new configuration option, it only requires a new key, not a new configuration entity.
2. Easy to switch compression type: the Kafka broker picks up the appropriate configuration for the selected compression type.

Cons:

1. Slightly more complicated to set up initially.

Proposed Changes

Compression is performed with the specified level and the specified compression buffer size, according to the following rules:

If the specified option is not supported by the codec ('compression.type'), it is ignored (e.g., compression level for snappy or buffer size for zstd).
If the specified value is out of range or otherwise invalid, an error is raised.
If no value is specified, the codec's default is used.

The valid range and default value of the compression level and buffer size are entirely up to the compression library, so they may change in the future. As of June 2019, the values are as follows:

Compression level

Compression Codec	Availability	Valid Range	Default
gzip	Yes	1 (Deflater.BEST_SPEED) ~ 9 (Deflater.BEST_COMPRESSION)	6
snappy	No	-	-
lz4	Yes	1 ~ 17	9
zstd	Yes	-131072 ~ 22	3
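
The three rules above, combined with the level ranges in this table, can be sketched as follows. `CompressionLevels`, `Codec`, and `resolve` are hypothetical names chosen for illustration; the ranges and defaults are copied from the table and, as noted, may change with the underlying libraries.

```java
import java.util.OptionalInt;

/** Hypothetical sketch; names and structure are not from Kafka. */
final class CompressionLevels {

    /** Per-codec level limits, copied from the table above (as of June 2019). */
    enum Codec {
        GZIP(true, 1, 9, 6),
        SNAPPY(false, 0, 0, 0), // no level support; the numbers are unused
        LZ4(true, 1, 17, 9),
        ZSTD(true, -131072, 22, 3);

        final boolean supportsLevel;
        final int min;
        final int max;
        final int defaultLevel;

        Codec(boolean supportsLevel, int min, int max, int defaultLevel) {
            this.supportsLevel = supportsLevel;
            this.min = min;
            this.max = max;
            this.defaultLevel = defaultLevel;
        }
    }

    /**
     * Returns the level to compress with, or empty if the codec has no level.
     * A null 'requested' means the user did not specify a level.
     */
    static OptionalInt resolve(Codec codec, Integer requested) {
        if (!codec.supportsLevel) {
            return OptionalInt.empty(); // unsupported option: ignored
        }
        if (requested == null) {
            return OptionalInt.of(codec.defaultLevel); // not specified: default
        }
        if (requested < codec.min || requested > codec.max) {
            throw new IllegalArgumentException( // invalid value: error
                    "Level " + requested + " is out of range [" + codec.min + ", "
                            + codec.max + "] for " + codec);
        }
        return OptionalInt.of(requested);
    }
}
```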

Compression buffer size

Compression Codec	Availability	Valid Range	Default	Note
gzip	Yes	Positive Integer	8192 (8 KB)	Kafka's own default.
snappy	Yes	Positive Integer	32768 (32 KB)	Library default.
lz4	Yes	4 ~ 7 (4=64 KB, 5=256 KB, 6=1 MB, 7=4 MB)	4 (64 KB)	Kafka's own default.
zstd	No	-	-	-
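
Unlike gzip and snappy, the lz4 buffer size is given as an identifier from 4 to 7 rather than a byte count. The mapping listed in the table quadruples at each step, which can be written as 2^(2 * id + 8) bytes. The sketch below encodes that arithmetic; `Lz4BlockSize` and `toBytes` are hypothetical names used only for illustration.

```java
/** Hypothetical sketch of the lz4 buffer size mapping shown in the table. */
final class Lz4BlockSize {

    /** Converts an lz4 buffer size identifier (4..7) into a size in bytes. */
    static int toBytes(int blockSizeId) {
        if (blockSizeId < 4 || blockSizeId > 7) {
            throw new IllegalArgumentException(
                    "lz4 buffer size must be between 4 and 7, got " + blockSizeId);
        }
        // 4 -> 2^16 (64 KB), 5 -> 2^18 (256 KB), 6 -> 2^20 (1 MB), 7 -> 2^22 (4 MB)
        return 1 << (2 * blockSizeId + 8);
    }
}
```

So `lz4.buffer.size:4` in the Type B example corresponds to a 65536-byte buffer, and moving to 7 raises it to 4194304 bytes.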

Compatibility, Deprecation, and Migration Plan

Since the default compression level and the current buffer size are used when these options are not set, there is no backward compatibility problem.

Rejected Alternatives

Can we support compression level feature only?

We can, but after some discussion we decided that supporting both options at once is better, since both of them affect compression. We therefore expanded the initial proposal, which handled compression level only.

Can we support universal 'default compression level' value for producer config?

Impossible. Currently, most compression codecs let the user adjust the compression level with an int value, and this seems unlikely to change. However, not all of these codecs have a value that denotes the 'default compression level', and where one exists, the value differs between codecs. For example, gzip uses -1 for the default level, while zstd used 0; and since the latest release of zstd allows negative compression levels, the meaning of level 0 is also changing.

For these reasons, we can't provide a universal int value to denote default compression level.

Can we use external dictionary feature?

This feature requires an option to specify the dictionary for the supported codecs, e.g., snappy, lz4, and zstd. That is clearly beyond the scope of this KIP.


KIP-390: Allow fine-grained configuration for compression
