...

This is especially an issue for services like MirrorMaker, whose producer is shared by many different topics.

Public Interfaces

We want to introduce a new configuration, enable.compression.ratio.estimation, to allow users to opt out of compression ratio estimation and use the uncompressed size for batching directly.

Proposed Changes

We want to introduce a new configuration, enable.compression.ratio.estimation, to allow users to opt out of compression ratio estimation and use the uncompressed size for batching directly.

The default value of this configuration will be true.

When enable.compression.ratio.estimation is set to false, the producer will stop estimating the compressed batch size and simply use the uncompressed message size for batching. In practice, if the batch size is set close to the max message size, the compression ratio may not be an issue. If a single message is larger than batch.size, the behavior is the same as the current behavior, i.e. it will be put into a new batch on its own.
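If adopted, opting out could look like the following producer configuration. The property name is the one proposed by this KIP; the batch.size value is purely illustrative:

```properties
# Proposed in this KIP: disable compression ratio estimation and
# batch on uncompressed size instead (default is true).
enable.compression.ratio.estimation=false
# Illustrative value: keeping batch.size well below the broker's max
# message size means uncompressed batching cannot exceed the limit.
batch.size=16384
```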

This approach guarantees that the compressed message size will be less than the max message size. As long as the batch size is set to a reasonable value, the compression ratio is unlikely to suffer.

Compatibility, Deprecation, and Migration Plan

This KIP only introduces a new configuration. The change is completely backwards compatible.

Rejected Alternatives

This KIP tries to solve the issue by doing the following:

  1. Change the way to estimate the compression ratio
  2. Split the oversized batch and resend the split batches. 

Public Interfaces

This KIP introduces a new producer metric, batch-split-rate, which records the rate of batch split occurrences.

Although there is no other public API change, because of the behavior change users may want to revisit their batch size settings to improve performance.

Proposed Changes

Decompress a batch that encounters a RecordTooLargeException, split it into two, and send the smaller batches again.
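The split-and-resend step can be sketched as follows. This is a simplified illustration under stated assumptions, not Kafka's actual implementation: `try_send` stands in for a produce attempt that reports whether the batch exceeded the size limit, and the recursion depth cap is a hypothetical safeguard.

```python
# Illustrative sketch of splitting an oversized batch in two and
# retrying each half (not the actual Kafka producer code).

def split_batch(records):
    """Split a list of records into two halves for resending."""
    mid = (len(records) + 1) // 2
    return records[:mid], records[mid:]

def send_with_split(records, try_send, max_splits=5):
    """try_send(records) returns True on success, False when the batch
    is too large (the RecordTooLargeException case). On failure, split
    the batch and recurse, up to max_splits levels deep."""
    if try_send(records):
        return True
    if max_splits == 0 or len(records) <= 1:
        return False  # a single record is still too large: give up
    first, second = split_batch(records)
    return (send_with_split(first, try_send, max_splits - 1) and
            send_with_split(second, try_send, max_splits - 1))
```

Note that, as the caveats below point out, one split is not guaranteed to be enough, which is why the sketch recurses.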

This approach has a few caveats:

  1. More overhead is introduced on the producer side. The producer has to decompress the batch, regroup the messages, and resend them. If the producer keeps the original uncompressed messages to avoid the potential decompression, it incurs a large memory overhead.
  2. The split batches are not guaranteed to be smaller than the max size limit, so there may be multiple retries until the messages get through, or fail.
  3. In a scenario such as MirrorMaker, where topics compress differently, some topics may have a compression ratio very different from the average. This can trigger many splits and resends, which introduces a lot of overhead.

Keep per topic compression ratio

...

To address the above caveats, we propose to also change the way to estimate the compression ratio:

  1. Estimate the compression ratio for each topic independently.
  2. Given that COMPRESSION_RATIO = COMPRESSED_SIZE / UNCOMPRESSED_SIZE, change the compression ratio estimation from a weighted average over a sliding window to the following:
    1. Initially set ESTIMATED_RATIO = 1.0
    2. If OBSERVED_RATIO < ESTIMATED_RATIO, decrease the ESTIMATED_RATIO by COMPRESSION_RATIO_IMPROVING_STEP (0.005)
    3. If OBSERVED_RATIO > ESTIMATED_RATIO, increase the ESTIMATED_RATIO by COMPRESSION_RATIO_DETERIORATE_STEP (0.05)
    4. If batch split occurred, reset the ESTIMATED_RATIO to 1.0
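The estimation steps above can be sketched as follows. This is a minimal illustration of the update rule described in this KIP, not Kafka's actual estimator; the class and method names are hypothetical, while the step constants and their values come from the list above.

```python
# Step constants taken from the KIP text above.
COMPRESSION_RATIO_IMPROVING_STEP = 0.005
COMPRESSION_RATIO_DETERIORATE_STEP = 0.05

class CompressionRatioEstimator:
    """Per-topic compression ratio estimator (illustrative sketch)."""

    def __init__(self):
        self._ratio = {}  # topic -> estimated compression ratio

    def estimation(self, topic):
        # Step 1: the estimate starts at 1.0 (i.e. assume no compression).
        return self._ratio.get(topic, 1.0)

    def update(self, topic, observed_ratio):
        estimated = self.estimation(topic)
        if observed_ratio < estimated:
            # Step 2: observed ratio improved; creep down slowly.
            estimated -= COMPRESSION_RATIO_IMPROVING_STEP
        elif observed_ratio > estimated:
            # Step 3: observed ratio deteriorated; back off quickly.
            estimated += COMPRESSION_RATIO_DETERIORATE_STEP
        self._ratio[topic] = estimated
        return estimated

    def on_split(self, topic):
        # Step 4: a batch split means the estimate was too optimistic.
        self._ratio[topic] = 1.0
```

The asymmetric step sizes make the estimate conservative: it deteriorates ten times faster than it improves, which biases the producer toward under-filling batches rather than splitting them.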

Based on the test in this patch, the chance of splitting a batch is well below 10% even when the compression ratios of messages in the same topic vary widely.

Compatibility, Deprecation, and Migration Plan

The KIP is backwards compatible.

Rejected Alternatives

Batching the messages based on uncompressed bytes

Introduce a new configuration, enable.compression.ratio.estimation, to allow users to opt out of compression ratio estimation and use the uncompressed size for batching directly.

The downsides of this approach are:

  1. It adds a new configuration to the producer, exposing some nuances users must understand.
  2. For highly compressible messages, users may still need to guess the compression ratio to ensure the compressed batches are not too small.