Status
Current state: Under Discussion
...
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
Kafka enforces a strict limit on message size. The limit also applies to compressed messages.
...
- Change how the compression ratio is estimated.
- Split the oversized batch and resend the split batches.
Public Interfaces
This KIP adds a new producer metric, batch-split-rate, which records the rate at which batches are split.
There is no other public API change, but because the behavior changes, users may want to revisit their batch size setting to improve performance.
Proposed Changes
Decompress the batch which encounters a RecordTooLargeException, split it into two and send it again
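The split step named above can be illustrated with a minimal Java sketch. It is only an approximation under stated assumptions: the class BatchSplitSketch and the method splitInTwo are hypothetical names, the records are modeled as already-decompressed byte arrays, and the split point is chosen by record count rather than by bytes. It is not the producer's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: split the records of an oversized batch into two halves. */
final class BatchSplitSketch {

    /**
     * Returns two lists that together hold all records in their original order.
     * Assumption: the batch has already been decompressed into individual records.
     * A batch with a single record yields an empty second half and cannot shrink further.
     */
    static List<List<byte[]>> splitInTwo(List<byte[]> records) {
        int mid = (records.size() + 1) / 2; // first half gets the extra record
        List<List<byte[]>> halves = new ArrayList<>();
        halves.add(new ArrayList<>(records.subList(0, mid)));
        halves.add(new ArrayList<>(records.subList(mid, records.size())));
        return halves;
    }

    public static void main(String[] args) {
        List<byte[]> records = Arrays.asList(
                new byte[] {1}, new byte[] {2}, new byte[] {3}, new byte[] {4}, new byte[] {5});
        List<List<byte[]>> halves = splitInTwo(records);
        // Each half would then be re-compressed and sent as its own batch.
        System.out.println(halves.get(0).size() + " + " + halves.get(1).size() + " records");
    }
}
```

Splitting by count keeps the sketch short; splitting by accumulated bytes would balance the halves better when record sizes vary widely.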
...
- Choosing COMPRESSION_RATIO_IMPROVING_STEP to be 0.005 means that it takes about 20 batches to decrease the estimated compression ratio by 0.1. So in the normal case, after a batch is split, it takes quite a few batches before the estimate is again low enough to cause another batch split (assuming no batch in between deteriorates the compression ratio estimate).
- Use 10:1 as COMPRESSION_RATIO_DETERIORATE_STEP:COMPRESSION_RATIO_IMPROVING_STEP. As an intuitive approximation, statistically speaking, this allows the ratio of high compression ratio messages to low compression ratio messages to be less than 10:1 without a split. For example, when 9 batches improve the compression ratio, the 10th batch may deteriorate it.
- For compression efficiency, 1.05 is chosen as the COMPRESSION_RATIO_ESTIMATION_FACTOR. This value is what we are using today, and it gives a reasonable contingency for batch size estimation.
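The arithmetic behind these three constants can be checked with a small Java sketch. It rests on several assumptions that are not spelled out in this section: the class and method names are hypothetical, the estimate starts at 1.0, the 0.005 step is the one applied when a batch compresses better than estimated (which is what "about 20 batches to decrease the ratio by 0.1" implies), the 10:1 rule makes the opposite step 0.05, and each step is clamped so the estimate never overshoots the observed ratio.

```java
/** Hypothetical sketch of a step-based compression ratio estimate. */
final class CompressionRatioEstimatorSketch {
    // Assumed values: 0.005 from the text, 0.05 derived from the 10:1 rule.
    static final double IMPROVING_STEP = 0.005;
    static final double DETERIORATE_STEP = 0.05;
    static final double ESTIMATION_FACTOR = 1.05;

    // Ratio = compressed size / uncompressed size, so lower is better.
    // Assumption: start from the pessimistic value 1.0 (no compression).
    private double estimate = 1.0;

    /** Moves the estimate one step toward the ratio observed for a finished batch. */
    double update(double observedRatio) {
        if (observedRatio < estimate) {
            // Batch compressed better than expected: improve slowly.
            estimate = Math.max(estimate - IMPROVING_STEP, observedRatio);
        } else if (observedRatio > estimate) {
            // Batch compressed worse than expected: back off ten times faster.
            estimate = Math.min(estimate + DETERIORATE_STEP, observedRatio);
        }
        return estimate;
    }

    double estimate() {
        return estimate;
    }

    /** Estimated compressed size of a batch, padded by the estimation factor. */
    double estimatedBatchBytes(long uncompressedBytes) {
        return uncompressedBytes * estimate * ESTIMATION_FACTOR;
    }

    public static void main(String[] args) {
        CompressionRatioEstimatorSketch estimator = new CompressionRatioEstimatorSketch();
        for (int i = 0; i < 20; i++) {
            estimator.update(0.5); // 20 well-compressed batches in a row
        }
        System.out.println("estimate after 20 improving batches: " + estimator.estimate());
        estimator.update(1.0); // one poorly compressed batch
        System.out.println("estimate after 1 deteriorating batch: " + estimator.estimate());
    }
}
```

Under these assumptions, twenty improving batches lower the estimate by about 0.1, and a single deteriorating batch undoes half of that, which is the asymmetry the 10:1 ratio is meant to provide.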
Compatibility, Deprecation, and Migration Plan
The KIP is backwards compatible.
Rejected Alternatives
Batching the messages based on uncompressed bytes
Introduce a new configuration, enable.compression.ratio.estimation, that lets users opt out of compression ratio estimation and batch directly on the uncompressed size instead.
...