...

The problem happens because the "stickiness" time is driven by new batch creation, which depends on broker latency: slower brokers drain batches more slowly, so they hold the "sticky" assignment longer than faster brokers, skewing the distribution. The details of the scenario are described well here.

Suppose we have a producer writing to 3 partitions with `linger.ms=0`, and one partition slows down a little for some reason. It could be a leader change or a transient network issue. The producer will have to hold onto the batches for that partition until it becomes available again. While it holds onto those batches, additional batches pile up behind them. Each of these batches is likely to get filled because the producer cannot send to this partition yet.

Consider this from the perspective of the sticky partitioner. Every time the slow partition gets selected, the producer fills its batches completely. The remaining "fast" partitions, on the other hand, will likely not get their batches filled because of the `linger.ms=0` setting: as soon as a single record is available, it may be sent. So more data ends up being written to the partition that has already started to build a backlog. Even after the cause of the original slowness (e.g. a leader change) is resolved, it can take some time for this imbalance to recover. We believe this can even create a runaway effect if the partition cannot catch up under the additional load.
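As an illustration, here is a toy Java simulation of this feedback loop (the record and batch sizes are made up for the example, and the model is deliberately crude). It models the sticky partitioner switching on new batch creation: with `linger.ms=0` the fast partitions send almost immediately, so a new batch is created and the partitioner switches after roughly every record, while the slow partition only triggers a switch once its batch fills.

```java
import java.util.concurrent.ThreadLocalRandom;

/**
 * Toy model of the feedback loop: partition 0 is "slow", so the
 * partitioner stays on it until a full batch accumulates; partitions
 * 1 and 2 are "fast", so with linger.ms=0 a new batch is created
 * (and the partitioner switches) after roughly every record.
 */
public class StickySkewModel {
    public static void main(String[] args) {
        final int recordBytes = 1_000;   // illustrative record size
        final int batchBytes = 16_000;   // illustrative batch size
        long[] totalBytes = new long[3];
        int current = ThreadLocalRandom.current().nextInt(3);
        long onCurrentBatch = 0;

        for (int i = 0; i < 1_000_000; i++) {
            totalBytes[current] += recordBytes;
            onCurrentBatch += recordBytes;
            // Fast partitions trigger a switch after ~1 record; the
            // slow partition only after its batch fills completely.
            long switchAfter = (current == 0) ? batchBytes : recordBytes;
            if (onCurrentBatch >= switchAfter) {
                onCurrentBatch = 0;
                int next;
                do {
                    next = ThreadLocalRandom.current().nextInt(3);
                } while (next == current);
                current = next;
            }
        }
        for (int p = 0; p < 3; p++)
            System.out.printf("partition %d: %,d bytes%n", p, totalBytes[p]);
    }
}
```

Even in this crude model the slow partition ends up with the large majority of the bytes, the same shape of imbalance as in the measurements below.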

We analyzed one case where we suspected this was happening. Below I've summarized the writes to 3 partitions over a period of one hour. Partition 0 is the "slow" partition. All partitions get roughly the same number of batches, but the slow partition has much larger batches.

Partition  TotalBatches  TotalBytes  TotalRecords  BytesPerBatch  RecordsPerBatch
0          1683          25953200    25228         15420.80       14.99
1          1713          7836878     4622          4574.94        2.70
2          1711          7546212     4381          4410.41        2.56

After restarting the application, the producer was healthy again; it simply could not recover on its own under the imbalanced workload.

This is not the only problem. Even when all brokers are uniformly fast, with `linger.ms=0` and many brokers the sticky partitioner does not create batches as efficiently. Consider this scenario: say we have 30 partitions, each with its leader on its own broker.

...

Instead of switching partitions on every batch creation, switch partitions every time `partitioner.sticky.batch.size` bytes have been produced to a partition. Say we're producing to partition 1. After 16KB has been produced to partition 1, we switch to partition 42. After 16KB has been produced to partition 42, we switch to partition 3. And so on. We do this regardless of what happens with batching; we just count the bytes produced to each partition. This way the distribution is both uniform (barring restarts) and sticky even if `linger.ms=0`, because more consecutive records are directed to a partition, allowing it to create better batches.
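To make the proposal concrete, here is a minimal, single-threaded Java sketch of the counting logic. The class and field names are ours, not from the actual producer code, and a real partitioner would also need to be thread-safe.

```java
import java.util.concurrent.ThreadLocalRandom;

/**
 * Sketch of the proposed switching rule: stay on one partition until
 * partitioner.sticky.batch.size bytes (e.g. 16 KB) have been produced
 * to it, then pick a different partition uniformly at random.
 */
public class ByteCountStickyPartitioner {
    private final int numPartitions;
    private final long switchBytes;   // partitioner.sticky.batch.size
    private int current;
    private long producedToCurrent = 0;

    public ByteCountStickyPartitioner(int numPartitions, long switchBytes) {
        this.numPartitions = numPartitions;
        this.switchBytes = switchBytes;
        this.current = ThreadLocalRandom.current().nextInt(numPartitions);
    }

    /** Returns the partition for a record of the given serialized size. */
    public int partition(int recordBytes) {
        int partition = current;
        producedToCurrent += recordBytes;
        // Switch purely on bytes produced, regardless of what happens
        // with batching or broker responsiveness.
        if (producedToCurrent >= switchBytes) {
            producedToCurrent = 0;
            if (numPartitions > 1) {
                int next;
                do {
                    next = ThreadLocalRandom.current().nextInt(numPartitions);
                } while (next == current);
                current = next;
            }
        }
        return partition;
    }
}
```

Because the switch decision looks only at bytes produced, a slow partition cannot hold the partitioner's attention any longer than a fast one, which removes the feedback loop described above.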

...