

Status

Current state: Under Discussion

Discussion thread: here 

JIRA: here

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

KIP-480: Sticky Partitioner introduced the UniformStickyPartitioner and made it the default partitioner.  It turned out that, despite its name, the UniformStickyPartitioner is not uniform in a problematic way: it actually distributes more records to slower brokers and can cause a "runaway" problem, where a temporary slowness of a broker skews the distribution so that the broker receives more records, becomes even slower as a result, which skews the distribution further, and the problem perpetuates itself.

The problem happens because the "stickiness" time is driven by new batch creation, which is reciprocal to broker latency: slower brokers drain batches more slowly, so they get more of the "sticky" time than faster brokers, skewing the distribution.  The details of the scenario are described well here.

Suppose that we have a producer writing to 3 partitions with linger.ms=0 and one partition slows down a little bit for some reason. It could be a leader change or some transient network issue. The producer will have to hold onto the batches for that partition until it becomes available. While it is holding onto those batches, additional batches will begin piling up. Each of these batches is likely to get filled because the producer is not ready to send to this partition yet.

Consider this from the perspective of the sticky partitioner. Every time the slow partition gets selected, the producer will fill the batches completely. On the other hand, the remaining "fast" partitions will likely not get their batches filled because of the `linger.ms=0` setting. As soon as a single record is available, it might get sent. So more data ends up getting written to the partition that has already started to build a backlog. And even after the cause of the original slowness (e.g. leader change) gets resolved, it might take some time for this imbalance to recover. We believe this can even create a runaway effect if the partition cannot catch up with the handicap of the additional load.

We analyzed one case where we thought this might be going on. Below I've summarized the writes over a period of one hour to 3 partitions. Partition 0 here is the "slow" partition. All partitions get roughly the same number of batches, but the slow partition has much bigger batch sizes.

Partition TotalBatches TotalBytes TotalRecords BytesPerBatch RecordsPerBatch
0         1683         25953200   25228        15420.80      14.99        
1         1713         7836878    4622         4574.94       2.70
2         1711         7546212    4381         4410.41       2.56

After restarting the application, the producer was healthy again. It just was not able to recover with the imbalanced workload.

This is not the only problem: even when all brokers are uniformly fast, with linger.ms=0 and many brokers, the sticky partitioner doesn't create batches efficiently.  Consider a scenario where we have 30 partitions, each with its leader on its own broker.

  1. a record is produced, partitioner assigns to partition1, batch becomes ready and sent out immediately

  2. a record is produced, partitioner sees that a new batch is created, triggers reassignment, assigns to partition2, batch becomes ready and sent out immediately

  3. a record is produced, partitioner sees that a new batch is created, triggers reassignment, assigns to partition3, batch becomes ready and sent out immediately

and so on.  (The actual assignment is random, but on average we'd rotate over all partitions more or less uniformly.)  Then the whole loop repeats (the pattern will be the same because we allow 5 in-flight requests), and again; while it's doing that, the first batch on the first broker may complete, in which case a single-record batch may become ready again, and so on.  This is probably not that big of a deal when the number of brokers is small (or, to be precise, the number of brokers that happen to host partitions of one topic), but it's good to understand the dynamics.

So in some sense, the UniformStickyPartitioner is neither uniform nor sufficiently sticky.

Public Interfaces

org.apache.kafka.clients.producer.Partitioner

The Partitioner.partition method can now return -1 to indicate that the default partitioning decision should be made by the producer itself.  Previously, Partitioner.partition was required to return a valid partition number.  This departs from the paradigm that partitioning logic (including the default partitioning logic) is fully encapsulated in a partitioner object.  That encapsulation no longer works well, because the default decision requires information that only the producer internals (Sender, RecordAccumulator) can know, such as queue sizes, record sizes, and broker responsiveness.  See the Rejected Alternatives section for an attempt to preserve encapsulation of the default partitioning logic within a partitioner object.

When the producer gets -1 from the partitioner, it calculates the partition itself.  This way, custom partitioner logic continues to work: the producer uses the partition returned by the partitioner.  However, if the partitioner just wants the default partitioning logic, it can return -1 and let the producer figure out which partition to use.

This also seems more future-proof than trying to preserve (partial) encapsulation of partitioning logic within the default partitioner: if in the future we support additional signals, we can just change the logic in the producer, without extending the partitioner interface to pass additional information.
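The -1 contract can be illustrated with a toy model.  The interface and class below are simplified stand-ins invented for this sketch, not the actual Kafka classes:

```java
// Toy model of the proposed contract; simplified stand-ins, not the real Kafka interfaces.
interface ToyPartitioner {
    int UNKNOWN_PARTITION = -1; // sentinel meaning "let the producer decide"

    int partition(byte[] keyBytes, int numPartitions);
}

class ToyProducer {
    private final ToyPartitioner partitioner;
    private final int stickyPartition = 0; // stand-in for the producer's built-in sticky logic

    ToyProducer(ToyPartitioner partitioner) {
        this.partitioner = partitioner;
    }

    int resolvePartition(byte[] keyBytes, int numPartitions) {
        int p = partitioner.partition(keyBytes, numPartitions);
        if (p != ToyPartitioner.UNKNOWN_PARTITION)
            return p; // the custom partitioner made the decision
        return stickyPartition % numPartitions; // producer applies its default logic
    }
}
```

A custom partitioner can thus keep its special-case logic (e.g. for keyed records) and return -1 to fall back to the producer's default for everything else.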

New Configuration

partitioner.sticky.batch.size.  The default would be 0, in which case the batch.size would be used (we can change the default to batch.max.size once KIP-782 is implemented).  See the explanation in the Uniform Sticky Batch Size section.

enable.adaptive.partitioning.  The default would be 'true', if it's true then the producer will try to adapt to broker performance and produce more messages to partitions hosted on faster brokers.  If it's 'false', then the producer will try to distribute messages uniformly.

partition.availability.timeout.ms.  The default would be 0.  If the value is greater than 0 and adaptive partitioning is enabled, and the broker cannot accept a produce request to the partition for partition.availability.timeout.ms milliseconds, the partition is marked as not available.  If the value is 0, this logic is disabled.
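Putting the three settings together, a producer configuration exercising this proposal might look like the following (the values are illustrative, not recommendations):

```properties
# switch partitions after ~64KB instead of batch.size (0 = use batch.size)
partitioner.sticky.batch.size=65536
# adapt partition choice to broker load (the proposed default)
enable.adaptive.partitioning=true
# mark a partition unavailable if its batches sit ready for more than 2 seconds
partition.availability.timeout.ms=2000
```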

Proposed Changes

Uniform Sticky Batch Size

Instead of switching partitions on every batch creation, switch partitions every time partitioner.sticky.batch.size bytes have been produced to a partition.  Say we're producing to partition 1.  After 16KB have been produced to partition 1, we switch to partition 42.  After 16KB have been produced to partition 42, we switch to partition 3.  And so on.  We do this regardless of what happens with batching: we just count the bytes produced to a partition.  This way the distribution is both uniform (aside from small temporary imbalances) and sticky even with linger.ms=0, because more consecutive records are directed to a partition, allowing it to form better batches.
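The byte-counting switch can be sketched as follows.  This is an illustrative stand-alone sketch with made-up names, not the actual producer code; the threshold parameter stands in for partitioner.sticky.batch.size:

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch of byte-based sticky switching; not the actual producer code.
class StickyByteCounter {
    private final int numPartitions;
    private final int stickyBatchSize; // stands in for partitioner.sticky.batch.size
    private int currentPartition;
    private int producedBytes = 0;

    StickyByteCounter(int numPartitions, int stickyBatchSize) {
        this.numPartitions = numPartitions;
        this.stickyBatchSize = stickyBatchSize;
        this.currentPartition = ThreadLocalRandom.current().nextInt(numPartitions);
    }

    /** Returns the partition for the next record, switching after stickyBatchSize bytes. */
    int partition(int recordSizeBytes) {
        if (producedBytes >= stickyBatchSize && numPartitions > 1) {
            producedBytes = 0;
            // Pick a random partition other than the current one.
            int next = ThreadLocalRandom.current().nextInt(numPartitions - 1);
            currentPartition = next >= currentPartition ? next + 1 : next;
        }
        producedBytes += recordSizeBytes;
        return currentPartition;
    }
}
```

Note that the switch decision depends only on bytes produced, not on batch boundaries, which is what makes the distribution uniform regardless of how slowly a given partition drains.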

Let's consider how the batching is going to be different with a strictly uniform sticky partitioner and linger.ms=0 and 30 partitions each on its own broker.

  1. a record is produced, partitioner assigns to partition1, batch becomes ready and sent out immediately

  2. a record is produced, partitioner is still stuck to partition1, batch becomes ready and sent out immediately

  3. same thing

  4. --

  5. --

  6. a record is produced, partitioner is still stuck to partition1, now we have 5 in-flight, so batching begins

The batching will continue until either an in-flight batch completes or we hit the partitioner.sticky.batch.size bytes and move to the next partition.  This way it takes just 5 records to start batching.  This happens because once we have 5 batches in flight, the new batch won't be sent out immediately until at least one in-flight batch completes, so it keeps accumulating records.  With the current solution, it takes 5 x the number of partitions records to have enough batches in flight that a new batch won't be sent immediately.  As the production rate accelerates, more records can accumulate while 5 batches are in flight, so larger batches are used at higher production rates to sustain higher throughput.

If one of the brokers has higher latency, the records for the partitions hosted on that broker will form larger batches, but it's still the same amount of records, just sent less frequently in larger batches; the logic automatically adapts to that.

To summarize, the uniform sticky partitioner has the following advantages:

  1. It's uniform, which is simple to implement and easy to understand.  Intuitively, this is what users expect.

  2. It creates better batches, without adding linger latency on low production rate but switching to better batching on high production rate.

  3. It adapts to higher latency brokers, using larger batches to push data, keeping throughput and data distribution uniform.

  4. It's efficient (logic for selecting partitions doesn't require complex calculations).

Adaptive Partition Switching

One potential disadvantage of strictly uniform partition switching is that if one of the brokers is lagging behind (cannot sustain its share of throughput), records will keep piling up in the accumulator and will eventually exhaust the buffer pool memory, slowing the production rate down to match the capacity of the slowest broker.  To avoid this problem, the partition switch decision can adapt to broker load.

The queue size of batches waiting to be sent is a direct indication of broker load (more loaded brokers have longer queues).  Partition switching takes the queue sizes into account when choosing the next partition: the probability of choosing a partition is proportional to the inverse of its queue size (i.e. partitions with longer queues are less likely to be chosen).
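The inverse-queue-size weighting can be sketched like this (an illustrative stand-alone sketch with made-up names, not the actual implementation):

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch: pick a partition with probability proportional to 1 / (queueSize + 1).
class QueueAwareChooser {
    static int choose(int[] queueSizes) {
        double[] weights = new double[queueSizes.length];
        double total = 0;
        for (int i = 0; i < queueSizes.length; i++) {
            weights[i] = 1.0 / (queueSizes[i] + 1); // +1 avoids division by zero for empty queues
            total += weights[i];
        }
        // Sample an index according to the weights.
        double r = ThreadLocalRandom.current().nextDouble(total);
        for (int i = 0; i < weights.length; i++) {
            r -= weights[i];
            if (r <= 0)
                return i;
        }
        return weights.length - 1; // guard against floating-point rounding
    }
}
```

With queues of sizes {0, 9}, for example, the empty queue's partition is chosen 10 times as often as the loaded one, so faster brokers naturally receive more records.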

In addition to the queue-size-based logic, partition.availability.timeout.ms can be set to a non-0 value, in which case partitions that have had batches ready to be sent for more than partition.availability.timeout.ms milliseconds are marked as not available for partitioning and are not chosen until the broker is able to accept the next ready batch from the partition.

Adaptive partition switching can be turned off by setting enable.adaptive.partitioning = false.

Note that these changes do not affect partitioning for keyed messages, only partitioning for unkeyed messages.

Compatibility, Deprecation, and Migration Plan

  • No compatibility, deprecation, or migration plan.  This fixes a problem with the current implementation.
  • Users can continue to use their own partitioners; if they want to implement a partitioner that switches partitions based on batch creation, they can use the onNewBatch(String topic, Cluster cluster) method to do so.

Rejected Alternatives

As an alternative to allowing Partitioner.partition to return -1 to indicate that the producer should execute the default partitioning logic, we considered providing a callback interface that could feed information from the producer back to the partitioner, as follows:

public interface Partitioner extends Configurable, Closeable {

    /**
     * Callbacks from the producer
     */
    interface Callbacks {
       /**
         * Get the record size in bytes.  The keyBytes and valueBytes alone may present a skewed view
         * of the number of bytes produced to the partition, so the callback also takes into account
         * the following:
         *  1. Headers
         *  2. Record overhead
         *  3. Batch overhead
         *  4. Compression
         *
         * @param partition The partition we need the record size for
         * @return The record size in bytes
         */
       int getRecordSize(int partition);

        /**
         * Calculate the partition number.  The producer keeps stats on partition load
         * and can use it as a signal for picking up the next partition.
         *
         * @return The partition number, or -1 if not implemented or not known
         */
       default int nextPartition() {
           return -1;
       }
    }

   // ... <skip> ...

   /**
     * Compute the partition for the given record.
     *
     * @param topic The topic name
     * @param key The key to partition on (or null if no key)
     * @param keyBytes The serialized key to partition on (or null if no key)
     * @param value The value to partition on or null
     * @param valueBytes The serialized value to partition on or null
     * @param callbacks The record size and partition callbacks (see {@link Partitioner.Callbacks})
     * @param cluster The current cluster metadata
     */
    default int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes,
                          Callbacks callbacks, Cluster cluster) {
        return partition(topic, key, keyBytes, value, valueBytes, cluster);
    }

   // ... <skip> ...
}

The getRecordSize callback method is needed to calculate the number of bytes in the record, because the information currently passed to the partitioner is not enough to calculate it accurately.  It doesn't have to be 100% precise, but it needs to avoid systemic errors that could lead to skews over the long run (it's ok if, say, the compression rate is a little off for one batch; it will converge over the long run), and it should roughly match the batch size (e.g. we have to apply compression estimates if compression is used, otherwise we'll systematically switch partitions before the batch is full).  See also the comments in the code snippet.

The nextPartition callback method effectively delegates partition switching logic back to producer.

This was an attempt to preserve the role separation between core producer logic and partitioner logic, but in practice it led to a complicated interface (hard to understand without digging into implementation specifics, and not really useful for other custom partitioners) and to logic that is conceptually tightly coupled (hard to understand the partitioner logic without understanding the producer logic and vice versa) yet physically split between the partitioner and the core producer.

After doing that, we realized that the desired encapsulation of the default partitioning logic within the default partitioner was broken anyway, so we might as well hoist the default partitioning logic into the producer and let the default partitioner just inform the producer that the default partitioning logic is desired.  Hoisting the logic into the producer is also slightly more efficient: the split logic required multiple lookups into various maps as it transitioned between producer and partitioner, whereas now (with returning -1) the lookup is made once and the logic runs in one go.

