Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Exactly one once semantics (EOS) provides transactional message processing guarantees. Producers can write to multiple partitions atomically so that either all writes succeed or all writes fail. This can be used in the context of stream processing frameworks, such as Kafka Streams, to ensure exactly once processing between topics.

...

To make the upgrade completely compatible with current EOS transaction semantics, we need to be able to distinguish clients who are making progress on the same input source but using different transactional id. It is possible to have two different types of clients within the same consumer group. Imagine a case as a Kafka Streams applications, where half of the instances are using old task producer API, while the other half of them use new consumer group API. This fencing could be done by leveraging transactional offset commit protocol which contains a consumer group id and topic partitions. Group coordinator could build a reverse mapping from topic partition to producer.id, and remember which producer has been contributing offset commits to each specific topic partition. In this way, when we upgrade to the new API, group coordinator will be actively checking this map upon receiving `AddOffsetsToTxnRequest`receiving `TxnOffsetCommit`. If the stored `producer.id` doesn't match the one defined in request while the epoch was also smaller, coordinator would send out a `ConcurrentProducerCommitException` to stored producer by aborting any ongoing transaction associated with it. This ensures us a smooth upgrade without worrying about old pending transactions.

Besides an active fencing mechanism, we also need to ensure 100% correctness during upgrade. This means no input data should be processed twice, even though we couldn't distinguish the client by transactional id anymore. The solution is to reject consume offset request by sending out PendingTransactionException to new client when there is pending transactional offset commits, so that new client shall start from a clean state instead of relying on transactional id fencing. Since it would be an unknown exception for old consumers, we will choose to send a COORDINATOR_LOAD_IN_PROGRESS exception to let it retry. When client receives PendingTransactionException, it will back-off and retry getting input offset until all the pending transaction offsets are cleared. This is a trade-off between availability and correctness, and in this case the worst case for availability is just hitting the waiting transaction timeout for one minute which should be trivial cost during upgrade only. 

Rejected Alternatives

  • We discussed whether we want to have a new API to proactively abort ongoing transactions 
  • Producer Pooling:
  • Producer support multiple transactional ids:
  • Tricky rebalance synchronization:

...