Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

To make the upgrade completely compatible with current EOS transaction semantics, we need to be able to distinguish clients who are making progress on the same input source but using different transactional id. It is possible to have two different types of clients within the same consumer group. Imagine a case as a Kafka Streams applications, where half of the instances are using old task producer API, while the other half of them use new consumer group API. This fencing could be done by leveraging transactional offset commit protocol which contains a consumer group id and topic partitions. Group coordinator could build a reverse mapping from topic partition to producer.id, and remember which producer has been contributing offset commits to each specific topic partition. In this way, when we upgrade to the new API, group coordinator will be actively checking this map upon receiving `TxnOffsetCommit`. If the stored `producer.id` doesn't match the one defined in request while , or the producer.id matches but the epoch was also is smaller, coordinator would send out a `ConcurrentProducerCommitException` to stored producer by aborting any ongoing transaction associated with itin the response to shutdown this conflict producer immediately. This ensures us a smooth upgrade without worrying about old pending transactions. Also this suggests that it is not recommended to have two types of clients running in the same application which makes the fencing much harder.

Besides an active fencing mechanism, we also need to ensure 100% correctness during upgrade. This means no input data should be processed twice, even though we couldn't distinguish the client by transactional id anymore. The solution is to reject consume offset request by sending out PendingTransactionException to new client when there is pending transactional offset commits, so that new client shall start from a clean state instead of relying on transactional id fencing. Since it would be an unknown exception for old consumers, we will choose to send a COORDINATOR_LOAD_IN_PROGRESS exception to let it retry. When client receives PendingTransactionException, it will back-off and retry getting input offset until all the pending transaction offsets are cleared. This is a trade-off between availability and correctness, and in this case the worst case for availability is just waiting transaction timeout for one minute which should be trivial one-time cost during upgrade only. 

...