...

We can improve this with the new APIs from KIP-360. When the coordinator times out a transaction, it can remember that fact and allow the existing producer to claim the bumped epoch and continue.

Public Interfaces

We will add a retriable error code so that the producer can distinguish fatal fencing from a soft retry after a server-side timeout:

Code Block
TRANSACTION_TIMED_OUT(90, "The last ongoing transaction timed out on the coordinator, should retry initialization with current epoch", TransactionTimedOutException::new);

To recognize clients that are capable of handling this new error, we need to bump the version of the following transaction-related APIs by 1:

  1. AddPartitionsToTxn to v2 
  2. AddOffsetsToTxn to v2
  3. EndTxn to v2
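
The version bump is what lets the coordinator decide which error to return. A minimal sketch of that decision follows, assuming that older clients keep receiving the existing fencing error. The class, enum, and method names here are hypothetical and are not taken from the Kafka code base.

```java
// Hypothetical sketch: none of these names come from the Kafka code base.
final class TimedOutTxnErrorSelector {
    enum Error { TRANSACTION_TIMED_OUT, PRODUCER_FENCED }

    // The proposal bumps AddPartitionsToTxn, AddOffsetsToTxn and EndTxn to v2.
    static final short MIN_VERSION_WITH_TIMEOUT_ERROR = 2;

    // Chooses the error for a request that carries the epoch of a timed-out transaction.
    static Error errorFor(short requestVersion) {
        if (requestVersion >= MIN_VERSION_WITH_TIMEOUT_ERROR) {
            // The client understands the new error and can retry initialization.
            return Error.TRANSACTION_TIMED_OUT;
        }
        // Assumption: older clients keep receiving the existing fencing error.
        return Error.PRODUCER_FENCED;
    }
}
```

Under this sketch, a v2 request is answered with the retriable error, while a v1 request is still treated as fenced.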

Proposed Changes

The workflow looks like this:

...

2. Any transactional requests from the old epoch result in a new TRANSACTION_TIMED_OUT error code, which is propagated to the application. This mechanism applies to all producer ↔ transaction coordinator APIs:

  • AddPartitionsToTxn
  • AddOffsetsToTxn
  • EndTxn

3. The producer recovers by sending InitProducerId with the current epoch. The coordinator returns the bumped epoch.
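
The exchange described above can be sketched with a small in-memory model. This is only an illustration under simplifying assumptions: the `Coordinator` class, its fields, and `sendWithRecovery` are hypothetical names, the epoch is bumped by exactly one, and the producer retries initialization itself instead of surfacing the error to the application first.

```java
// Hypothetical in-memory model; class and method names are illustrative only.
final class TimeoutRecoverySketch {
    enum Error { NONE, TRANSACTION_TIMED_OUT, PRODUCER_FENCED }

    static final class Coordinator {
        private short epoch = 0;
        private boolean lastBumpWasTimeout = false;

        // The coordinator times out the transaction, bumps the epoch, and remembers why.
        void timeOutTransaction() {
            epoch++;
            lastBumpWasTimeout = true;
        }

        // A transactional request carrying the pre-timeout epoch gets the retriable error.
        Error handleTxnRequest(short requestEpoch) {
            if (requestEpoch == epoch) return Error.NONE;
            if (lastBumpWasTimeout && requestEpoch == epoch - 1) return Error.TRANSACTION_TIMED_OUT;
            return Error.PRODUCER_FENCED;
        }

        // InitProducerId with the producer's current epoch returns the bumped epoch.
        short initProducerId(short currentEpoch) {
            if (lastBumpWasTimeout && currentEpoch == epoch - 1) {
                // Assumption: the bumped epoch can be claimed only once.
                lastBumpWasTimeout = false;
                return epoch;
            }
            throw new IllegalStateException("producer is fenced");
        }
    }

    // Producer side: retry initialization only for the retriable error.
    static short sendWithRecovery(Coordinator coordinator, short producerEpoch) {
        Error error = coordinator.handleTxnRequest(producerEpoch);
        if (error == Error.TRANSACTION_TIMED_OUT) {
            return coordinator.initProducerId(producerEpoch);
        }
        if (error == Error.PRODUCER_FENCED) {
            throw new IllegalStateException("fatal fencing");
        }
        return producerEpoch;
    }
}
```

In this model a producer at epoch 0 whose transaction timed out ends up at epoch 1 after `sendWithRecovery`, and any later request that still carries epoch 0 is fenced.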

One remaining issue is how to handle `ProducerFenced` errors returned by Produce requests. Partition leaders generally do not know whether a bumped epoch resulted from a timed-out transaction or from a fenced producer. New producers can therefore treat `ProducerFenced` as abortable when it comes from a Produce response. The producer then tries to abort the transaction to determine whether the error was caused by a timeout or by genuine fencing, since the end transaction call is also protected by the new transaction timeout retry logic.
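
One way to picture the abort-based detection is as a mapping from the result of the abort attempt to the producer's next step. The sketch below is an assumption about how that mapping might look; the enums and the `afterAbortAttempt` method are hypothetical and do not mirror Kafka's internal classes.

```java
// Hypothetical sketch; these enums do not mirror Kafka's internal classes.
final class ProduceFencedHandlingSketch {
    enum Error { NONE, TRANSACTION_TIMED_OUT, PRODUCER_FENCED }
    enum Outcome { ABORTED, REINITIALIZE_WITH_CURRENT_EPOCH, FATAL }

    // Called after a Produce response carried ProducerFenced and the producer
    // attempted to abort; endTxnError is the error returned by that abort call.
    static Outcome afterAbortAttempt(Error endTxnError) {
        switch (endTxnError) {
            case NONE:
                // Assumption: a successful abort lets the producer continue.
                return Outcome.ABORTED;
            case TRANSACTION_TIMED_OUT:
                // The epoch bump came from a timeout, so the soft retry path applies.
                return Outcome.REINITIALIZE_WITH_CURRENT_EPOCH;
            default:
                // Any other error is treated as genuine fencing.
                return Outcome.FATAL;
        }
    }
}
```

The point of the design is that the Produce path never has to decide on its own; the coordinator's answer to the abort attempt settles whether the producer was timed out or fenced.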

Compatibility, Deprecation, and Migration Plan

...