...
We can improve this with the new APIs from KIP-360. When the coordinator times out a transaction, it can remember that fact and allow the existing producer to claim the bumped epoch and continue.
Public Interfaces
We will add a retriable error code so that the producer can distinguish fatal fencing from a soft retry after a server-side timeout:
```java
TRANSACTION_TIMED_OUT(90, "The last ongoing transaction timed out on the coordinator, should retry initialization with current epoch", TransactionTimedOutException::new);
```
To recognize clients that are capable of handling this new error, we need to bump the version of the following transaction-related APIs by 1:
- AddPartitionsToTxn to v2
- AddOffsetsToTxn to v2
- EndTxn to v2
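The version gating above can be illustrated with a minimal sketch. It assumes that the coordinator falls back to the existing fencing error for request versions below v2; that fallback, and the names `TimeoutErrorGate`, `SketchError`, and `errorForTimedOutEpoch`, are hypothetical and not part of the proposal.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical error set for this sketch only.
enum SketchError { TRANSACTION_TIMED_OUT, PRODUCER_FENCED }

class TimeoutErrorGate {
    // First request version that understands TRANSACTION_TIMED_OUT, per the list above.
    private static final Map<String, Integer> MIN_VERSION = new HashMap<>();
    static {
        MIN_VERSION.put("AddPartitionsToTxn", 2);
        MIN_VERSION.put("AddOffsetsToTxn", 2);
        MIN_VERSION.put("EndTxn", 2);
    }

    // Picks the error returned to a request that carries the epoch of a timed-out transaction.
    // Assumption: older clients keep receiving the fencing error they already know.
    static SketchError errorForTimedOutEpoch(String api, int requestVersion) {
        Integer minVersion = MIN_VERSION.get(api);
        if (minVersion == null) {
            throw new IllegalArgumentException("Not a version-gated API: " + api);
        }
        return requestVersion >= minVersion
                ? SketchError.TRANSACTION_TIMED_OUT
                : SketchError.PRODUCER_FENCED;
    }
}
```

Under this assumption, a v2 `EndTxn` request sees the retriable error, while a v1 request is still treated as fenced.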
Proposed Changes
The workflow is as follows:
...
2. Any transactional requests from the old epoch result in a new TRANSACTION_TIMED_OUT error code, which is propagated to the application. This mechanism applies to all producer ↔ transaction coordinator APIs:
- AddPartitionsToTxn
- AddOffsetsToTxn
- EndTxn
3. The producer recovers by sending InitProducerId with the current epoch. The coordinator returns the bumped epoch.
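The steps above can be sketched as a small in-memory model of the coordinator's epoch bookkeeping. This is an illustration under stated assumptions, not the actual coordinator code: the class `SketchTxnCoordinator`, the enum `TxnReply`, and the use of `int` epochs with `-1` meaning "no remembered timeout" are all hypothetical, and only the timeout recovery path is modeled.

```java
import java.util.OptionalInt;

// Hypothetical reply set for this sketch only.
enum TxnReply { OK, TRANSACTION_TIMED_OUT, PRODUCER_FENCED }

class SketchTxnCoordinator {
    private int currentEpoch = 0;
    // Epoch whose transaction was timed out by the coordinator; -1 if none is remembered.
    private int timedOutEpoch = -1;

    // The coordinator times out the transaction, bumps the epoch, and remembers the old one.
    void timeOutTransaction() {
        timedOutEpoch = currentEpoch;
        currentEpoch++;
    }

    // Models AddPartitionsToTxn, AddOffsetsToTxn, and EndTxn: requests from the
    // timed-out epoch get the new error instead of a fatal fencing error.
    TxnReply handleTxnRequest(int requestEpoch) {
        if (requestEpoch == currentEpoch) {
            return TxnReply.OK;
        }
        if (requestEpoch == timedOutEpoch) {
            return TxnReply.TRANSACTION_TIMED_OUT;
        }
        return TxnReply.PRODUCER_FENCED;
    }

    // Models InitProducerId sent with the producer's (now stale) epoch.
    // Returns the bumped epoch if that epoch was timed out; empty means
    // the case is outside what this sketch models.
    OptionalInt initProducerId(int requestEpoch) {
        if (requestEpoch == timedOutEpoch) {
            timedOutEpoch = -1; // the producer has claimed the bumped epoch
            return OptionalInt.of(currentEpoch);
        }
        return OptionalInt.empty();
    }
}
```

In this model, a producer at epoch 0 whose transaction times out receives `TRANSACTION_TIMED_OUT` on its next transactional request, calls `initProducerId(0)`, and continues with epoch 1.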
One further issue is how to handle `ProducerFenced` errors from Produce requests. Partition leaders generally cannot tell whether a bumped epoch resulted from a timed-out transaction or from a fenced producer. New producers can therefore treat `ProducerFenced` as abortable when it comes from a Produce response. The producer then tries to abort the transaction, and the result tells it whether the cause was a timeout or genuine fencing, because the EndTxn call is also covered by the new transaction timeout retry logic.
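The producer-side decision described here might look roughly like the following sketch. The names `FencedProduceHandler`, `AbortReply`, and `NextStep` are hypothetical, and the abort call is passed in as a `Supplier` so that the example stays self-contained; it is not the real producer client code.

```java
import java.util.function.Supplier;

// Hypothetical result of the abort (EndTxn) call in this sketch.
enum AbortReply { OK, TRANSACTION_TIMED_OUT, PRODUCER_FENCED }

// Hypothetical next action for the producer.
enum NextStep { CONTINUE_AFTER_ABORT, RETRY_INIT_PRODUCER_ID, FAIL_FATALLY }

class FencedProduceHandler {
    // Called when a Produce response carries ProducerFenced. The error is treated
    // as abortable: the producer tries to abort, and the coordinator's answer
    // reveals whether the epoch bump came from a timeout or from real fencing.
    static NextStep onProducerFencedFromProduce(Supplier<AbortReply> abortTransaction) {
        AbortReply reply = abortTransaction.get();
        switch (reply) {
            case TRANSACTION_TIMED_OUT:
                // Soft failure: re-initialize with the current epoch to claim the bumped one.
                return NextStep.RETRY_INIT_PRODUCER_ID;
            case PRODUCER_FENCED:
                // Another producer instance owns the transactional id.
                return NextStep.FAIL_FATALLY;
            default:
                // The abort went through; the application can start a new transaction.
                return NextStep.CONTINUE_AFTER_ABORT;
        }
    }
}
```

Passing the abort call as a parameter keeps the classification logic separate from the network layer, which also makes each branch easy to exercise in isolation.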
Compatibility, Deprecation, and Migration Plan
...