...

This proposal provides the tooling support needed to detect, analyze, and safely recover from a hanging transaction. 

Detection

The first issue to address is how a user can find topic partitions that may have hanging transactions. Ideally there is a metric, so that alerts can be triggered proactively rather than waiting for users to complain. Today, Kafka exposes a partition-level "LastStableOffsetLag" metric, which indicates how far the LSO is behind the log end offset. When there is a hanging transaction, the LSO lag will tend to grow indefinitely. However, it is difficult to choose an alert threshold because the normal lag depends on the characteristics of the application (e.g. transaction duration and throughput).

...

We expect that users will alert on positive values of `PartitionsWithLateTransactionsCount`. They can then use `MaxActiveTransactionDuration` or one of the APIs described below to find the topic partition.

Analysis

Hanging transactions are the result of an inconsistent state between the replicas and the transaction coordinator. Today it is not easy to analyze a suspected hanging transaction because there is little visibility into either the producer state maintained by each replica or the transaction state of the coordinator. We propose to add three new APIs to address this gap:

...

  1. Use `DescribeProducers` to collect the set of ProducerIds which have transactions exceeding the max transaction timeout.
  2. Send `ListTransactions` to the available brokers to find the TransactionalIds associated with these ProducerIds.
  3. Finally, use `DescribeTransactions` to validate the transaction state and ensure it is safe to abort.
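
The first two steps above can be sketched as plain data filtering. This is only an illustration under stated assumptions: `ActiveProducer` and `TxnListing` are hypothetical stand-ins for entries of the `DescribeProducers` and `ListTransactions` responses, their field names are invented for the example, and no real client calls are made. The third step is left as a comment because the criteria for a safe abort are not spelled out in this section.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set


@dataclass(frozen=True)
class ActiveProducer:
    """Hypothetical stand-in for one entry of a DescribeProducers response."""
    producer_id: int
    txn_start_ms: Optional[int]  # None when the producer has no open transaction


@dataclass(frozen=True)
class TxnListing:
    """Hypothetical stand-in for one entry of a ListTransactions response."""
    transactional_id: str
    producer_id: int


def late_producer_ids(producers: Iterable[ActiveProducer],
                      now_ms: int,
                      max_txn_timeout_ms: int) -> Set[int]:
    """Step 1: ProducerIds whose open transaction exceeds the max timeout."""
    return {
        p.producer_id
        for p in producers
        if p.txn_start_ms is not None
        and now_ms - p.txn_start_ms > max_txn_timeout_ms
    }


def transactional_ids_for(listings_per_broker: Iterable[Iterable[TxnListing]],
                          producer_ids: Set[int]) -> Dict[int, str]:
    """Step 2: scan every broker's listing for the ProducerIds of interest."""
    found: Dict[int, str] = {}
    for listings in listings_per_broker:
        for listing in listings:
            if listing.producer_id in producer_ids:
                found[listing.producer_id] = listing.transactional_id
    return found


# Step 3 (DescribeTransactions) would then be issued for each TransactionalId
# found above to validate the coordinator's view before aborting.
```

A ProducerId that appears in step 1 but has no match in step 2 is worth noting separately, since it means no available broker reported a TransactionalId for it.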

Recovery

The remaining problem to solve is how to safely abort a hanging transaction. We propose to extend the `WriteTxnMarker` API so that it can be used by the Kafka AdminClient. Currently we use the coordinator epoch (which is the leader epoch of the associated `__transaction_state` partition) as a kind of concurrency control: partition leaders will not accept non-monotonic updates for a given `ProducerId`. We need to ensure that writes from the AdminClient do not interfere with this mechanism.
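
To make the existing concurrency control concrete, the following is a minimal model of the monotonic check, not the broker's actual code. `CoordinatorEpochGuard` and `try_accept` are hypothetical names, and the sketch assumes that a marker carrying the same epoch as the last accepted one is still allowed while a lower epoch is rejected.

```python
from typing import Dict


class CoordinatorEpochGuard:
    """Toy model of a partition leader's per-ProducerId epoch check."""

    def __init__(self) -> None:
        # Last coordinator epoch accepted for each ProducerId.
        self._last_epoch: Dict[int, int] = {}

    def try_accept(self, producer_id: int, coordinator_epoch: int) -> bool:
        """Accept a marker only if its epoch does not go backwards."""
        last = self._last_epoch.get(producer_id)
        if last is not None and coordinator_epoch < last:
            return False  # non-monotonic update: reject
        self._last_epoch[producer_id] = coordinator_epoch
        return True
```

Under this model, a marker written with an epoch higher than the real coordinator's would cause the coordinator's own later markers for that `ProducerId` to be rejected, which is the kind of interference the AdminClient path has to avoid.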

...