Status

Current state: Under Discussion

Discussion thread: here

JIRA: KAFKA-14112

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

The current Kafka architecture lacks a built-in mechanism to directly track the replication offset lag, i.e., the number of records still to be replicated. This information is important for monitoring and maintaining the health and performance of data replication. The replication offset lag, defined as the difference between the last end offset of the source partition (LEO) and the last replicated source offset (LRO), shows how far replication has progressed and helps identify bottlenecks in replication scenarios.

Public Interfaces


Proposed Changes

This proposal aims to enhance Kafka's monitoring capabilities by introducing a new metric that tracks the replication offset lag for a given topic-partition. The metric is calculated as the difference between the LEO, which is fetched during the source task's poll loop, and the LRO, which is stored in an in-memory "cache" and updated in the task's producer callback.

The proposed changes involve the following steps:

  1. LRO Tracking Mechanism:

  2. LEO Acquisition during Poll Loop:

  3. Expose Replication Offset Lag Metric:
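The three steps above can be sketched as a small bookkeeping class. This is only an illustrative sketch, not the proposed implementation: the class name ReplicationOffsetLagTracker and its methods are hypothetical, and it assumes the lag is reported exactly as defined in this KIP (LEO minus LRO) without any off-by-one adjustment for how the end offset is obtained.

```java
import java.util.Map;
import java.util.OptionalLong;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical helper (not part of Kafka) that keeps the two offsets
 * needed to compute the replication offset lag per source topic-partition.
 */
final class ReplicationOffsetLagTracker {
    // Last replicated source offset (LRO) per "topic-partition" key.
    private final Map<String, Long> lastReplicatedOffsets = new ConcurrentHashMap<>();
    // Last observed end offset of the source partition (LEO) per key.
    private final Map<String, Long> endOffsets = new ConcurrentHashMap<>();

    private static String key(String topic, int partition) {
        return topic + "-" + partition;
    }

    /** Step 1: would be called from the producer callback once a record is acknowledged. */
    void recordReplicated(String topic, int partition, long sourceOffset) {
        // Keep the maximum so a late callback cannot move the LRO backwards.
        lastReplicatedOffsets.merge(key(topic, partition), sourceOffset, Math::max);
    }

    /** Step 2: would be called from the poll loop with the freshly fetched end offset. */
    void recordEndOffset(String topic, int partition, long endOffset) {
        endOffsets.put(key(topic, partition), endOffset);
    }

    /**
     * Step 3: the value a metric could expose. Empty until both offsets are known.
     * Assumption: lag = LEO - LRO as defined in the Motivation, clamped at zero.
     */
    OptionalLong lag(String topic, int partition) {
        String k = key(topic, partition);
        Long leo = endOffsets.get(k);
        Long lro = lastReplicatedOffsets.get(k);
        if (leo == null || lro == null) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(Math.max(0L, leo - lro));
    }
}
```

Concurrent maps are used here because the poll loop and the producer callback may run on different threads. Whether the real implementation needs that, or a different data structure, is left open.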

Compatibility, Deprecation, and Migration Plan

Test Plan

The metric will be covered by unit tests in org.apache.kafka.connect.mirror.MirrorSourceTaskTest. I could not find any existing system tests where this change could be exercised. If there is a suitable place that I have missed, I am happy to add coverage there, or to implement a new system test.

Rejected Alternatives

It might be argued that the existing "replication-latency-ms" metric already provides satisfactory information about the replication lag, but I think reporting the exact number of records lagging behind is beneficial in addition to the time lag.

No other alternatives were considered.