Current state: Under Discussion
Discussion thread: here
JIRA: here
This KIP proposes adding new metrics to Kafka source connectors to track and report failures encountered during the record polling process. By providing these granular metrics, it becomes easier to monitor and diagnose issues related to record polling failures, enabling users to take appropriate actions for fault tolerance and troubleshooting.
Currently, there is no metric in Kafka Connect to track when a source connector fails to poll data from the source. This information would be useful for operators and developers to visualize, monitor, and alert on cases where the connector fails to poll records from the source.
Existing metrics such as kafka_producer_producer_metrics_record_error_total and kafka_connect_task_error_metrics_total_record_failures only cover failures when producing data to the Kafka cluster, not cases where the source task itself fails with a retryable exception or a ConnectException.
Polling from the source can fail due to unavailability of the source system or errors in the connector configuration. Currently, this cannot be monitored directly using metrics; operators instead have to rely on log diving, which is inconsistent with how other metrics are monitored.
The new metrics will be added at the granularity of a task and will be exposed via the JMX interface, similar to existing metrics:
Metrics group: Ref
kafka.connect:type=task-error-metrics,connector="{connector}",task="{task}"
Metric Names:
source-record-poll-error-total - The total number of times a source connector task has failed to poll records from the source system. This includes both retryable and non-retryable exceptions.
source-record-poll-error-rate - The rate of poll errors encountered per second. This is useful for expressing errors as a percentage of poll attempts.
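The relationship between the total and the rate can be sketched as follows. This is a minimal, self-contained illustration of the intended semantics, not Kafka's implementation (Connect computes rates through its windowed Sensor/Rate machinery); the class and method names here are invented for the example.

```java
// Sketch: a counter for source-record-poll-error-total and a derived
// errors-per-second value for source-record-poll-error-rate. Hypothetical
// names; Kafka Connect's real metrics use windowed rate calculation.
public class PollErrorMetricsSketch {
    private long errorTotal = 0;   // corresponds to source-record-poll-error-total
    private final long startMs;

    public PollErrorMetricsSketch(long nowMs) {
        this.startMs = nowMs;
    }

    public void recordPollError() {
        errorTotal++;
    }

    public long total() {
        return errorTotal;
    }

    // Corresponds to source-record-poll-error-rate: errors per second
    // over the observed window.
    public double rate(long nowMs) {
        long elapsedMs = Math.max(nowMs - startMs, 1);
        return errorTotal * 1000.0 / elapsedMs;
    }

    public static void main(String[] args) {
        PollErrorMetricsSketch m = new PollErrorMetricsSketch(0);
        for (int i = 0; i < 5; i++) {
            m.recordPollError();
        }
        System.out.println(m.total());       // 5
        System.out.println(m.rate(10_000));  // 0.5 errors/sec over 10s
    }
}
```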
This proposal suggests the following modifications to the Kafka source connector framework:
Register new metrics: The new metrics will be added to ConnectMetricsRegistry along with the other Connect metrics.
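A registration step along these lines could look as follows. In the actual code base this would use org.apache.kafka.common.MetricNameTemplate inside ConnectMetricsRegistry; here a minimal local Template record stands in so the sketch is self-contained, and the class and constant names are assumptions, not the final implementation.

```java
import java.util.List;

// Hypothetical sketch of registering the two new metric name templates
// under the existing task-error-metrics group. A local record replaces
// Kafka's MetricNameTemplate to keep the example standalone.
public class RegistrySketch {
    record Template(String name, String group, String description) {}

    static final String TASK_ERROR_GROUP = "task-error-metrics";

    static final Template SOURCE_RECORD_POLL_ERROR_TOTAL = new Template(
        "source-record-poll-error-total", TASK_ERROR_GROUP,
        "The total number of times a source task failed to poll records from the source system.");

    static final Template SOURCE_RECORD_POLL_ERROR_RATE = new Template(
        "source-record-poll-error-rate", TASK_ERROR_GROUP,
        "The rate of poll errors encountered per second.");

    static List<Template> allTemplates() {
        return List.of(SOURCE_RECORD_POLL_ERROR_TOTAL, SOURCE_RECORD_POLL_ERROR_RATE);
    }

    public static void main(String[] args) {
        allTemplates().forEach(t -> System.out.println(t.group() + "/" + t.name()));
    }
}
```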
Reporting metrics: The connector framework will expose these new metrics via JMX (Java Management Extensions) for monitoring and integration with existing monitoring systems. Operators can configure their monitoring tools to collect and analyze these metrics to gain insights into the health and performance of the source connectors.
Documentation: The Kafka documentation will be updated to include details on the new metrics and their usage. It will provide guidance on how to leverage these metrics for monitoring and troubleshooting purposes.
Describe in a few sentences how the KIP will be tested. We are mostly interested in system tests (since unit tests are specific to implementation details). How will we know that the implementation works as expected? How will we know nothing broke?