Status
Current state: Under Discussion
Discussion thread: here
JIRA: here
Abstract
This KIP proposes adding new metrics to Kafka source connectors to track and report failures encountered during the record polling process. By providing these granular metrics, it becomes easier to monitor and diagnose issues related to record polling failures, enabling users to take appropriate actions for fault tolerance and troubleshooting.
Motivation
Currently, there is no metric in Kafka Connect to track when a source connector fails to poll data from the source. This information would help operators and developers visualize, monitor, and alert on cases where the connector fails to poll records from the source.
...
Polling from the source can fail due to unavailability of the source system or errors in the connector configuration. Currently, these failures cannot be monitored directly through metrics; instead, operators have to rely on log diving, which is inconsistent with how other metrics are monitored.
Public Interfaces
The new metrics will be added at the granularity of a task and will be exposed via the JMX interface, similar to existing metrics:
...
source-record-poll-error-rate - The rate of errors encountered per second while polling the source system. This is useful for calculating the errors as a percentage of the overall poll rate.
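As an illustration, the new metric could be read over JMX in the same way as existing source task metrics. The snippet below is a minimal sketch, assuming the attribute is reported under the existing source-task-metrics MBean group; the connector name, task id, host, and JMX port are placeholders.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class PollErrorRateProbe {
    public static void main(String[] args) throws Exception {
        // Connect to a Connect worker with remote JMX enabled; host and port are placeholders.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector jmx = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection connection = jmx.getMBeanServerConnection();
            // Assumed MBean name: the new attribute is expected to live alongside the
            // existing source-task-metrics attributes; connector and task are placeholders.
            ObjectName taskMetrics = new ObjectName(
                    "kafka.connect:type=source-task-metrics,connector=\"my-connector\",task=\"0\"");
            Object errorRate = connection.getAttribute(taskMetrics, "source-record-poll-error-rate");
            System.out.println("source-record-poll-error-rate = " + errorRate);
        }
    }
}
```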
Proposed Changes
This proposal suggests the following modifications to the Kafka source connector framework:
- Record new metrics: The poll path in AbstractWorkerSourceTask will gain an exception-handling block that records these errors; the error metric will be incremented whenever an exception is encountered while polling the source (see the sketch after this list).
- Register new metrics: The recorded metrics will be added to ConnectMetricsRegistry along with the other Connect metrics.
- Reporting metrics: The connector framework will expose these new metrics via JMX (Java Management Extensions) for monitoring and integration with existing monitoring systems. Operators can configure their monitoring tools to collect and analyze these metrics to gain insights into the health and performance of the source connectors.
- Documentation: The Kafka documentation will be updated to include details on the new metrics and their usage. It will provide guidance on how to leverage these metrics for monitoring and troubleshooting purposes.
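To make the first two items above concrete, the sketch below shows one way a poll-error sensor could be defined with Kafka's common Metrics API and recorded when a poll attempt throws. This is a minimal, hypothetical sketch: the class name, method, and wiring are placeholders, not the actual AbstractWorkerSourceTask or ConnectMetricsRegistry implementation.

```java
import java.util.concurrent.Callable;

import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Rate;

// Illustrative only: the class, method, and wiring here are hypothetical and do not
// reflect the real AbstractWorkerSourceTask / ConnectMetricsRegistry code.
public class PollErrorMetricSketch {
    private final Metrics metrics = new Metrics();
    private final Sensor pollErrors;

    public PollErrorMetricSketch() {
        pollErrors = metrics.sensor("source-record-poll-errors");
        // A Rate stat backs the proposed source-record-poll-error-rate metric.
        pollErrors.add(
                metrics.metricName(
                        "source-record-poll-error-rate",
                        "source-task-metrics",
                        "The rate of errors encountered per second while polling the source system."),
                new Rate());
    }

    // Wraps a poll attempt: on failure the error sensor is recorded before the exception
    // is rethrown, so the failure still flows through the existing error-handling path.
    public <T> T pollWithErrorMetric(Callable<T> poll) throws Exception {
        try {
            return poll.call();
        } catch (Exception e) {
            pollErrors.record();
            throw e;
        }
    }
}
```

In the actual implementation, the metric template would be registered in ConnectMetricsRegistry and the sensor recorded from the task's poll loop, as described in the list above.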
Compatibility, Deprecation, and Migration Plan
- What impact (if any) will there be on existing users?
- This change only adds new metrics that users can utilize to improve their monitoring of source connectors. There is no impact on existing metrics.
- If we are changing behavior how will we phase out the older behavior?
- N/A
- If we need special migration tools, describe them here.
- N/A
- When will we remove the existing behavior?
- N/A
Test Plan
- Unit testing: Unit tests will be added to verify that the metric-recording logic is correct and that the reported metric values are accurate.
- The change will also be manually tested to ensure that it works with real source connectors.
Rejected Alternatives
- One alternative is to monitor the source system itself for failures, but this is not always possible because the source system may not expose such metrics. It is also valuable to see these errors from the Kafka Connect perspective rather than from an external system.
- Another alternative is to push the metric-publishing logic into individual connectors, but that would require every connector implementation to duplicate the same logic.