Status

Current state: Under Discussion

Discussion thread: here

JIRA: here

Abstract

This KIP proposes adding new metrics to Kafka source connectors to track and report failures encountered during the record polling process. By providing these granular metrics, it becomes easier to monitor and diagnose record polling failures, enabling users to take appropriate actions for fault tolerance and troubleshooting.

Motivation

Currently, Kafka source connectors do not provide built-in metrics specifically related to failures encountered during the record polling process. Monitoring the health and performance of connectors is crucial for effectively managing data pipelines. By introducing metrics for record poll failures, operators can gain visibility into potential bottlenecks, connectivity issues, or other problems hindering the consistent flow of data.

Public Interfaces

This KIP proposes that Kafka Connect track when a source connector fails to poll data from the source. This information would be useful to operators and developers to visualize, monitor, and alert when a connector fails to poll records from the source.

Existing metrics like kafka_producer_producer_metrics_record_error_total and kafka_connect_task_error_metrics_total_record_failures only cover failures when producing data to the Kafka cluster, not when the source task fails with a retriable exception or a ConnectException.

Polling from the source can fail due to unavailability of the source system or errors in the connector configuration. Currently, this cannot be monitored directly using metrics; instead, operators have to rely on log diving, which is inconsistent with how other metrics are monitored.

The new metrics will be added at the granularity of a task and will be exposed via the JMX interface, similar to existing metrics:

Metrics group:

kafka.connect:type=task-error-metrics,connector="{connector}",task="{task}"

Metric names: I propose adding two new metrics to Kafka Connect, "source-record-poll-error-total" and "source-record-poll-error-rate", that can be used to monitor failures during polling.

source-record-poll-error-total - The total number of times a source connector task failed to poll data from the source system. This includes both retriable and non-retriable exceptions.

source-record-poll-error-rate - The rate of errors encountered per second while polling the source system. This is useful for calculating the errors as a percentage.
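
For illustration, once added, these metrics could be read over JMX like any other Connect task metric. The following minimal Java sketch assumes a worker with remote JMX enabled on localhost:9999 and a connector named "my-connector" with task 0; the host, port, and connector name are illustrative, not part of this proposal:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class PollErrorMetricReader {
        public static void main(String[] args) throws Exception {
            // Connect to the worker's JMX endpoint (host and port are illustrative).
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbsc = connector.getMBeanServerConnection();
                // Task-level MBean name, matching the metric group proposed above.
                ObjectName name = new ObjectName(
                        "kafka.connect:type=task-error-metrics,connector=\"my-connector\",task=\"0\"");
                Object total = mbsc.getAttribute(name, "source-record-poll-error-total");
                Object rate = mbsc.getAttribute(name, "source-record-poll-error-rate");
                System.out.println("poll errors total=" + total + ", rate=" + rate);
            }
        }
    }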

Proposed Changes

This proposal suggests the following modifications to the Kafka source connector framework:

  1. Introduce new metrics: Two new metrics will be added to the source connector framework:
    • source-record-poll-error-total: This metric will track the total number of failures encountered during the record polling process.
    • source-record-poll-error-rate: This metric will track the rate of poll failures per second.
    The poll function in AbstractWorkerSourceTask will have an exception handling block that will record these errors. The error metric will be incremented whenever an exception is encountered (see the sketch after this list).
  2. Register new metrics: The new metrics will be added to ConnectMetricsRegistry along with the other Connect metrics.

  3. Reporting metrics: The connector framework will expose these new metrics via JMX (Java Management Extensions) for monitoring and integration with existing monitoring systems. Operators can configure their monitoring tools to collect and analyze these metrics to gain insights into the health and performance of source connectors.

  4. Documentation: The Kafka documentation will be updated to include details on the new metrics and their usage. It will provide guidance on how to leverage these metrics for monitoring and troubleshooting purposes.
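
To make item 1 concrete, here is a minimal Java sketch of how the error sensor could be created and recorded using Kafka's existing org.apache.kafka.common.metrics API. This is an illustration of the approach rather than the actual AbstractWorkerSourceTask change; the class name PollErrorMetricSketch and the wrapping method are hypothetical:

    import java.util.List;
    import org.apache.kafka.common.metrics.Metrics;
    import org.apache.kafka.common.metrics.Sensor;
    import org.apache.kafka.common.metrics.stats.CumulativeSum;
    import org.apache.kafka.common.metrics.stats.Rate;
    import org.apache.kafka.connect.errors.ConnectException;
    import org.apache.kafka.connect.source.SourceRecord;
    import org.apache.kafka.connect.source.SourceTask;

    public class PollErrorMetricSketch {
        private final Metrics metrics = new Metrics();
        private final Sensor pollErrorSensor;

        public PollErrorMetricSketch() {
            // One sensor feeds both proposed metrics: a cumulative total and a per-second rate.
            pollErrorSensor = metrics.sensor("source-record-poll-errors");
            pollErrorSensor.add(
                    metrics.metricName("source-record-poll-error-total", "task-error-metrics",
                            "Total number of failed polls of the source system"),
                    new CumulativeSum());
            pollErrorSensor.add(
                    metrics.metricName("source-record-poll-error-rate", "task-error-metrics",
                            "Failed polls per second"),
                    new Rate());
        }

        // Wraps task.poll() and records a failure before rethrowing, so both
        // retriable and non-retriable exceptions are counted.
        public List<SourceRecord> poll(SourceTask task) {
            try {
                return task.poll();
            } catch (RuntimeException e) {
                // Covers ConnectException and its RetriableException subclass.
                pollErrorSensor.record();
                throw e;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new ConnectException("Interrupted while polling", e);
            }
        }
    }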

Compatibility, Deprecation, and Migration Plan

  • What impact (if any) will there be on existing users?
    • This change only adds new metrics that users can utilize to improve their monitoring of source connectors. There is no impact to existing metrics.
  • If we are changing behavior how will we phase out the older behavior?
    • N/A
  • If we need special migration tools, describe them here.
    • N/A
  • When will we remove the existing behavior?
    • N/A

Test Plan

  1. Unit testing: Unit tests will be added to verify that the new metrics are registered correctly and that poll failures are recorded as expected (see the sketch below).
  2. Manual testing: The change will also be manually tested with real source connectors to ensure it works end to end.
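
As an illustration of the unit tests described in item 1, the sketch below exercises the metric recording logic directly against Kafka's Metrics class; the class and test names are hypothetical, and the real tests would run against the Connect runtime:

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import org.apache.kafka.common.MetricName;
    import org.apache.kafka.common.metrics.Metrics;
    import org.apache.kafka.common.metrics.Sensor;
    import org.apache.kafka.common.metrics.stats.CumulativeSum;
    import org.junit.jupiter.api.Test;

    public class PollErrorMetricTest {

        @Test
        public void recordsEachPollFailure() {
            Metrics metrics = new Metrics();
            MetricName total = metrics.metricName("source-record-poll-error-total",
                    "task-error-metrics", "Total number of failed polls of the source system");

            Sensor sensor = metrics.sensor("source-record-poll-errors");
            sensor.add(total, new CumulativeSum());

            // Simulate two poll failures being recorded.
            sensor.record();
            sensor.record();

            double observed = (double) metrics.metric(total).metricValue();
            assertEquals(2.0, observed, 0.0);
        }
    }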

Rejected Alternatives

  1. One alternative is to monitor the source system itself for failures, but this is not always possible, since the source system may not expose such metrics. It is also valuable to observe these errors from the Kafka Connect perspective rather than from an external system.
  2. Another alternative is to push the metric publishing logic into the connectors, but that is not a good idea because every connector implementation would have to implement it.