

Status

Current state: Under Discussion

Discussion thread: here

JIRA: here

Abstract

This KIP proposes adding metrics to Kafka Connect source connectors to track and report failures encountered while polling records from the source system. By providing granular metrics, it becomes easier to monitor and diagnose record poll failures, enabling users to take appropriate action for fault tolerance and troubleshooting.

Motivation

Currently, Kafka source connectors do not provide built-in metrics for failures encountered while polling records. Monitoring the health and performance of connectors is crucial for effectively managing data pipelines. By introducing metrics for record poll failures, operators gain visibility into potential bottlenecks, connectivity issues, or other problems hindering the consistent flow of data.

Public Interfaces


I propose adding two new metrics to Kafka Connect, "source-record-poll-error-total" and "source-record-poll-error-rate", that can be used to monitor failures during polling.

source-record-poll-error-total - The total number of times a source connector failed to poll data from the source. This includes both retriable and non-retriable exceptions.

source-record-poll-error-rate - The average per-second number of times a source connector failed to poll data from the source.
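
Both metrics would be exposed through the existing per-task JMX bean, kafka.connect:type=source-task-metrics,connector="{connector}",task="{task}". The following sketch is illustrative only: the attribute names are the ones proposed above, the connector name and task id are placeholders, and it assumes in-process access to the worker's platform MBean server (remote monitoring tools would connect through a JMXConnector instead).

    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    public class PollErrorMetricReader {
        public static void main(String[] args) throws Exception {
            // MBean server of the Connect worker JVM.
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();

            // Existing Connect source-task metric bean; "my-connector" and "0"
            // are placeholders for the connector name and task id.
            ObjectName taskMetrics = new ObjectName(
                    "kafka.connect:type=source-task-metrics,connector=\"my-connector\",task=\"0\"");

            // The attribute names are the metrics proposed by this KIP.
            Double errorTotal = (Double) server.getAttribute(taskMetrics, "source-record-poll-error-total");
            Double errorRate = (Double) server.getAttribute(taskMetrics, "source-record-poll-error-rate");

            System.out.printf("poll errors: total=%.0f, rate=%.3f/sec%n", errorTotal, errorRate);
        }
    }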

Proposed Changes

This proposal suggests the following modifications to the Kafka source connector framework:

  1. Introduce new metrics: Two new metrics will be added to the source connector framework:

    • source-record-poll-error-total: This metric will track the total number of failures encountered during the record polling process.
    • source-record-poll-error-rate: This metric will report the average per-second rate of poll failures, consistent with the per-second convention of the other rate metrics in the source task metric group.
  2. Increment metrics on failure: When a source connector fails to poll records due to any error or exception, the source-record-poll-error-total metric will be incremented by one and the source-record-poll-error-rate metric updated accordingly (see the sketch after this list).

  3. Reporting metrics: The connector framework will expose these new metrics via JMX (Java Management Extensions) for monitoring and integration with existing monitoring systems. Operators can configure their monitoring tools to collect and analyze these metrics to gain insights into the health and performance of the source connectors.

  4. Backward compatibility: These changes are backward compatible; existing source connectors can be upgraded to the new version without any modifications to their configuration or behavior.

  5. Documentation: The Kafka documentation will be updated to include details on the new metrics and their usage. It will provide guidance on how to leverage these metrics for monitoring and troubleshooting purposes.
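
As a point of reference for items 1 and 2 above, here is a minimal sketch of how registration and incrementing could be implemented with Kafka's common metrics library (org.apache.kafka.common.metrics). The class and sensor names are hypothetical; in practice this logic would likely live alongside the worker's existing source task metrics group.

    import org.apache.kafka.common.MetricName;
    import org.apache.kafka.common.metrics.Metrics;
    import org.apache.kafka.common.metrics.Sensor;
    import org.apache.kafka.common.metrics.stats.CumulativeSum;
    import org.apache.kafka.common.metrics.stats.Rate;

    public class PollErrorMetrics {
        private final Sensor pollErrorSensor;

        public PollErrorMetrics(Metrics metrics) {
            // Hypothetical sensor name; "source-task-metrics" matches the group
            // used by the existing source task metrics.
            pollErrorSensor = metrics.sensor("source-record-poll-errors");

            MetricName total = metrics.metricName(
                    "source-record-poll-error-total", "source-task-metrics",
                    "Total number of failed poll() calls by this task.");
            MetricName rate = metrics.metricName(
                    "source-record-poll-error-rate", "source-task-metrics",
                    "Average per-second number of failed poll() calls by this task.");

            pollErrorSensor.add(total, new CumulativeSum()); // monotonically increasing count
            pollErrorSensor.add(rate, new Rate());           // windowed per-second rate
        }

        // Called from the worker's poll loop whenever SourceTask.poll() throws,
        // covering both retriable and non-retriable exceptions.
        public void recordPollError() {
            pollErrorSensor.record();
        }
    }

The worker would wrap SourceTask.poll() in a try/catch, call recordPollError() on any exception, and then rethrow or retry exactly as it does today, so existing failure semantics are unchanged.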

Compatibility, Deprecation, and Migration Plan

  • Impact on existing users: none. The new metrics are purely additive; existing source connectors pick them up on upgrade without any configuration or code changes.
  • No existing behavior is changed or phased out, so no special migration tools are required and there is nothing to remove.

Test Plan

Describe in a few sentences how the KIP will be tested. We are mostly interested in system tests (since unit tests are specific to implementation details). How will we know that the implementation works as expected? How will we know nothing broke?

Rejected Alternatives

If there are alternative ways of accomplishing the same thing, what were they? The purpose of this section is to motivate why the design is the way it is and not some other way.
