...

Discussion thread: here

JIRA: here

The proposal discussed in this KIP is implemented in this pull request.

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

...

There are several points in the Connect framework where failures may occur. Any failure to deserialize, convert, process, or read/write a record in Kafka Connect can cause a task to fail. Although some errors can be addressed with single message transforms (SMTs) or custom converters that check for malformed data, in general it is difficult to guarantee correct and valid data or to tell Connect to skip problematic records.

...

This proposal aims to change the Connect framework so that it can automatically deal with errors while processing records in a connector. By default, Connect will fail immediately when an error occurs, which matches the previous Connect behavior; therefore, all new behaviors must be explicitly enabled.

Public Interfaces

Several new behaviors for handling and reporting errors are introduced, and all must be configured in the individual connector configurations.

...

Configuration Name | Description | Default Value | Domain
errors.retry.limit | The maximum number of retries before failing. | 0 | [-1, 0, 1, ..., Long.MAX_VALUE], where -1 means infinite retries.
errors.retry.delay.max.ms | The maximum duration between two consecutive retries (in milliseconds). | 60000 | [1, ..., Long.MAX_VALUE]
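For illustration, these options would sit alongside the rest of an individual connector's configuration. The connector class, name, and topic below are placeholders, not part of this proposal:

```properties
name=local-file-sink
connector.class=FileStreamSink
tasks.max=1
topics=test-topic
# Retry each failed operation up to 10 times,
# waiting at most 30 seconds between consecutive retries.
errors.retry.limit=10
errors.retry.delay.max.ms=30000
```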

...

Metric/Attribute Name | Description | Implemented
total-record-failures | Total number of failures seen by this task. | 2.0.0
total-records-skipped | Total number of records skipped by this task. | 2.0.0
total-retries | Total number of retries made by this task. | 2.0.0
total-failures-logged | The number of messages that were logged into either the dead letter queue or with Log4j. | 2.0.0
deadletterqueue-records-produced | Number of records successfully produced to the dead letter queue. | 2.0.0
deadletterqueue-produce-failures | Number of records that failed to produce correctly to the dead letter queue. | 2.0.0
last-failure-timestamp | The timestamp when the last failure occurred in this task. | 2.0.0
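Task-level metrics like these are exposed over JMX. As a sketch, the probe below queries the platform MBean server for per-task error metrics; the MBean pattern `kafka.connect:type=task-error-metrics` and the attribute name are assumptions here and should be verified against a running worker's JMX tree:

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class ErrorMetricsProbe {
    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Hypothetical MBean pattern; the actual type/group name may differ.
        ObjectName pattern = new ObjectName(
                "kafka.connect:type=task-error-metrics,connector=*,task=*");
        Set<ObjectName> names = server.queryNames(pattern, null);
        if (names.isEmpty()) {
            System.out.println("no task-error-metrics MBeans found "
                    + "(is a Connect worker running in this JVM?)");
        }
        for (ObjectName name : names) {
            System.out.println(name + " total-record-failures="
                    + server.getAttribute(name, "total-record-failures"));
        }
    }
}
```

Run inside (or attached to) a worker JVM; standalone, it simply reports that no matching MBeans exist.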

...

There are two behavioral changes introduced by this KIP. First, a failure in any stage will be reattempted, thereby “blocking” the connector. This helps in situations where time is needed to manually update an external system, such as correcting a schema in the Schema Registry. More complex causes, such as required code changes or corruption in the data that cannot be fixed externally, will require the worker to be stopped, the data to be fixed, and the connector to be restarted. In the case of data corruption, the topic might need to be cleaned up too. If the retry limit for a failure is reached, the tolerance limit is used to determine whether the record should be skipped or the task should be killed.
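The retry behavior described above can be sketched as follows. This is an illustrative model, not the framework's internal API: a failed operation is reattempted up to the retry limit (-1 meaning forever), with an exponential backoff capped at the configured maximum delay, after which the tolerance limit would decide between skipping the record and failing the task:

```java
import java.util.concurrent.Callable;

public class RetrySketch {
    // Exponential backoff capped at maxDelayMs; attempt is 0-based.
    static long delayForAttempt(int attempt, long initialDelayMs, long maxDelayMs) {
        long delay = initialDelayMs << Math.min(attempt, 62);
        return (delay <= 0 || delay > maxDelayMs) ? maxDelayMs : delay;
    }

    // Retries op until it succeeds or retryLimit retries are exhausted;
    // retryLimit == -1 means retry indefinitely.
    static <T> T withRetries(Callable<T> op, long retryLimit,
                             long initialDelayMs, long maxDelayMs) throws Exception {
        int attempt = 0;
        while (true) {
            try {
                return op.call();
            } catch (Exception e) {
                if (retryLimit != -1 && attempt >= retryLimit) {
                    // Retries exhausted: here the tolerance limit would
                    // decide between skipping the record and killing the task.
                    throw e;
                }
                Thread.sleep(delayForAttempt(attempt, initialDelayMs, maxDelayMs));
                attempt++;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String result = withRetries(() -> {
            if (++calls[0] < 3) throw new RuntimeException("transient failure");
            return "ok";
        }, 5, 10, 60000);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Note that the blocking happens in the sleeping retry loop itself: the task makes no progress on subsequent records until the current one succeeds or exhausts its retries.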

The second behavioral change is in how these failures are logged. Currently, only the exception that kills the task is written to the application logs. With the additions presented in this KIP, the following context will be logged in JSON format:

...