...

Discussion thread: here

JIRA: here

The proposal discussed in this KIP is implemented in this pull request.

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

...

There are several points in the Connect framework where failures may occur. Any failure to deserialize, convert, process, or read/write a record in Kafka Connect can cause a task to fail. Although some errors can be addressed with single message transforms (SMTs) or custom converters that check for malformed data, in general it is difficult to guarantee correct and valid data or to tell Connect to skip problematic records.

...

This proposal aims to change the Connect framework so that it can automatically deal with errors while processing records in a connector. By default, Connect will fail immediately when an error occurs, which matches the previous Connect behavior; therefore, all new behaviors must be explicitly enabled.

Public Interfaces

Several new behaviors for handling and reporting errors are introduced, and all must be configured in the individual connector configurations.

...

Configuration Name | Description | Default Value | Domain
errors.retry.limit | The maximum number of retries before failing. | 0 | [-1, 0, 1, ..., Long.MAX_VALUE], where -1 means infinite retries.
errors.retry.delay.max.ms | The maximum duration between two consecutive retries (in milliseconds). | 60000 | [1, ..., Long.MAX_VALUE]
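For illustration, these options would sit alongside the rest of an individual connector's configuration. The connector class, name, and topic below are placeholders, not part of this proposal:

```properties
name=local-file-sink
connector.class=FileStreamSink
tasks.max=1
topics=test-topic
# Retry each failed operation up to 10 times,
# waiting at most 30 seconds between consecutive retries.
errors.retry.limit=10
errors.retry.delay.max.ms=30000
```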

...

Metric/Attribute Name | Description | Implemented
total-record-failures | Total number of failures seen by this task. | 2.0.0
total-records-skipped | Total number of records skipped by this task. | 2.0.0
total-retries | Total number of retries made by this task. | 2.0.0
total-failures-logged | The number of messages that were logged into either the dead letter queue or with Log4j. | 2.0.0
deadletterqueue-records-produced | Number of records successfully produced to the dead letter queue. | 2.0.0
deadletterqueue-produce-failures | Number of records that failed to produce correctly to the dead letter queue. | 2.0.0
last-failure-timestamp | The timestamp when the last failure occurred in this task. | 2.0.0
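Task-level metrics like these are exposed over JMX. As a sketch, the probe below queries the platform MBean server for per-task error metrics; the MBean pattern `kafka.connect:type=task-error-metrics` and the attribute name are assumptions here and should be verified against a running worker's JMX tree:

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class ErrorMetricsProbe {
    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Hypothetical MBean pattern; the actual type/group name may differ.
        ObjectName pattern = new ObjectName(
                "kafka.connect:type=task-error-metrics,connector=*,task=*");
        Set<ObjectName> names = server.queryNames(pattern, null);
        if (names.isEmpty()) {
            System.out.println("no task-error-metrics MBeans found "
                    + "(is a Connect worker running in this JVM?)");
        }
        for (ObjectName name : names) {
            System.out.println(name + " total-record-failures="
                    + server.getAttribute(name, "total-record-failures"));
        }
    }
}
```

Run inside (or attached to) a worker JVM; standalone, it simply reports that no matching MBeans exist.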

...

There are two behavioral changes introduced by this KIP. First, a failure in any stage will be reattempted, thereby “blocking” the connector. This helps in situations where time is needed to manually update an external system, such as correcting a schema in the Schema Registry. More complex causes, such as required code changes or corruption in the data that cannot be fixed externally, will require the worker to be stopped, the data to be fixed, and the connector to be restarted. In the case of data corruption, the topic might need to be cleaned up too. If the retry limit for a failure is reached, the tolerance limit is used to determine whether the record should be skipped or the task should be killed.
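The retry behavior described above can be sketched as follows. This is an illustrative model, not the framework's internal API: a failed operation is reattempted up to the retry limit (-1 meaning forever), with an exponential backoff capped at the configured maximum delay, after which the tolerance limit would decide between skipping the record and failing the task:

```java
import java.util.concurrent.Callable;

public class RetrySketch {
    // Exponential backoff capped at maxDelayMs; attempt is 0-based.
    static long delayForAttempt(int attempt, long initialDelayMs, long maxDelayMs) {
        long delay = initialDelayMs << Math.min(attempt, 62);
        return (delay <= 0 || delay > maxDelayMs) ? maxDelayMs : delay;
    }

    // Retries op until it succeeds or retryLimit retries are exhausted;
    // retryLimit == -1 means retry indefinitely.
    static <T> T withRetries(Callable<T> op, long retryLimit,
                             long initialDelayMs, long maxDelayMs) throws Exception {
        int attempt = 0;
        while (true) {
            try {
                return op.call();
            } catch (Exception e) {
                if (retryLimit != -1 && attempt >= retryLimit) {
                    // Retries exhausted: here the tolerance limit would
                    // decide between skipping the record and killing the task.
                    throw e;
                }
                Thread.sleep(delayForAttempt(attempt, initialDelayMs, maxDelayMs));
                attempt++;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String result = withRetries(() -> {
            if (++calls[0] < 3) throw new RuntimeException("transient failure");
            return "ok";
        }, 5, 10, 60000);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Note that the blocking happens in the sleeping retry loop itself: the task makes no progress on subsequent records until the current one succeeds or exhausts its retries.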

The second behavioral change is in how these failures are logged. Currently, only the exception that kills the task is written to the application logs. With the additions presented in this KIP, the following context will be logged in JSON format:

...