Status
Current state: Under Discussion
Discussion thread: here
JIRA:
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
Presently, Kafka Streams provides users with two options for handling a DeserializationException via the DeserializationExceptionHandler interface:

- FAIL - throw an Exception that causes the stream thread to fail. This will either cause the whole application instance to exit, or the stream thread will be replaced and restarted. Either way, the failed Task will end up being resumed, either by the current instance or after being rebalanced to another, causing a cascading failure until a user intervenes to address the problem.
- CONTINUE - discard the record and continue processing with the next record. This can cause data loss if the record triggering the DeserializationException should be considered a valid record. This can happen if an upstream producer changes the record schema in a way that is incompatible with the streams application, or if there is a bug in the Deserializer (for example, failing to handle a valid edge-case).

The user can currently choose only between data loss and a cascading failure that usually causes all processing to slowly grind to a halt.
Public Interfaces
Modified Interfaces
/* suspend processing the current Task, but continue other Tasks */
SUSPEND(2, "SUSPEND");
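For context, a sketch of how the new value would sit alongside the existing entries of DeserializationExceptionHandler.DeserializationHandlerResponse. The CONTINUE and FAIL members and the name/id fields already exist; only SUSPEND is new, and the exact layout here is illustrative:

public enum DeserializationHandlerResponse {
    /* continue with processing */
    CONTINUE(0, "CONTINUE"),
    /* fail the processing and stop */
    FAIL(1, "FAIL"),
    /* suspend processing the current Task, but continue other Tasks */
    SUSPEND(2, "SUSPEND");

    /* a human-readable description of the response */
    public final String name;

    /* the permanent id of the response */
    public final int id;

    DeserializationHandlerResponse(final int id, final String name) {
        this.id = id;
        this.name = name;
    }
}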
New Interfaces
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.errors.DeserializationExceptionHandler;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LogAndSuspendExceptionHandler implements DeserializationExceptionHandler {
    private static final Logger log = LoggerFactory.getLogger(LogAndSuspendExceptionHandler.class);

    @Override
    public DeserializationHandlerResponse handle(final ProcessorContext context,
                                                 final ConsumerRecord<byte[], byte[]> record,
                                                 final Exception exception) {
        log.error("Exception caught during Deserialization, " +
                  "taskId: {}, topic: {}, partition: {}, offset: {}",
                  context.taskId(), record.topic(), record.partition(), record.offset(),
                  exception);
        return DeserializationHandlerResponse.SUSPEND;
    }

    @Override
    public void configure(final Map<String, ?> configs) {
        // ignore
    }
}
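The handler would be registered like any existing deserialization exception handler, via the StreamsConfig key introduced by KIP-161. A minimal configuration sketch (the application id and bootstrap servers are placeholders):

import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;

public class SuspendHandlerConfigExample {
    public static Properties streamsConfig() {
        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");    // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        // Register the new handler through the existing config key.
        props.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG,
                  LogAndSuspendExceptionHandler.class);
        return props;
    }
}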
/**
 * Resume processing the {@link Task} specified by its {@link TaskId id}.
 * <p>
 * This method resumes a {@link Task} that was {@link Task.State#SUSPENDED} due to a
 * {@link DeserializationExceptionHandler.DeserializationHandlerResponse#SUSPEND} error.
 * <p>
 * If the given {@link Task} is not {@link Task.State#SUSPENDED}, no action will be taken and
 * {@code false} will be returned.
 * <p>
 * Otherwise, this method will attempt to transition the {@link Task} to {@link Task.State#RUNNING},
 * and return {@code true}, if successful.
 *
 * @param task the id of the {@link Task} to resume.
 * @return {@code true} if the {@link Task} was {@link Task.State#SUSPENDED} and was successfully
 *         transitioned to {@link Task.State#RUNNING}, otherwise {@code false}.
 */
public boolean resume(final TaskId task);
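The KIP text does not state which public class hosts this method. Assuming it is added to KafkaStreams, a caller that has fixed the underlying problem might use it as in the sketch below; the TaskId values shown are hypothetical:

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.processor.TaskId;

public class ResumeExample {
    // NOTE: hosting resume() on KafkaStreams is an assumption; the KIP does
    // not specify which public class exposes the method.
    public static void resumeAfterFix(final KafkaStreams kafkaStreams) {
        final TaskId suspendedTask = new TaskId(0, 1); // hypothetical: sub-topology 0, partition 1
        final boolean resumed = kafkaStreams.resume(suspendedTask);
        if (!resumed) {
            // The Task was not SUSPENDED, or could not transition back to RUNNING.
        }
    }
}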
Proposed Changes
DeserializationHandlerResponse.SUSPEND suspends the Task that has encountered the error, but continues to process other Tasks normally. When a Task is SUSPENDED, it is still assigned as an active Task to the instance, but it will not consume or process any records.
Users could observe these errors through their usual observability solutions, by looking for:

- The ERROR log message accompanying a DeserializationException.
- The consumer failing to consume the subset of partitions that are affected by the error; usually via a "consumer lag" metric (see the example below).
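For example, lag on the affected partitions can be inspected with the standard consumer group tooling (the broker address is a placeholder; a Streams application's group id is its application.id):

kafka-consumer-groups --bootstrap-server <broker> --group <application-id> --describe

The LAG column for the partitions of a SUSPENDED Task would grow steadily, while the lag of healthy partitions stays bounded.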
Once detected, users may intervene by, for example:

- If the record should be valid: fixing the bug in the application that causes the record to fail to deserialize. Once the bug has been fixed, the user would shut down the application, deploy a fixed build and restart it. Once restarted, any SUSPENDED Tasks would automatically start running again from the record that originally produced the error.
- If the record is invalid (e.g. corrupt data): advancing the consumer offsets past it, either via an external tool, or by a user-supplied application API (see the sketch after this list). Once the offsets have been advanced, the user could either restart their application instance, or provide an API that resumes the SUSPENDED Task, if they wish to minimize downtime.
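As an illustration of such a user-supplied API, the sketch below exposes a minimal HTTP endpoint that resumes a Task on demand. Everything here is hypothetical: the endpoint path, the use of the JDK's built-in com.sun.net.httpserver, and the assumption that resume() lives on KafkaStreams.

import java.io.OutputStream;
import java.net.InetSocketAddress;

import com.sun.net.httpserver.HttpServer;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.processor.TaskId;

public class ResumeEndpoint {
    // Hypothetical operational endpoint: POST /resume?task=0_1
    public static void start(final KafkaStreams kafkaStreams) throws Exception {
        final HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/resume", exchange -> {
            // e.g. "task=0_1" -> TaskId(subTopology = 0, partition = 1)
            final String query = exchange.getRequestURI().getQuery();
            final TaskId taskId = TaskId.parse(query.substring("task=".length()));
            final boolean resumed = kafkaStreams.resume(taskId); // assumed host class
            final byte[] body = ("resumed: " + resumed).getBytes();
            exchange.sendResponseHeaders(resumed ? 200 : 409, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}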
Implementation details
When a DeserializationExceptionHandler returns SUSPEND, the current Task will be suspended via InternalProcessorContext. However, if the Task is a TaskType.GLOBAL Task, the response will automatically be upgraded to FAIL, as the global Task cannot be SUSPENDED: suspending the global Task without also suspending all other Tasks on the instance would cause them to work with stale data if they read from or join against any global tables.
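A minimal sketch of the response-upgrade logic described above; the method name and placement are illustrative, not the actual internal implementation (TaskType here refers to the internal Task.TaskType):

// Illustrative only: the real change would live in the Kafka Streams internals.
static DeserializationHandlerResponse effectiveResponse(final TaskType taskType,
                                                        final DeserializationHandlerResponse response) {
    // A global Task cannot be suspended in isolation: other Tasks on the instance
    // would then read stale data from global tables. Fail instead.
    if (taskType == TaskType.GLOBAL && response == DeserializationHandlerResponse.SUSPEND) {
        return DeserializationHandlerResponse.FAIL;
    }
    return response;
}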
When the Task is SUSPENDED, we will ensure that the offsets of the last successfully processed record(s) on that Task are committed. This ensures that:

- If the user fixes a bug and restarts the application, it will continue from the record that failed, and will not re-process a previously successfully processed record.
- If the user wants to advance the consumer offsets past the "bad" record, they can simply use:

kafka-consumer-groups --bootstrap-server <broker> --group <application-id> --reset-offsets --topic <topic>:<partition> --shift-by 1 --execute

to skip the bad message before resuming the Task.
Compatibility, Deprecation, and Migration Plan
- Since this is new functionality, it should not modify the behaviour of the system unless the new SUSPEND response is used in a DeserializationExceptionHandler.
- No APIs are deprecated or need migration.
Test Plan
- An integration test will verify that, when suspending a failed Task, the consumer offsets of the last successfully processed record(s) are committed.
- A unit test suite will verify that the LogAndSuspendExceptionHandler properly suspends the StreamTask.
- A unit test will also verify that global Tasks are never SUSPENDED.
Rejected Alternatives
No alternatives have been considered.