Status

Current state: Discarded in favour of KIP-112 and KIP-113

Discussion thread: here [Change the link from the KIP proposal email archive to your own email thread]

JIRA: here [Change the link from KAFKA-1 to your own ticket]

...

2. On an exception, the exception handler component does the following:

    2.1. Detect the directory that is no longer available and take it offline: all operations on that directory are stopped and no new logs are created there

    2.2. Detect partitions that were lost

    2.3. Notify controller that specific partitions need to be restarted
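Steps 2.1 to 2.3 can be illustrated with a small Java sketch. It is not taken from the Kafka code base: the class `IoErrorHandlerSketch`, the method `onIoError`, and the controller callback are hypothetical names, and partitions are represented as plain strings. It assumes the handler is told which file the failed I/O operation touched.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Consumer;

/** Hypothetical handler; it only mirrors steps 2.1 to 2.3 and is not Kafka's implementation. */
final class IoErrorHandlerSketch {
    // Log directory -> partitions stored in it (assumed to be known to the broker).
    private final Map<Path, Set<String>> partitionsByDir;
    private final Set<Path> offlineDirs = new HashSet<>();
    // Stand-in for "notify the controller"; the real transport is out of scope here.
    private final Consumer<Set<String>> controllerNotifier;

    IoErrorHandlerSketch(Map<Path, Set<String>> partitionsByDir,
                         Consumer<Set<String>> controllerNotifier) {
        this.partitionsByDir = new HashMap<>(partitionsByDir);
        this.controllerNotifier = controllerNotifier;
    }

    /** Called when an I/O operation on failedFile threw the given exception. */
    synchronized void onIoError(Path failedFile, IOException cause) {
        // 2.1: find the log directory that contains the failed file and mark it offline.
        Path dir = null;
        for (Path candidate : partitionsByDir.keySet()) {
            if (failedFile.startsWith(candidate)) {
                dir = candidate;
            }
        }
        if (dir == null || !offlineDirs.add(dir)) {
            return; // unknown directory, or it was already taken offline
        }
        // 2.2: every partition stored in that directory is considered lost.
        Set<String> lost = new TreeSet<>(partitionsByDir.get(dir));
        // 2.3: ask the controller to restart the lost partitions.
        controllerNotifier.accept(lost);
    }

    synchronized boolean isOffline(Path dir) {
        return offlineDirs.contains(dir);
    }
}
```

In this sketch a second error on a directory that is already offline is ignored, so the controller is notified once per failed directory.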

...

1. Identify the logs and partitions that were stored in the unavailable directory

2. Abort and pause all future cleaning for those partitions

3. Update the recovery checkpoints list to remove the respective directory

4. Remove these logs from the logs pool and update logDirs (so that the scheduled jobs kafka-log-retention, kafka-log-flusher and kafka-recovery-point-checkpoint are not executed on logs that were taken offline)

Note: currently the scheduled jobs do not run under a lock and the logs pool is not protected by a lock, so these changes make data races possible. How running the jobs under a lock would affect performance still needs to be evaluated.
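The steps above, together with the locking concern from the note, can be sketched in Java as follows. This is a simplified model under stated assumptions, not Kafka's log manager: `LogPoolSketch` and all of its methods are hypothetical, partitions and directories are plain strings, and a single `ReentrantLock` guards both the offline transition and the view that scheduled jobs read.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical logs pool; illustrates the four steps, not Kafka's actual code. */
final class LogPoolSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Map<String, String> dirByPartition = new TreeMap<>(); // partition -> directory
    private final Set<String> logDirs = new HashSet<>();
    private final Set<String> cleaningPaused = new HashSet<>();
    private final Set<String> checkpointedDirs = new HashSet<>();

    void addLog(String partition, String dir) {
        lock.lock();
        try {
            dirByPartition.put(partition, dir);
            logDirs.add(dir);
            checkpointedDirs.add(dir);
        } finally {
            lock.unlock();
        }
    }

    /** Takes dir offline and returns the partitions that were stored in it. */
    List<String> takeDirectoryOffline(String dir) {
        lock.lock();
        try {
            // Step 1: identify the partitions stored in the unavailable directory.
            List<String> lost = new ArrayList<>();
            for (Map.Entry<String, String> e : dirByPartition.entrySet()) {
                if (e.getValue().equals(dir)) {
                    lost.add(e.getKey());
                }
            }
            // Step 2: pause cleaning for those partitions.
            cleaningPaused.addAll(lost);
            // Step 3: stop tracking recovery checkpoints for the directory.
            checkpointedDirs.remove(dir);
            // Step 4: remove the logs from the pool and update the directory list.
            dirByPartition.keySet().removeAll(lost);
            logDirs.remove(dir);
            return lost;
        } finally {
            lock.unlock();
        }
    }

    /** Snapshot used by scheduled jobs; taken under the same lock to avoid races. */
    List<String> logsVisibleToScheduledJobs() {
        lock.lock();
        try {
            return new ArrayList<>(dirByPartition.keySet());
        } finally {
            lock.unlock();
        }
    }

    boolean isCleaningPaused(String partition) {
        lock.lock();
        try {
            return cleaningPaused.contains(partition);
        } finally {
            lock.unlock();
        }
    }

    boolean isCheckpointed(String dir) {
        lock.lock();
        try {
            return checkpointedDirs.contains(dir);
        } finally {
            lock.unlock();
        }
    }
}
```

Handing scheduled jobs a snapshot keeps the time spent holding the lock short, at the cost of a job possibly working from a slightly stale list; whether that trade-off is acceptable is part of the performance question raised in the note.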

Partitions Restart

A partitions restart means re-electing the leader, the in-sync replicas and the assigned replicas, so that partitions that were lost on a broker due to an IO error are re-replicated on that broker.

...

All edge cases (such as the new ISR set being empty) are handled in the same way as in OfflinePartitionLeaderSelector.

Open questions

 1. Disk availability check operation

...

 Does it make sense to retry the operation before triggering a partitions restart?

Compatibility, Deprecation, and Migration Plan

No public interface changes. Users won't have to restart brokers on IO errors (e.g. after a disk becomes unavailable).

...