...

This KIP focuses on the changes required by the Kafka Controller's __cluster_metadata topic partition, but we should be able to extend it to other internal topic partitions like __consumer_offsets and __transaction.

Terminology

Kafka Controller

The component that generates snapshots, reads snapshots and reads logs for voter replicas of the topic partition __cluster_metadata.

Active Controller

The Kafka Controller that is also a leader of the __cluster_metadata topic partition. This component is allowed to append to the log.

Metadata Cache

The component that generates snapshots, reads snapshots and reads logs for observer replicas of the topic partition __cluster_metadata. This is needed to reply to Metadata RPCs, for connection information to all of the brokers, etc.

Proposed Changes

The __cluster_metadata topic will have snapshot as its cleanup.policy. The Kafka Controller will keep an in-memory representation of the replicated log up to at most the high-watermark. When generating a snapshot, the Kafka Controller will serialize this in-memory state to disk. The snapshot file on disk is identified by the largest offset and epoch from the replicated log that it includes.
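Since a snapshot is identified by the largest offset and epoch it includes, a natural way to name the file on disk is to encode both into the name. The following helper is only an illustration of that idea; the zero-padded layout and the `.checkpoint` suffix are our assumptions, not mandated by the KIP.

```java
// Hypothetical helper: derive an on-disk snapshot file name from the
// largest offset and epoch that the snapshot includes. The zero-padded
// layout is illustrative only.
static String snapshotFileName(long endOffset, int epoch) {
    return String.format("%020d-%d.checkpoint", endOffset, epoch);
}
```

Sorting such names lexicographically then orders snapshots by end offset, which makes finding the latest snapshot cheap.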

The Kafka Controller will notify the Kafka Raft client when it has finished generating a new snapshot. It is safe to truncate a prefix of the log up to the latest snapshot. The __cluster_metadata topic partition will have the latest snapshot and zero or more older snapshots. Having multiple snapshots is useful for minimizing re-fetching of the snapshot when a new snapshot is generated. These additional snapshots would have to be deleted; this is described in "When to Delete Snapshots".
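Because a prefix of the log may only be truncated up to the latest snapshot, the largest log begin offset a replica can safely move to is the one whose predecessor offset is still covered by that snapshot. A minimal sketch of that relationship, with names of our own choosing:

```java
// Hedged sketch: given the latest snapshot's end offset (the largest log
// offset it includes), the largest log begin offset that still leaves a
// snapshot covering LBO - 1 is endOffset + 1.
static long maxSafeLogBeginOffset(long latestSnapshotEndOffset) {
    return latestSnapshotEndOffset + 1;
}
```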

...

The followers and observers can concurrently fetch both the snapshot and the replicated log. During leader election, followers with an incomplete or missing snapshot that violates the invariant in the "Validation of Snapshot and Log" section will send vote requests and responses as if they had an empty log.

...

Followers and observers will increase their log begin offset to the value sent in the fetch response as long as the local state machine has generated a snapshot that includes the log begin offset minus one.
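The rule above can be sketched as a simple predicate. The method and parameter names are ours, introduced only to illustrate the check a follower or observer would make before advancing its log begin offset:

```java
// Hedged sketch of the advancement rule: a replica may only move its log
// begin offset to the value from the fetch response if its latest local
// snapshot covers offset (newLogBeginOffset - 1).
static boolean canAdvanceLogBeginOffset(long newLogBeginOffset,
                                        long latestSnapshotEndOffset) {
    // The snapshot's end offset is the largest log offset it includes,
    // so it must be at least newLogBeginOffset - 1.
    return latestSnapshotEndOffset >= newLogBeginOffset - 1;
}
```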

...

The broker will delete any snapshot with a latest offset and epoch (SnapshotOffsetAndEpoch) less than LBO - 1. That means that the question of when to delete a snapshot is the same as the question of when to increase the LBO, which is covered in the section “When To Increase the Log Begin Offset”.
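The deletion rule stated above reduces to a one-line comparison. As a sketch, with illustrative names:

```java
// Sketch of the deletion rule: a snapshot is eligible for deletion when
// its end offset is strictly less than LBO - 1. Snapshots at exactly
// LBO - 1 or later must be kept to satisfy the replica invariant.
static boolean isDeletable(long snapshotEndOffset, long logBeginOffset) {
    return snapshotEndOffset < logBeginOffset - 1;
}
```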

...

Validation of Snapshot and Log

...

For each replica we have the following invariant:

  1. If the LBO is 0, there are zero or more snapshots where the offset of the snapshot is between LBO and the High-Watermark.
  2. If the LBO is greater than 0, there are one or more snapshots where the offset of the snapshot is between LBO - 1 and the High-Watermark.

If this is not the case then the replica should assume that the log is empty. Once the replica discovers the leader for the partition it can bootstrap itself by using the Fetch and FetchSnapshot RPCs as described in this document.
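The invariant above can be expressed as a single validation check that a replica might run at startup. This is only an illustration of the two cases; the method name and representation of snapshots as an array of end offsets are our assumptions:

```java
// Illustrative check of the invariant (names are ours, not the KIP's):
// with LBO == 0, any snapshots present must lie in [LBO, high-watermark];
// with LBO > 0, at least one snapshot must lie in [LBO - 1, high-watermark].
static boolean validSnapshotState(long logBeginOffset,
                                  long highWatermark,
                                  long[] snapshotEndOffsets) {
    if (logBeginOffset == 0) {
        for (long end : snapshotEndOffsets) {
            if (end < logBeginOffset || end > highWatermark) return false;
        }
        return true; // zero snapshots is also valid here
    }
    for (long end : snapshotEndOffsets) {
        if (end >= logBeginOffset - 1 && end <= highWatermark) return true;
    }
    return false; // LBO > 0 with no covering snapshot: treat log as empty
}
```

A replica for which this returns false would assume an empty log and bootstrap from the leader via the Fetch and FetchSnapshot RPCs.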

...

  1. The __cluster_metadata topic will have a cleanup.policy value of snapshot. This configuration can only be read; updates to it will not be allowed.

  2. controller.snapshot.minimum.records is the minimum number of records committed after the latest snapshot before the Kafka Controller will perform another snapshot. See the section "When to Snapshot".

  3. max.replication.lag.ms - The maximum amount of time that the leader will wait for an offset to get replicated to all of the live replicas before advancing the LBO. See the section “When To Increase the Log Begin Offset”.
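Taken together, the configurations above might appear as follows in a broker properties file. The values shown are illustrative placeholders, not defaults from the KIP:

```properties
# Illustrative values only; property names are taken from this proposal.
# cleanup.policy is fixed to "snapshot" for __cluster_metadata and read-only.
cleanup.policy=snapshot
controller.snapshot.minimum.records=20000
max.replication.lag.ms=60000
```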

...