...

When a broker starts, whether it is a new broker, an upgraded broker, or a failed broker that is restarting, it must load the state represented by the __cluster_metadata topic partition before it becomes available. With snapshots based on the in-memory state, Kafka can be much more aggressive about how often it generates a snapshot, and a snapshot can include records up to the high watermark. In the future this mechanism will also be useful for high-throughput topic partitions like those of the Group Coordinator and Transaction Coordinator.
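The restart path can be sketched as: load the latest snapshot into the in-memory state, then replay the remaining __cluster_metadata records between the snapshot's end offset and the high watermark. All names and types here are hypothetical simplifications, not the actual Kafka Controller API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of broker startup: restore the in-memory metadata image from a
// snapshot, then replay log records the snapshot does not cover.
public class BrokerStartup {
    // Simplified stand-in for a __cluster_metadata record.
    static class MetadataRecord {
        final long offset;
        final String key;
        final String value;

        MetadataRecord(long offset, String key, String value) {
            this.offset = offset;
            this.key = key;
            this.value = value;
        }
    }

    // Simplified in-memory state: the latest value per metadata key.
    final Map<String, String> state = new HashMap<>();

    // Apply the snapshot, then every record with an offset at or after the
    // snapshot's end offset and below the high watermark. Returns the next
    // offset to fetch from.
    long restore(Map<String, String> snapshot, long snapshotEndOffset,
                 List<MetadataRecord> log, long highWatermark) {
        state.putAll(snapshot);
        long nextOffset = snapshotEndOffset;
        for (MetadataRecord r : log) {
            if (r.offset >= snapshotEndOffset && r.offset < highWatermark) {
                state.put(r.key, r.value);
                nextOffset = r.offset + 1;
            }
        }
        return nextOffset;
    }
}
```

Because the snapshot can cover everything up to the high watermark, the replay phase shrinks as snapshots are taken more aggressively.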

...

If events/deltas are not allowed in the replicated log for the __cluster_metadata topic partition, then the Kafka Controller and each broker need to persist 40 bytes per topic partition hosted by the broker. In the example above that comes out to 3.81 MB. The Kafka Controller will need to replicate 3.81 MB to each of the 10 brokers in the cluster, or 38.14 MB in total.
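The arithmetic behind those figures can be checked directly. The partition count of 100,000 is an assumption inferred to match the stated 3.81 MB (40 bytes x 100,000 = 4,000,000 bytes, which is about 3.81 MiB); the elided example above may use different inputs.

```java
// Back-of-envelope sizing for full-state (no deltas) metadata replication.
public class MetadataSizeEstimate {
    static final int BYTES_PER_PARTITION = 40;

    // Bytes the controller must replicate to a single broker.
    static long perBrokerBytes(long partitions) {
        return (long) BYTES_PER_PARTITION * partitions;
    }

    // Total bytes replicated across all brokers in the cluster.
    static long clusterBytes(long partitions, int brokers) {
        return perBrokerBytes(partitions) * brokers;
    }

    public static void main(String[] args) {
        long partitions = 100_000; // assumed to match the 3.81 MB figure
        double perBrokerMib = perBrokerBytes(partitions) / (1024.0 * 1024.0);
        double clusterMib = clusterBytes(partitions, 10) / (1024.0 * 1024.0);
        System.out.printf("per broker: %.2f MiB, cluster: %.2f MiB%n",
                perBrokerMib, clusterMib);
    }
}
```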

...

This KIP assumes that KIP-595 has been approved and implemented. With this improvement, replicas will continue to replicate their log among each other, and each replica will snapshot its log independently of the others. Snapshots will be consistent across replicas because the logs themselves are consistent. Follower and observer replicas fetch a snapshot from the leader when they attempt to fetch an offset from the leader and the leader doesn't have that offset in the log because it is included in the snapshot.
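The leader-side decision described above can be sketched as follows, using hypothetical names rather than the actual Fetch protocol fields: if the follower's fetch offset is older than the leader's log start offset, every earlier record is only available through the snapshot, so the leader points the follower at the snapshot instead of returning records.

```java
// Sketch of the leader's choice between serving records and redirecting a
// replica to the snapshot. Names here are illustrative only.
public class FetchDecision {
    enum Action { SEND_RECORDS, SEND_SNAPSHOT_ID }

    // logStartOffset: first offset still present in the leader's log;
    // offsets before it have been incorporated into the snapshot.
    static Action onFetch(long fetchOffset, long logStartOffset) {
        return fetchOffset < logStartOffset
                ? Action.SEND_SNAPSHOT_ID
                : Action.SEND_RECORDS;
    }
}
```

After fetching and loading the snapshot, the replica resumes fetching records from the snapshot's end offset.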

Generating and loading the snapshot will be delegated to the Kafka Controller. The Kafka Controller will notify the Kafka Raft client when it has generated a snapshot and up to which offset the snapshot includes. The Kafka Raft client will notify the Kafka Controller when a new snapshot has been fetched from the leader.
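The two notifications form a pair of callbacks between the Kafka Controller and the Kafka Raft client. The interfaces below are a hypothetical sketch of that contract, not the actual KIP-595 listener API.

```java
// Illustrative callback contract between the state machine (Kafka
// Controller) and the Kafka Raft client.
public class SnapshotNotifications {
    // Raft client -> Controller: a new snapshot was fetched from the leader.
    interface RaftClientListener {
        void handleSnapshotFetched(long snapshotEndOffset);
    }

    // Controller -> Raft client: a snapshot covering offsets up to
    // snapshotEndOffset has been generated, so the log before it may be
    // trimmed.
    interface SnapshotOwner {
        void handleSnapshotGenerated(long snapshotEndOffset);
    }

    // Minimal recording implementation used to illustrate the flow.
    static class Recorder implements RaftClientListener, SnapshotOwner {
        long lastFetched = -1;
        long lastGenerated = -1;

        @Override
        public void handleSnapshotFetched(long snapshotEndOffset) {
            lastFetched = snapshotEndOffset;
        }

        @Override
        public void handleSnapshotGenerated(long snapshotEndOffset) {
            lastGenerated = snapshotEndOffset;
        }
    }
}
```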

This KIP focuses on the changes required by the Kafka Controller's __cluster_metadata topic partition. This improvement will be used for internal topics (__consumer_offsets, __cluster_metadata and __transaction_state), but it should be general enough to eventually be used on external topics with consumers that support this feature.

Proposed Changes

Topic partitions with cleanup.policy set to snapshot will have log compaction delegated to the state machine. The snapshot algorithm explained in this document assumes that the state machine is in memory. The state machines for both the Kafka Controller and the Group Coordinator are already in memory. Supporting state machines that don't fit in memory is outside the scope of this document; see the Future Work section for details.

...