...

  1. If Continue is true, the follower is expected to send another FetchSnapshotRequest with Position set to Position + len(Bytes) from the response.

  2. Copy the bytes in Bytes to the local snapshot file starting at file position Position.

  3. If Continue is set to false then fetching the snapshot has finished. Notify the Kafka Controller that a new snapshot is available. The Kafka Controller will load this new snapshot and read records from the log after the snapshot offset. A sketch of this fetch loop is shown below.
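The steps above can be summarized in a short sketch. This is not the actual implementation: the SnapshotChunk record and the ChunkFetcher hook are hypothetical stand-ins for the FetchSnapshotRequest/FetchSnapshotResponse plumbing, assuming the response carries the Position, Bytes and Continue fields described above.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SnapshotFetchLoop {

    /** Hypothetical stand-in for the Position, Bytes and Continue fields of a FetchSnapshot response. */
    public record SnapshotChunk(long position, ByteBuffer bytes, boolean continueFetching) { }

    /** Hypothetical transport hook that sends a FetchSnapshotRequest for the given position. */
    @FunctionalInterface
    public interface ChunkFetcher {
        SnapshotChunk fetch(long position) throws IOException;
    }

    /**
     * Fetch the snapshot chunk by chunk: write each Bytes payload into the local snapshot
     * file at the returned Position, then request the next chunk at Position + len(Bytes)
     * until the leader sets Continue to false.
     */
    public static void fetchSnapshot(ChunkFetcher fetcher, Path snapshotFile) throws IOException {
        long nextPosition = 0L;
        try (FileChannel channel = FileChannel.open(snapshotFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            while (true) {
                SnapshotChunk chunk = fetcher.fetch(nextPosition);
                int chunkLength = chunk.bytes().remaining();

                // Step 2: copy Bytes into the local snapshot file starting at file position Position.
                long filePosition = chunk.position();
                while (chunk.bytes().hasRemaining()) {
                    filePosition += channel.write(chunk.bytes(), filePosition);
                }

                // Step 3: if Continue is false, fetching has finished; the caller would now
                // notify the Kafka Controller that the new snapshot is available.
                if (!chunk.continueFetching()) {
                    return;
                }

                // Step 1: otherwise request the next chunk at Position + len(Bytes).
                nextPosition = chunk.position() + chunkLength;
            }
        }
    }
}
```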

...

Change leader when generating snapshots: Instead of supporting concurrent snapshot generation and writes to the client state, we could require that only non-leaders generate snapshots; if the leader wants to generate a snapshot, it would first relinquish leadership. We decided not to implement this for two reasons. 1. All replicas/brokers in the cluster will have a metadata cache that applies the replicated log and allows read-only access by other components in the Kafka Broker. 2. Even though leader election within the replicated log may be efficient, communicating the new leader to all of the brokers and all of the external clients (admin clients, consumers, producers) may incur a large latency.


Just fix the tombstone GC issue with log compaction: we have also talked about just fixing the tombstone garbage collection issue listed in the motivation instead of introducing the snapshot mechanism. The key idea is that fetchers, whether followers or consumers, need to be notified if they "may" be missing some tombstones that have deprecated some old records. Since the offset span between the deprecated record and the tombstone for the same key can be arbitrarily long (for example, a record with key A at offset 0, and then a tombstone record for key A appended a week later at offset 1 billion), we would have to take a conservative approach here. More specifically:

  • The leader would maintain an additional offset marker up to which it has deleted tombstones, i.e. all tombstones larger than this offset are still in the log. Let's call it the tombstone horizon marker. This horizon marker would be returned in each fetch response.
  • Follower replicas would not delete tombstones beyond the received leader's horizon marker, and would maintain their own marker as well, which should always be smaller than the leader's horizon marker offset.
  • All fetchers (follower and consumer) would keep track of the horizon marker from the replica they are fetching from. The expectation is that leaders should only delete tombstones in a very lazy manner, controlled by the existing config, so that the fetching position is larger than the horizon marker most of the time. However, if a fetcher does find its fetching position below the horizon marker, then it needs to handle it accordingly (see the sketch after this list):
    • For consumer fetchers, we need to throw an extended error of InvalidOffsetException to notify the users that we may have missed some tombstones, and let the users reset their consumed record state. For example, if the caller of the consumer is a Streams client that is building a local state from the fetched records, that state needs to be wiped.
    • For replica fetchers, we need to wipe the local log altogether and then restart fetching from the starting offset.
    • For either type of fetcher, we will set an internal flag, let's call it unsafe-bootstrapping mode, to true, while remembering the horizon marker offset at that moment from the replica it is fetching from.
      • When this flag is set to true, we would not re-trigger this handling logic even when the fetching position is smaller than the horizon marker, since otherwise we would fall into a loop of rewinding.
      • However, during this period of time, if the horizon marker offset gets updated in a fetch response, meaning that more tombstones have been deleted, then we would need to retry this unsafe bootstrap from step one.
      • After the fetching position has advanced beyond the remembered horizon marker offset, we would reset this flag to false to go back to normal mode.
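A minimal sketch of this fetcher-side handling, assuming hypothetical names throughout: TombstoneHorizonTracker, onFetchResponse, and a resetState callback that stands in for wiping the consumer's materialized state or the replica's local log (a consumer fetcher would instead surface the reset to the user as the extended InvalidOffsetException). Only the decision logic follows the steps in the list above.

```java
/**
 * Hypothetical sketch of the tombstone horizon marker handling on the fetcher side.
 * The names are illustrative; the logic mirrors the steps described in the list above.
 */
public class TombstoneHorizonTracker {

    private boolean unsafeBootstrapping = false;
    private long rememberedHorizon = -1L;

    /** Stand-in for wiping local state (consumer) or the local log (replica fetcher). */
    private final Runnable resetState;

    public TombstoneHorizonTracker(Runnable resetState) {
        this.resetState = resetState;
    }

    /**
     * Called for each fetch response with the current fetching position, the horizon
     * marker returned by the replica being fetched from, and that replica's log start
     * offset. Returns the position to continue fetching from.
     */
    public long onFetchResponse(long fetchPosition, long horizonMarker, long logStartOffset) {
        if (unsafeBootstrapping) {
            if (horizonMarker > rememberedHorizon) {
                // More tombstones were deleted while bootstrapping: retry the
                // unsafe bootstrap from step one against the new horizon marker.
                rememberedHorizon = horizonMarker;
                resetState.run();
                return logStartOffset;
            }
            if (fetchPosition > rememberedHorizon) {
                // Fetching has advanced beyond the remembered marker: back to normal mode.
                unsafeBootstrapping = false;
            }
            return fetchPosition;
        }

        if (fetchPosition < horizonMarker) {
            // We may have missed some tombstones: wipe local state, remember the marker,
            // and enter unsafe-bootstrapping mode so this check does not loop while catching up.
            unsafeBootstrapping = true;
            rememberedHorizon = horizonMarker;
            resetState.run();
            return logStartOffset;
        }
        return fetchPosition;
    }
}
```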


We feel this conservative approach is equally complicated as, if not more complicated than, adding a snapshot mechanism, since it requires rather tricky handling logic on the fetcher side; plus, it only resolves one of the motivations, namely the tombstone GC issue.

Another thought against this idea is that, for now, most of the internal topics that would need the snapshotting mechanism (controller, txn coordinator, group coordinator) can hold their materialized cache entirely in memory, and hence generating snapshots should be comparatively efficient to execute.