...

The earlier approach consists of pulling the remote log segment metadata from remote log storage APIs, as mentioned in the earlier RemoteStorageManager_Old section. This approach worked fine for storages like HDFS. One of the problems with relying on the remote storage to maintain metadata is that tiered storage then requires the remote storage to be strongly consistent, which affects not only the metadata itself (e.g. LIST in S3) but also the segment data (e.g. GET after a DELETE in S3). Also, the cost (and to a lesser extent performance) of maintaining metadata in remote storage needs to be factored in. In the case of S3, frequent LIST API calls incur significant costs.

...

If the broker changes its state from Leader to Follower for a topic-partition while RLM is in the process of copying a segment, it will finish the copy before it relinquishes the copy task for that topic-partition. This might leave duplicated segments, but these will be cleaned up when those segments are ready for deletion based on remote retention configs.

...

Currently, followers replicate the data from the leader and try to catch up until the log-end-offset of the leader to become in-sync replicas. Followers maintain the same log segment lineage as the leader by doing truncation if required.

With tiered storage, followers need to maintain the same log segment lineage as the leader. Followers replicate only the data that is available on the leader's local storage. But they need to build state such as the leader epoch cache and producer-id snapshots for the remote segments, and they also need to do truncation if required.

The below diagram gives a brief overview of the interaction between the leader, followers, and remote log and metadata storages. It is described in more detail in the next section.

...

  1. Leader copies log segments with the auxiliary state (which includes the leader epoch cache and producer-id snapshots) to remote storage.
  2. Leader publishes remote log segment metadata about the copied remote log segment (sketched after this list).
  3. Follower tries to fetch the messages from the leader and follows the protocol described in detail in the next section.
  4. Follower waits until it has caught up consuming the required remote log segment metadata.
  5. Follower fetches the respective remote log segment metadata to build auxiliary state.
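
A minimal sketch of steps 1 and 2 above. The interfaces and method names here are simplified placeholders, not the actual RemoteStorageManager / RemoteLogMetadataManager APIs defined by this KIP.

    // Simplified, hypothetical sketch of the leader-side copy path (steps 1 and 2).
    // These interfaces are placeholders, not the actual RemoteStorageManager /
    // RemoteLogMetadataManager APIs.
    final class LeaderCopySketch {
        interface SegmentWithAuxState { }   // log segment plus leader epoch cache and producer-id snapshot
        interface RemoteSegmentMetadata { } // metadata describing the copied remote segment

        interface RemoteStorage {
            // Step 1: copy the segment together with its auxiliary state to remote storage.
            RemoteSegmentMetadata copySegment(SegmentWithAuxState segment);
        }

        interface MetadataPublisher {
            // Step 2: publish metadata about the copied segment so followers can consume it.
            void publish(RemoteSegmentMetadata metadata);
        }

        static void copyAndPublish(RemoteStorage storage, MetadataPublisher rlmm, SegmentWithAuxState segment) {
            RemoteSegmentMetadata metadata = storage.copySegment(segment); // step 1
            rlmm.publish(metadata);                                        // step 2
        }
    }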

...

Currently, followers build the auxiliary state (i.e. leader epoch sequence, producer snapshot state) when they fetch messages from the leader, by reading the message batches. In the case of tiered storage, the follower finds the offset and leader epoch up to which the auxiliary state needs to be built from the leader. After that, followers start fetching data from the leader starting from that offset. That offset can be the local-log-start-offset or the next-local-offset. Local-log-start-offset is the log start offset of the local storage. Next-local-offset is the offset up to which the segments have been copied to remote storage. We describe the pros and cons of choosing each of these offsets below.
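
A rough illustration of the two candidate starting offsets. LeaderLogView and its accessors are hypothetical placeholders used only for contrast, not actual Kafka classes.

    // Hypothetical sketch contrasting the two candidate offsets discussed above;
    // LeaderLogView and its accessors are placeholders, not actual Kafka classes.
    final class FollowerFetchStartOffsetSketch {
        interface LeaderLogView {
            long localLogStartOffset();          // first offset still present on the leader's local disk
            long highestOffsetCopiedToRemote();  // last offset already copied to remote storage
        }

        static long chooseStartOffset(LeaderLogView leader, boolean useLocalLogStartOffset) {
            long localLogStartOffset = leader.localLogStartOffset();
            long nextLocalOffset = leader.highestOffsetCopiedToRemote() + 1;
            // next-local-offset: faster catch-up, but the follower keeps fewer local
            // segments than the leader.
            // local-log-start-offset: the follower mirrors the leader's local segment
            // lineage (the option preferred later in this document).
            return useLocalLogStartOffset ? localLogStartOffset : nextLocalOffset;
        }
    }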

next-local-offset

  • The advantage of this option is that followers can catch up quickly with the leader, as the segments required to be fetched by the followers are only those that have not yet been moved to remote storage.
  • One disadvantage of this approach is that followers may have fewer local segments than the leader. When such a follower becomes the leader, the existing followers will truncate their logs to the leader's local log-start-offset.

...

We prefer to go with the local log start offset as the offset from which a follower starts to replicate the local log segments, for the above-mentioned reasons.

With tiered storage, the leader only returns the data that is still in the leader's local storage. Log segments that exist only on remote storage are not replicated to followers, as they are already present in remote storage. Followers fetch offsets and truncate their local logs if needed with the current mechanism, based on the leader's local-log-start-offset. This is described with several cases in detail in the next section.

...

Repeatedly call the FetchEarliestOffsetFromLeader API from ELO-LE down to the earliest leader epoch that the leader knows, and build the local leader epoch cache accordingly. This option may not be very efficient when there have been a lot of leadership changes. The advantage of this option is that the entire process stays within Kafka: even when the remote storage is temporarily unavailable, the followers can still catch up and join the ISR.
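
A hedged sketch of this option. The Leader interface and fetchEarliestOffsetFromLeader method are placeholders for the API mentioned above, and iterating epochs one by one is a simplification (real leader epochs may be sparse).

    import java.util.NavigableMap;
    import java.util.TreeMap;

    // Hypothetical sketch only: Leader and fetchEarliestOffsetFromLeader are
    // placeholders for the FetchEarliestOffsetFromLeader API mentioned above.
    final class BuildEpochCacheFromLeaderSketch {
        interface Leader {
            // Returns the earliest offset the leader knows for the given leader epoch,
            // or a negative value if the leader knows no such epoch.
            long fetchEarliestOffsetFromLeader(int leaderEpoch);
        }

        // Walks backwards from ELO-LE, filling the local leader epoch cache.
        static NavigableMap<Integer, Long> buildEpochCache(Leader leader, int eloLeaderEpoch) {
            NavigableMap<Integer, Long> epochCache = new TreeMap<>();
            for (int epoch = eloLeaderEpoch; epoch >= 0; epoch--) {
                long earliestOffset = leader.fetchEarliestOffsetFromLeader(epoch);
                if (earliestOffset < 0)
                    break; // the leader does not know this or any earlier epoch
                epochCache.put(epoch, earliestOffset);
            }
            return epochCache;
        }
    }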

...

1) Wait for RLMM to receive remote segment information, until there is a remote segment that contains the ELO-LE (see the sketch below).
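
A minimal polling sketch of this wait. The Rlmm interface and remoteSegmentContaining method are placeholders, not the actual RemoteLogMetadataManager API.

    import java.util.Optional;

    // Hypothetical sketch: Rlmm and remoteSegmentContaining are placeholders for the
    // RemoteLogMetadataManager lookup, not its actual API.
    final class WaitForRemoteSegmentSketch {
        interface Rlmm {
            // Returns metadata of a remote segment containing the given offset and
            // leader epoch, if the RLMM has already consumed it.
            Optional<Object> remoteSegmentContaining(long offset, int leaderEpoch);
        }

        static void waitForSegmentCovering(Rlmm rlmm, long elo, int eloLeaderEpoch) throws InterruptedException {
            // Block until RLMM has caught up far enough on the metadata topic to know
            // about a remote segment that contains ELO-LE.
            while (rlmm.remoteSegmentContaining(elo, eloLeaderEpoch).isEmpty()) {
                Thread.sleep(100);
            }
        }
    }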

...

RLM receives a callback and unassigns the partition from the leader/follower task. If the delete option is enabled, the leader will stop the RLM task and stop processing; it sets all the remote log segment metadata of that partition with a delete marker and publishes them to RLMM. The controller will eventually delete these segments by using RemoteStorageManager. The controller will not allow a topic with the same name to be created until all the segments are cleaned up from remote storage.
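
A simplified sketch of this deletion flow. All types and method names below are placeholders, not the actual RLMM / RemoteStorageManager interfaces.

    // Hypothetical sketch of the topic-deletion flow described above; all types and
    // methods are placeholders, not the actual RLMM / RemoteStorageManager APIs.
    final class RemoteTopicDeletionSketch {
        enum SegmentState { COPY_FINISHED, DELETE_MARKED, DELETE_FINISHED }

        interface SegmentMetadata { SegmentMetadata withState(SegmentState state); }

        interface MetadataManager {
            Iterable<SegmentMetadata> remoteSegments(String topicPartition);
            void publish(SegmentMetadata metadata);
        }

        interface StorageManager { void deleteSegment(SegmentMetadata metadata); }

        // Leader side: mark every remote segment of the deleted partition and publish
        // the delete markers to the RLMM.
        static void markPartitionForDeletion(MetadataManager rlmm, String topicPartition) {
            for (SegmentMetadata segment : rlmm.remoteSegments(topicPartition))
                rlmm.publish(segment.withState(SegmentState.DELETE_MARKED));
        }

        // Controller side: eventually delete the marked segments from remote storage
        // and record completion, after which the topic name can be reused.
        static void deleteMarkedSegments(MetadataManager rlmm, StorageManager rsm, String topicPartition) {
            for (SegmentMetadata segment : rlmm.remoteSegments(topicPartition)) {
                rsm.deleteSegment(segment);
                rlmm.publish(segment.withState(SegmentState.DELETE_FINISHED));
            }
        }
    }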

Adding a UUID to represent a topic along with its name was discussed earlier in the community as part of KIP-516. This enhancement will be useful to make the deletion of topic partitions in remote storage asynchronous, without blocking the creation of a topic with the same name even though not all the segments have been deleted from remote storage.

OffsetForLeaderEpoch

...