Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Proposed Changes

High-level design


Image Modified


The earlier approach consists of pulling the remote log segment metadata from remote log storage APIs as mentioned in the earlier RemoteStorageManager_Old section. This approach worked fine for storages like HDFS. One of the problems of relying on the remote storage to maintain metadata is that tiered-storage needs to be strongly consistent, with an impact not only on the metadata itself (e.g. LIST in S3) but also on the segment data (e.g. GET after a DELETE in S3). Additionally to consistency and availability, the cost (and to a lesser extent performance) of maintaining metadata in remote storage needs to be factored in. In case of S3, frequent LIST APIs incur huge costs. 

...

Below are the leader topic partition's log offsets

Image Modified

...

Lx  = Local log start offset           Lz  = Local log end offset            Ly  = Last stable offset(LSO)

...

Ry  = Remote log end offset       Rx  = Remote log start offset

Lz >= Ly and Ly >= Lx and Ly >= Rand Ry >= Rx

Manage Remote Log Segments

...

For follower replicas, it maintains metadata cache by subscribing to the respective remote log metadata topic partitions. Whenever a topic partition is reassigned to a new broker and RLMM on that broker is not subscribed to the respective remote log metadata topic partition then it will subscribe to the respective remote log metadata topic partition and adds all the entries to the cache. So, in the worst case, RLMM on a broker may be consuming from most of the remote log metadata topic partitions. This requires the cache to be based on disk storage like RocksDB to avoid a high memory footprint on a broker. This will allow us to commit offsets of the partitions that are already read. Committed offsets can be stored in a local file to avoid reading the messages again when a broker is restarted.


Image Modified

[We may add more details later about how the resultant state for each topic partition is computed ]

...