...

We separated RemoteStorageManager and RemoteLogMetadataManager after working through the challenges we faced with eventually consistent stores like S3. You can see the discussion details in the doc located here.

...

Metadata of remote log segments is stored in an internal topic called `__remote_log_metadata`. By default, this topic is created with 50 partitions. Users can configure the topic name, partition count, replication factor, etc.

In this design, RemoteLogMetadataManager (RLMM) is responsible for storing and fetching remote log metadata. It provides

...

user-topic-partition.toString().hashCode() % no_of_remote_log_metadata_topic_partitions
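The mapping above can be sketched as follows. This is an illustrative stand-in, not the actual Kafka implementation: the class name and the `"topic-partition"` string form are assumptions, and `Math.floorMod` is used so the result stays non-negative even when `hashCode()` is negative, a detail the formula above elides.

```java
public class RemoteMetadataPartitioner {
    // Maps a user topic-partition (e.g. partition 3 of "orders", string form
    // "orders-3") to a partition of the __remote_log_metadata topic.
    public static int metadataPartition(String topic, int partition, int numMetadataPartitions) {
        String topicPartition = topic + "-" + partition;
        // floorMod keeps the result in [0, numMetadataPartitions).
        return Math.floorMod(topicPartition.hashCode(), numMetadataPartitions);
    }

    public static void main(String[] args) {
        System.out.println(metadataPartition("orders", 3, 50));
    }
}
```

Because the mapping is deterministic, every broker computes the same metadata partition for a given user topic-partition without coordination.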

RLMM registers the topic partitions for which the broker is either a leader or a follower.

For leader partition replicas, RemoteLogManager (RLM) copies the log segment and its indexes to the remote storage with a generated RemoteLogSegmentId (a UUID), via the RemoteStorageManager#copyLogSegment API. After this, RLM calls RLMM to store the remote log metadata; RLMM writes it to the remote log metadata topic and updates its cache.
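The leader-side flow can be sketched with simplified stand-ins for the two interfaces; the method shapes below are assumptions for illustration, not the actual KIP-405 signatures.

```java
import java.util.*;

public class LeaderCopyFlow {
    // Simplified stand-ins for the real interfaces (illustrative only).
    interface RemoteStorageManager {
        void copyLogSegment(UUID segmentId, byte[] segmentWithIndexes);
    }
    interface RemoteLogMetadataManager {
        void putRemoteLogSegmentMetadata(UUID segmentId, long baseOffset, long endOffset);
    }

    // 1. Generate a RemoteLogSegmentId, 2. copy segment + indexes to remote
    // storage, 3. record the metadata via RLMM (which writes it to the
    // metadata topic and updates the cache).
    static UUID copySegmentToRemote(RemoteStorageManager rsm, RemoteLogMetadataManager rlmm,
                                    byte[] segmentWithIndexes, long baseOffset, long endOffset) {
        UUID segmentId = UUID.randomUUID();
        rsm.copyLogSegment(segmentId, segmentWithIndexes);
        rlmm.putRemoteLogSegmentMetadata(segmentId, baseOffset, endOffset);
        return segmentId;
    }

    public static void main(String[] args) {
        Set<UUID> remoteStore = new HashSet<>();
        Map<UUID, long[]> metadata = new HashMap<>();
        UUID id = copySegmentToRemote(
            (sid, bytes) -> remoteStore.add(sid),
            (sid, lo, hi) -> metadata.put(sid, new long[]{lo, hi}),
            new byte[0], 0L, 999L);
        System.out.println(remoteStore.contains(id) && metadata.containsKey(id));
    }
}
```

The ordering matters: metadata is stored only after the copy succeeds, so a metadata entry never points at a segment that was not uploaded.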

For follower partition replicas, RLM fetches the remote log segment information for a given offset from RLMM, and fetches the remote log index entries using RemoteStorageManager.
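The "segment information for a given offset" lookup can be sketched with a sorted map keyed by segment base offset; the class and method names here are illustrative assumptions, not the actual RLMM API.

```java
import java.util.*;

public class RemoteSegmentLookup {
    // baseOffset -> RemoteLogSegmentId for segments already copied to remote storage.
    private final TreeMap<Long, UUID> segmentsByBaseOffset = new TreeMap<>();

    public void add(long baseOffset, UUID segmentId) {
        segmentsByBaseOffset.put(baseOffset, segmentId);
    }

    // Returns the id of the segment whose range starts at or before `offset`,
    // i.e. the candidate segment containing that offset, if any exists.
    public Optional<UUID> segmentFor(long offset) {
        Map.Entry<Long, UUID> e = segmentsByBaseOffset.floorEntry(offset);
        return e == null ? Optional.empty() : Optional.of(e.getValue());
    }
}
```

A real lookup would also check the segment's end offset, but the floor-entry search is the core of resolving a fetch offset to a remote segment.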

RLMM registers the topic partitions for which the broker is either a leader or a follower.

For leader topic partitions, it follows the process as mentioned in the earlier section. 

For follower partitions, it maintains the metadata cache by subscribing to the respective remote log metadata topic partitions. Whenever a topic partition is reassigned to a new broker and RLMM on that broker is not yet subscribed to the respective remote log metadata topic partition, it subscribes to that partition and adds all the entries to the cache. So, in the worst case, RLMM on a broker may consume from most of the remote log metadata topic partitions. This requires the cache to be backed by disk storage, such as RocksDB, to avoid a high memory footprint on a broker. It also allows us to commit the offsets of the partitions that have already been read; committed offsets can be stored in a local file to avoid re-reading the messages when a broker is restarted.
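The committed-offset bookkeeping described above can be sketched as follows; the class name and file format are assumptions, and a plain file stands in for whatever persistence the implementation ultimately uses alongside the RocksDB-backed cache.

```java
import java.io.*;
import java.nio.file.*;

// Persists the last committed offset of a remote log metadata topic partition
// to a local file, so a restarted broker resumes consuming from there instead
// of re-reading the partition from the beginning.
public class CommittedOffsetStore {
    private final Path offsetFile;

    public CommittedOffsetStore(Path offsetFile) {
        this.offsetFile = offsetFile;
    }

    // Returns the offset to resume from; 0 if nothing was committed yet.
    public long load() {
        try {
            if (!Files.exists(offsetFile)) return 0L;
            String contents = Files.readString(offsetFile).trim();
            return contents.isEmpty() ? 0L : Long.parseLong(contents);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Called after a batch of metadata records has been applied to the cache.
    public void commit(long nextOffset) {
        try {
            Files.writeString(offsetFile, Long.toString(nextOffset));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Committing only after the records are applied to the cache means a crash between apply and commit re-reads a few records, which is safe because applying metadata entries is idempotent.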

...