...

Tiered storage does not replace ETL pipelines and jobs. Existing ETL pipelines continue to consume data from Kafka as they do today, only with a much longer retention period for the data in Kafka.

...


Remote Log Manager (RLM) is a new component that copies completed LogSegments and their corresponding OffsetIndexes to the remote tier.

  • The RLM component keeps track of topic-partitions and their segments. It delegates the copying and reading of these segments to a pluggable storage manager implementation.
  • RLM has two modes:
    • RLM Leader - In this mode, the RLM that is the leader for a topic-partition checks for rolled-over LogSegments and copies them, along with the OffsetIndex, to the remote tier. RLM creates an index file, called RemoteLogSegmentIndex, per topic-partition to track remote LogSegments. Additionally, the RLM leader also serves read requests for older data from the remote tier.
    • RLM Follower - In this mode, the RLM keeps track of the segments and index files on the remote tier and updates its RemoteLogSegmentIndex file per topic-partition. The RLM follower does not serve reads of old data from the remote tier.
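The leader/follower split above can be sketched as follows. This is a toy Java sketch under stated assumptions; the class, method, and field names are illustrative, not the KIP's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the two RLM modes described above.
public class RlmModeSketch {
    enum Mode { LEADER, FOLLOWER }

    // Stand-in for the per-partition RemoteLogSegmentIndex: segments known
    // to exist on the remote tier.
    static final List<String> remoteLogSegmentIndex = new ArrayList<>();

    static void onLogRoll(Mode mode, String segmentName) {
        if (mode == Mode.LEADER) {
            // Only the leader copies the rolled-over segment (plus its
            // OffsetIndex) to the remote tier and records it in the index.
            remoteLogSegmentIndex.add(segmentName);
        }
        // A follower does nothing here; it discovers remote segments by
        // tracking the remote tier and updating its own index copy.
    }

    static boolean canServeRemoteRead(Mode mode) {
        // Only the leader serves reads of older data from the remote tier.
        return mode == Mode.LEADER;
    }

    public static void main(String[] args) {
        onLogRoll(Mode.LEADER, "00000000000000000000.log");
        System.out.println(remoteLogSegmentIndex.size());      // 1
        System.out.println(canServeRemoteRead(Mode.FOLLOWER)); // false
    }
}
```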

...

System-Wide

  • remote.log.storage.enable = false (to support backward compatibility)
  • remote.storage.manager.class = org.apache.kafka.log.HdfsRemoteStorageManager

RemoteStorageManager

(These configs are dependent on the remote storage manager implementation)

  • remote.storage.manager.config

Per Topic Configuration

  • remote.retention.period
  • remote.retention.bytes
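As a rough illustration, the system-wide settings might appear in a broker's server.properties as below. The file path value is a placeholder, and the per-topic retention settings would be applied through topic-level configuration rather than this file.

```properties
# System-wide: enable tiered storage and select the storage manager plugin
remote.log.storage.enable=true
remote.storage.manager.class=org.apache.kafka.log.HdfsRemoteStorageManager

# Passed through to the storage manager implementation (placeholder path)
remote.storage.manager.config=/etc/kafka/remote-storage.properties
```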

...

RemoteLogManager is a new class added to the broker. It is responsible for copying completed log segments to remote storage and updating the RemoteOffsetIndex file.

Only the broker that is the leader for a topic-partition is allowed to copy that partition's segments to remote storage.

Note: Early proposal. To be finalized during implementation.
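The copy responsibility described above can be sketched roughly as below. This is a toy Java sketch; the names and functional parameters are assumptions for illustration, not the KIP's API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

// Illustrative copy loop in the spirit of the RemoteLogManager description.
public class RemoteCopySketch {
    static int copyCompletedSegments(Deque<String> completedSegments,
                                     Consumer<String> copyToRemote,
                                     Consumer<String> updateRemoteOffsetIndex) {
        int copied = 0;
        while (!completedSegments.isEmpty()) {
            String segment = completedSegments.poll();
            copyToRemote.accept(segment);            // segment + OffsetIndex go to the remote tier
            updateRemoteOffsetIndex.accept(segment); // then the RemoteOffsetIndex file is updated
            copied++;
        }
        return copied;
    }

    public static void main(String[] args) {
        Deque<String> done = new ArrayDeque<>();
        done.add("00000000000000000000.log");
        done.add("00000000000000001000.log");
        int n = copyCompletedSegments(done, s -> {}, s -> {});
        System.out.println(n); // 2
    }
}
```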

...


...


Remote Storage Manager:

RemoteStorageManager is an interface that allows plugging in different remote storage implementations to copy log segments. The default implementation of the interface supports HDFS as remote storage. Additional remote storage support, such as S3, can be added later by providing other implementations through the remote.storage.manager.class configuration.
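A hedged sketch of what such a pluggable interface could look like is shown below, with a toy in-memory "remote tier" standing in for HDFS or S3. The method names and signatures are illustrative assumptions, not the KIP's final API.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class RemoteStorageSketch {
    // Hypothetical pluggable interface; a real implementation would be
    // selected via remote.storage.manager.class.
    interface RemoteStorageManager {
        void copyLogSegment(String topicPartition, String segmentName, byte[] data) throws IOException;
        byte[] read(String topicPartition, String segmentName) throws IOException;
    }

    // Toy in-memory implementation, standing in for an HDFS/S3 backend.
    static class InMemoryRemoteStorageManager implements RemoteStorageManager {
        private final Map<String, byte[]> store = new HashMap<>();

        public void copyLogSegment(String tp, String seg, byte[] data) {
            store.put(tp + "/" + seg, data.clone());
        }

        public byte[] read(String tp, String seg) throws IOException {
            byte[] d = store.get(tp + "/" + seg);
            if (d == null) throw new IOException("segment not on remote tier: " + seg);
            return d.clone();
        }
    }

    public static void main(String[] args) throws IOException {
        RemoteStorageManager rsm = new InMemoryRemoteStorageManager();
        rsm.copyLogSegment("topicA-0", "00000000000000000000.log", new byte[]{1, 2, 3});
        System.out.println(Arrays.toString(rsm.read("topicA-0", "00000000000000000000.log"))); // [1, 2, 3]
    }
}
```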

...

For follower fetch, the leader only returns data that is still in its local storage. If a LogSegment was copied to remote storage by the leader broker, a follower does not need to copy that segment again, since it is already present in remote storage. Instead, the follower retrieves the segment's information from remote storage. If a replica becomes the leader, it can still locate and serve data from remote storage.
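The per-segment decision described above can be sketched as follows. This is an illustrative Java sketch; the action names and types are stand-ins, not actual Kafka replication code.

```java
import java.util.Set;

public class FollowerFetchSketch {
    // Decide, per segment, whether the follower replicates bytes from the
    // leader or only records that the segment already lives on the remote tier.
    static String followerActionFor(String segment, Set<String> remoteSegments) {
        if (remoteSegments.contains(segment)) {
            // Already on remote storage: fetch only its metadata/index entry.
            return "SYNC_REMOTE_METADATA";
        }
        // Still local on the leader: replicate the bytes as Kafka does today.
        return "REPLICATE_FROM_LEADER";
    }

    public static void main(String[] args) {
        Set<String> remote = Set.of("00000000000000000000.log");
        System.out.println(followerActionFor("00000000000000000000.log", remote)); // SYNC_REMOTE_METADATA
        System.out.println(followerActionFor("00000000000000002000.log", remote)); // REPLICATE_FROM_LEADER
    }
}
```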

...

Remote storage (e.g. S3 / HDFS) is likely to have higher I/O latency and lower availability than local storage.

When remote storage becomes temporarily unavailable (up to several hours) or has high latency (up to minutes), Kafka should still be able to operate normally. All Kafka operations (produce, consume local data, create/expand topics, etc.) that do not rely on remote storage should not be impacted. Consumers that try to consume remote data should get reasonable errors when remote storage is unavailable or remote storage requests time out.
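One way to sketch this degradation behaviour is to bound every remote read with a timeout, so a slow or unavailable remote tier cannot block the broker and the consumer gets a prompt error instead of a hang. This is only an illustrative Java sketch, not the KIP's mechanism; names are assumptions.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class RemoteReadGuardSketch {
    static Optional<byte[]> readRemoteWithTimeout(Callable<byte[]> remoteRead, long timeoutMs) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            return Optional.of(pool.submit(remoteRead).get(timeoutMs, TimeUnit.MILLISECONDS));
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            // Surface a "reasonable error" (here: empty) to the consumer
            // instead of blocking the fetch path; local operations never
            // touch this code at all.
            return Optional.empty();
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A fast remote read succeeds.
        System.out.println(readRemoteWithTimeout(() -> new byte[]{42}, 1000).isPresent()); // true
        // A read slower than the timeout fails gracefully.
        System.out.println(readRemoteWithTimeout(() -> { Thread.sleep(500); return new byte[0]; }, 50).isPresent()); // false
    }
}
```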

...

  1. Replace all local storage with remote storage - Instead of using local storage on Kafka brokers, only remote storage is used for storing log segments and offset index files. While this has benefits in reducing local storage, it does not leverage the local disk for efficient reads of the latest data, as Kafka does today.

  2. Implement Kafka API on another store - This is an approach taken by some vendors, where the Kafka API is implemented on a different distributed, scalable storage system (for example, HDFS). Such an option does not leverage Kafka beyond API compliance and requires the much riskier step of replacing the entire Kafka cluster with another system.

  3. Client directly reads remote log segments from the remote storage - The log segments on the remote storage could be read directly by the client instead of being served by the Kafka broker. This reduces Kafka broker changes and has the benefit of removing an extra hop. However, it completely bypasses Kafka security, increases Kafka client library complexity and footprint, and hence is not considered.