...

  1. Replace all local storage with remote storage - Instead of using local storage on Kafka brokers, only remote storage is used for storing log segments and offset index files. While this reduces local storage requirements, it cannot leverage the OS page cache and local disk for efficient reads of the latest data, as Kafka does today.
  2. Implement Kafka API on another store - This approach is taken by some vendors, where the Kafka API is implemented on top of a different distributed, scalable storage system (e.g., HDFS). Such an option does not leverage Kafka beyond API compatibility and requires the much riskier step of replacing the entire Kafka cluster with another system.
  3. Client directly reads remote log segments from the remote storage - Log segments on the remote storage could be read directly by the client instead of being served by the Kafka broker. This reduces Kafka broker changes and has the benefit of removing an extra hop. However, it bypasses Kafka security completely, increases Kafka client library complexity and footprint, and causes compatibility issues with existing Kafka client libraries, and hence is not considered.
  4. Store all remote segment metadata in remote storage - This approach works with storage systems that provide strongly consistent metadata, such as HDFS, but does not work with S3 and GCS. Frequently calling the LIST API on S3 or GCS also incurs significant costs. So, we choose to store metadata in a Kafka topic in the default implementation, but allow users to use other methods with their own RLMM implementations (a sketch of such a pluggable contract follows this list).
  5. Cache all remote log indexes in local storage - Store remote log segment information in local storage.
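
To make the pluggable-metadata point in alternative 4 concrete, below is a minimal Java sketch of what such a contract could look like. The interface and method names here (RemoteSegmentMetadataStore, putSegment, segmentBaseOffsetFor) are invented for illustration and are not the KIP's actual RemoteLogMetadataManager API; the intent is only to show that the default implementation can materialize its state from an internal Kafka topic while users plug in alternatives backed by other stores.

```java
import java.io.Closeable;
import java.util.Iterator;
import java.util.Map;
import java.util.Optional;

/**
 * Hypothetical, simplified contract for a pluggable remote log metadata store.
 * Method names are illustrative only; the KIP defines the real RLMM interface.
 */
public interface RemoteSegmentMetadataStore extends Closeable {

    /** Called once with implementation-specific configs (e.g. a metadata topic name). */
    void configure(Map<String, ?> configs);

    /** Record metadata for a segment that has been copied to remote storage. */
    void putSegment(String topicIdPartition, long baseOffset, long endOffset, int leaderEpoch);

    /** Look up the base offset of the remote segment (if any) that contains the given offset. */
    Optional<Long> segmentBaseOffsetFor(String topicIdPartition, long offset);

    /** List all remote segments known for a partition, e.g. for retention or deletion. */
    Iterator<Long> listSegmentBaseOffsets(String topicIdPartition);
}
```

The default (topic-based) implementation would build this view by consuming the internal metadata topic, while an HDFS-backed implementation could instead rely on the remote store's own strongly consistent listing.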


Meeting Notes


  •  Discussion Recording
  • Notes

    1. Tiered storage upgrade path discussion:

    • Details need to be documented in the KIP.
    • Current upgrade path plan is based on IBP bump.
    • Enabling the remote log components may not mean all topics are eligible for tiering at the same time (a per-topic enablement sketch follows this list).
    • Should tiered storage be enabled on all brokers before enabling it on any brokers?
    • Is there any replication path dependency for enabling tiered storage?
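
To illustrate the per-topic enablement point above, here is a hedged Java sketch that turns tiering on for a single, example topic after the broker-side remote log components are enabled. It assumes the topic-level config name remote.storage.enable (config names were still being finalized at the time of this discussion) and uses the standard AdminClient incrementalAlterConfigs API; the topic name "payments" and the bootstrap address are placeholders.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class EnableTieringForTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Tiering is opted into per topic; enabling the broker-side remote log
            // components does not by itself make every topic eligible for tiering.
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "payments");
            Collection<AlterConfigOp> ops = List.of(new AlterConfigOp(
                    new ConfigEntry("remote.storage.enable", "true"), AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}
```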


    2. RLMM persistence format:

    • We agreed to document the persistence format for the materialized state of default RLMM implementation (topic-based).
    • (carried over from the earlier discussion) For the file-based design, we don't yet know the percentage increase in memory usage, assuming the majority of segments are in remote storage. It will be useful to document an estimate for this.


    3. Topic deletion lifecycle discussion:

    • Under the topic deletion lifecycle, step (4), it would be useful to mention how the RemotePartitionRemover (RPM) gets the list of segments to be deleted, and whether it has any dependency on the RLMM topic.


    4. Log layer discussion:

    • We discussed the complexities surrounding making code changes to the Log layer (Log.scala).
    • Today, the Log class holds attributes and behavior related to the local log. In the future, we would have to change the Log layer so that it also contains the logic for the tiered portion of the log. This addition can pose a maintenance challenge.
    • Some of the existing attributes in the Log layer, such as LeaderEpochCache and ProducerStateManager, can also be related to the global view of the log (i.e., the global log is the local log plus the tiered log). It can therefore be useful to think about preparatory refactoring, to see whether we can separate responsibilities related to the local log from those of the tiered log, and perhaps provide a global view of the log that combines the two as and when required. The global view of the log could manage the lifecycle of LeaderEpochCache and ProducerStateManager.
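
As a thought experiment for the preparatory refactoring discussed above, the following Java sketch shows one possible split of responsibilities: a local-log view, a tiered-log view, and a global view that composes the two and owns shared state such as the leader epoch cache and producer state. All names here are hypothetical; the actual code lives in Scala (Log.scala) and any real refactoring would differ in detail.

```java
/**
 * Illustrative-only sketch of separating local and tiered log responsibilities.
 * Names are hypothetical; they do not correspond one-to-one with Kafka classes.
 */
interface LocalLogView {
    long localLogStartOffset();
    long logEndOffset();
    // read, append, segment roll/delete for the on-disk portion ...
}

interface TieredLogView {
    long tieredLogStartOffset();
    long highestOffsetInRemoteStorage();
    // lookups against remote segment metadata (RLMM) ...
}

/** Global view: local log + tiered log, plus ownership of shared state. */
final class GlobalLogView {
    private final LocalLogView localLog;
    private final TieredLogView tieredLog;
    // Shared state that spans both portions of the log would live here, e.g. the
    // lifecycles of the leader epoch cache and the producer state manager.

    GlobalLogView(LocalLogView localLog, TieredLogView tieredLog) {
        this.localLog = localLog;
        this.tieredLog = tieredLog;
    }

    /** The true log start offset is the earliest offset across both portions. */
    long logStartOffset() {
        return Math.min(tieredLog.tieredLogStartOffset(), localLog.localLogStartOffset());
    }

    long logEndOffset() {
        return localLog.logEndOffset();
    }
}
```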

    Follow-ups:
    • KIP-405 updates (upgrade path, RLMM file format and topic deletion)
    • Log layer changes


    (Notes taken by Kowshik)


  •  Discussion Recording
  • Notes
  • Satish discussed KIP-405 updates:

    1. Addressed some of the outstanding review comments from previous weeks.
    2. Remote log manager (RLM) cache configuration was added.
    3. Updated default values in the KIP for certain configuration parameters.
    4. RLMM committed offsets are stored in separate files.
    5. Initial version: go ahead with the in-memory RLMM materializer implementation for now. A future switch to RocksDB seems feasible since it is an internal change only to the RLMM cache (see the sketch after these notes).
    6. Yet to update the KIP with KIP-516 (topic ID) changes.
    7. Tiered storage upgrade path details are a work-in-progress. Will be added to the KIP.
    RLMM cache choice: RocksDB-based vs file-based:
    1. Harsha/Satish didn't see significant improvement in performance when they tried RocksDB in their prototype.
    2. Other advantages of RocksDB were discussed - snapshots, tooling, checksums etc.
    3. As for the file-based design, we don't yet know the percentage increase in memory usage, assuming the majority of segments are in remote storage.
    4. Currently, a single file-based implementation for the whole broker is being considered. But this may have some issues, so it can be useful to consider a file per partition.
    5. More details need to be added to the KIP on file management, metadata operations, the persisted data format, and estimates of memory usage.
    Topic ID:
    1. KIP-516 PR may land by end of the year, so we should be able to use it in KIP-405.
    2. Satish to update KIP with details.
    (Notes taken by Kowshik)
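
The following is a minimal sketch of what the in-memory RLMM materializer mentioned in point 5 above could look like. The class and method names (InMemorySegmentIndex, segmentContaining) are invented for illustration; the idea is that a sorted map keyed by segment base offset gives the floor-entry lookup needed to find the remote segment containing a fetch offset, and that swapping this map for a RocksDB-backed store would be internal to the RLMM cache, which is why a later switch was considered feasible.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

/**
 * Hypothetical in-memory materializer for remote log segment metadata.
 * One sorted index per partition, keyed by segment base offset.
 */
public class InMemorySegmentIndex {

    /** Minimal stand-in for the per-segment metadata the RLMM would materialize. */
    public record SegmentInfo(long baseOffset, long endOffset, String remoteSegmentId) { }

    private final Map<String, ConcurrentSkipListMap<Long, SegmentInfo>> byPartition =
            new ConcurrentHashMap<>();

    /** Apply a metadata event (e.g. consumed from the internal RLMM topic). */
    public void addSegment(String topicIdPartition, SegmentInfo segment) {
        byPartition.computeIfAbsent(topicIdPartition, p -> new ConcurrentSkipListMap<>())
                   .put(segment.baseOffset(), segment);
    }

    /** Find the remote segment containing the given offset, if any. */
    public Optional<SegmentInfo> segmentContaining(String topicIdPartition, long offset) {
        ConcurrentSkipListMap<Long, SegmentInfo> index = byPartition.get(topicIdPartition);
        if (index == null) {
            return Optional.empty();
        }
        Map.Entry<Long, SegmentInfo> entry = index.floorEntry(offset);
        return entry != null && entry.getValue().endOffset() >= offset
                ? Optional.of(entry.getValue())
                : Optional.empty();
    }
}
```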

  •  Discussion Recording
  • Notes
    • Discussed that we can have a producer id snapshot for each log segment, which will be copied to remote storage. There is already a PR for KAFKA-9393 which addresses similar requirements.
    • Discussed the case where local data is not available on brokers, and whether it is possible to recover the state from remote storage.
    • We will update the KIP by early next week with
      • Topic deletion design proposed/discussed in the earlier meeting. This includes the schemas of remote log segment metadata events stored in the topic. 
      • Producer id snapshot for each segment discussion.
      • ListOffsets API version bump to support querying the offset for the earliest local timestamp.
      • Justifying the rationale behind keeping RLMM and the local leader epoch as the source of truth.
      • RocksDB instances as a cache for remote log segment metadata (a sketch follows below).
      • Any other missing updates planned earlier. 
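
Related to the RocksDB point above, here is a hedged sketch of a RocksDB-backed cache for remote log segment metadata, keyed by partition and base offset. It uses the RocksDB Java API directly; the RemoteSegmentCache name and the key/value encoding are assumptions made for illustration, not the KIP's design.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

/**
 * Hypothetical RocksDB-backed cache for remote log segment metadata.
 * Keys are "<topicIdPartition>:<baseOffset>"; values are opaque serialized metadata.
 */
public class RemoteSegmentCache implements AutoCloseable {

    static {
        RocksDB.loadLibrary();
    }

    private final RocksDB db;

    public RemoteSegmentCache(String path) throws RocksDBException {
        // A production implementation would also manage the Options lifecycle.
        Options options = new Options().setCreateIfMissing(true);
        this.db = RocksDB.open(options, path);
    }

    public void put(String topicIdPartition, long baseOffset, byte[] serializedMetadata)
            throws RocksDBException {
        db.put(key(topicIdPartition, baseOffset), serializedMetadata);
    }

    public byte[] get(String topicIdPartition, long baseOffset) throws RocksDBException {
        return db.get(key(topicIdPartition, baseOffset)); // null if absent
    }

    private static byte[] key(String topicIdPartition, long baseOffset) {
        byte[] partitionBytes = topicIdPartition.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(partitionBytes.length + 1 + Long.BYTES)
                .put(partitionBytes)
                .put((byte) ':')
                .putLong(baseOffset)
                .array();
    }

    @Override
    public void close() {
        db.close();
    }
}
```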

...