...

  1. Unlike the log index file, the time index file should not be rebuilt from the local log after a crash. It should always be fetched from the current leader, just like the actual data. Otherwise different replicas may end up with different time indexes.
  2. To ensure the log segments are identical on the leader and the followers, we should always have a time index entry for the first message in a log segment.
  3. To make time-based log retention work, we need a timestamp entry for the last message in a log segment.
  4. When we truncate the messages in the log segment files, we need to truncate the entries in the time index files as well.
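Goals 2 to 4 can be sketched in a few lines. This is only an illustration, not the broker implementation: the `TimeIndexEntry` class and both helper functions are hypothetical names, the entries are assumed to be sorted by offset, and truncating to an offset is assumed to remove every entry at or above that offset.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeIndexEntry:
    # Hypothetical in-memory form of one time index entry.
    log_append_time: int  # timestamp assigned when the message was appended
    offset: int           # offset of the message the entry points at


def truncate_time_index(entries, truncate_to_offset):
    """Goal 4: drop entries for messages removed by a log truncation.

    Assumes entries are sorted by offset and that truncation removes
    every message whose offset is >= truncate_to_offset.
    """
    offsets = [e.offset for e in entries]
    return entries[:bisect_left(offsets, truncate_to_offset)]


def has_boundary_entries(entries, first_offset, last_offset):
    """Goals 2 and 3: the segment's first and last messages are indexed."""
    return (
        bool(entries)
        and entries[0].offset == first_offset
        and entries[-1].offset == last_offset
    )
```

After a truncation the last remaining message generally no longer has an entry, so under these assumptions a new entry for it would have to be added to keep goal 3 satisfied.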

One way to achieve the above goals is to treat the time index as a special partition that is associated with the actual partition. When the leader receives a fetch request from a follower, it includes the data from this companion partition in the response. More specifically, we may do the following:

  1. Make the time-index-partition number the bitwise complement of the actual partition number, ~PositivePartitionNumber (0 -> 0xffffffff, 1 -> 0xfffffffe, etc.).
  2. When a leader sees a fetch request from a follower, the leader sends back the (TopicPartition -> TimeIndexEntry) data in the fetch response. The TimeIndexEntry data contains the entries up to the follower's LEO (log end offset).
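The two steps above can be sketched as follows. The function names are hypothetical and the entries are modelled as plain (log_append_time, offset) tuples. The sketch also assumes that "up to the follower's LEO" means entries whose offset is strictly below the LEO, which the text does not state explicitly.

```python
MASK_32 = 0xFFFFFFFF  # partition numbers are 32-bit values


def time_index_partition(partition):
    """Map a non-negative partition number to its companion partition number."""
    if not 0 <= partition <= 0x7FFFFFFF:
        raise ValueError("partition must be a non-negative int32")
    return ~partition & MASK_32


def data_partition(time_index_partition_number):
    """Inverse mapping: the complement of the complement is the original."""
    return ~time_index_partition_number & MASK_32


def entries_for_follower(entries, follower_leo):
    """Select the (log_append_time, offset) entries a follower should receive.

    Assumption: an entry qualifies when its offset is below the follower's LEO.
    """
    return [entry for entry in entries if entry[1] < follower_leo]
```

Read as a signed int32, the companion numbers are -1, -2, and so on, so they can never collide with a real partition number.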

To replicate the log index entry as well, we can add the log index entry to the FetchResponse, so the fetch response becomes:

FetchResponse format for replication
FetchResponse => [TopicName [Partition ErrorCode HighwaterMarkOffset MessageSetSize MessageSet [TimeIndexEntry]]]
  TopicName => string
  Partition => int32
  ErrorCode => int16
  HighwaterMarkOffset => int64
  MessageSetSize => int32
  TimeIndexEntry => LogAppendTime Offset <------------- new: the time index entries for the message set. One partition may contain multiple entries. The array is always empty if the fetch request is not from a follower.
    LogAppendTime => int64
    Offset => int32
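
To make the size of the new field concrete, here is a minimal sketch of how the [TimeIndexEntry] array might be serialized. It assumes big-endian encoding and an int32 element count in front of the array; neither is spelled out in the grammar above. The function names are hypothetical.

```python
import struct

# One entry: LogAppendTime (int64) followed by Offset (int32), big-endian.
ENTRY = struct.Struct(">qi")
# Assumed array prefix: the number of entries as an int32.
COUNT = struct.Struct(">i")


def encode_time_index_entries(entries):
    """Serialize a list of (log_append_time, offset) tuples."""
    parts = [COUNT.pack(len(entries))]
    parts.extend(ENTRY.pack(log_append_time, offset)
                 for log_append_time, offset in entries)
    return b"".join(parts)


def decode_time_index_entries(buf, pos=0):
    """Parse the array at pos; return (entries, position after the array)."""
    (count,) = COUNT.unpack_from(buf, pos)
    pos += COUNT.size
    entries = []
    for _ in range(count):
        entries.append(ENTRY.unpack_from(buf, pos))
        pos += ENTRY.size
    return entries, pos
```

Under these assumptions each entry costs 12 bytes, and a response to a non-follower fetch carries only the 4-byte count of zero.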

While this is doable, we feel the additional complexity it adds to replica fetching outweighs the concern about exposing the LogAppendTime to users, considering that the LogAppendTime is still useful to clients in a few use cases.