...

The broker will still build the time-based index using LogAppendTime. LogAppendTime will exist only in the time index file, not in the message format, i.e. it is not exposed to users.

Change time based log retention and log rolling to use LogAppendTime in index file

Time-based log retention and log rolling still need to use LogAppendTime. Because the leader is the source of truth for LogAppendTime, followers have to replicate the time index file as well when they fetch data from the leader.

The broker will append a time index file entry for a message when:

  1. The message is the first message of a log segment.
  2. The message is the last message of a log segment.
  3. The message is the first message received in the minute.
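The three conditions above can be sketched as a single append decision. This is only an illustration: the `TimeIndex` class, its `maybe_append` method, and the `first_in_segment` / `last_in_segment` flags are hypothetical names, not part of the broker, and the sketch assumes the caller already knows whether a message opens or closes a segment.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

MS_PER_MINUTE = 60_000


@dataclass
class TimeIndex:
    """Hypothetical in-memory stand-in for one segment's time index file."""
    # Each entry is (log_append_time_ms, offset).
    entries: List[Tuple[int, int]] = field(default_factory=list)
    # Minute bucket of the latest LogAppendTime seen so far.
    last_minute: Optional[int] = None

    def maybe_append(self, log_append_time_ms: int, offset: int,
                     first_in_segment: bool = False,
                     last_in_segment: bool = False) -> bool:
        """Append an entry if any of the three conditions holds."""
        minute = log_append_time_ms // MS_PER_MINUTE
        first_in_minute = self.last_minute is None or minute > self.last_minute
        if first_in_minute:
            self.last_minute = minute
        if first_in_segment or last_in_segment or first_in_minute:
            self.entries.append((log_append_time_ms, offset))
            return True
        return False


index = TimeIndex()
index.maybe_append(120_000, 0, first_in_segment=True)  # condition 1
index.maybe_append(125_000, 1)                         # same minute, no entry
index.maybe_append(180_000, 2)                         # condition 3: new minute
index.maybe_append(185_000, 3, last_in_segment=True)   # condition 2
```

With these rules the index stays sparse: within a segment it grows by at most one entry per minute, plus the two boundary entries.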

Let replicas also fetch the log index file

Because LogAppendTime is not included in the message format, followers cannot get the LogAppendTime from the leader under the current replication design. To make the log retention and log rolling policies work, the LogAppendTime needs to be propagated from the leader to the followers.

In this option, the LogAppendTime only exists in the time index file; therefore, when followers fetch data from the leader, they have to replicate the time index file as well.

There are a few requirements here:

  1. Unlike the log index file, the time index file should not be rebuilt from the local log after a crash; it should always be fetched from the current leader, just like the actual data. Otherwise the time index may differ across replicas.
  2. To ensure the log segments are identical on both leader and followers, we should always have a time index entry for the first message in a log segment.
  3. In order to make time-based log retention work, we need a timestamp entry for the last message in a log segment.
  4. When we truncate the log segment file, we need to truncate the time index file as well.
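As an illustration of the truncation requirement, the sketch below drops time index entries together with the log. It assumes the index is a list of `(log_append_time_ms, offset)` pairs sorted by offset, and that truncating to an offset removes every entry at or beyond it; the function name `truncate_time_index` is hypothetical.

```python
from bisect import bisect_left
from typing import List, Tuple

# (log_append_time_ms, offset), sorted by offset
Entry = Tuple[int, int]


def truncate_time_index(entries: List[Entry], target_offset: int) -> List[Entry]:
    """Keep only the entries whose offset is below target_offset.

    Assumption: this mirrors a log truncation that discards every
    message with offset >= target_offset.
    """
    offsets = [offset for _, offset in entries]
    return entries[:bisect_left(offsets, target_offset)]


entries = [(1_000, 0), (61_000, 5), (121_000, 9)]
kept = truncate_time_index(entries, 6)  # log truncated to offset 6
```

After such a truncation the last surviving entry may no longer point at the last message of the segment, so a new entry would be needed when the segment is eventually closed.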

One way to achieve the above goals is to treat the time index as a special partition that is associated with the actual partition. When the leader receives a fetch request from the follower, it will include the data from this companion partition. More specifically, we may do the following:

  1. Make the time-index-partition number the bitwise complement ~(PositivePartitionNumber) (0 -> 0xffffffff, 1 -> 0xfffffffe, etc.)
  2. When the leader sees a fetch request from a follower, it will send back the (TopicPartition -> TimeIndexEntry) data in the fetch response. The TimeIndexEntry will contain the time index entries up to the follower's LEO.
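The partition numbering and the per-follower filtering above can be sketched as follows. The function names are hypothetical. Python integers are unbounded, so the sketch masks to 32 bits to reproduce the 0xffffffff pattern; in a language with 32-bit signed ints, `~0` is simply -1, which has the same bit pattern. Reading "up to the follower's LEO" as "offset strictly below the LEO" is an assumption.

```python
from typing import List, Tuple

MASK_32 = 0xFFFFFFFF


def time_index_partition(partition: int) -> int:
    """Map a non-negative partition number to its companion number,
    the 32-bit bitwise complement (0 -> 0xffffffff, 1 -> 0xfffffffe)."""
    if not 0 <= partition <= 0x7FFFFFFF:
        raise ValueError("partition must be a non-negative 32-bit int")
    return ~partition & MASK_32


def data_partition(companion: int) -> int:
    """Invert the mapping: recover the data partition number."""
    return ~companion & MASK_32


def entries_for_follower(entries: List[Tuple[int, int]],
                         follower_leo: int) -> List[Tuple[int, int]]:
    """Select the (log_append_time_ms, offset) entries to return.

    Assumption: only entries with offset < follower_leo are sent.
    """
    return [(ts, offset) for ts, offset in entries if offset < follower_leo]
```

Because the complement of a non-negative 32-bit number always has its top bit set, companion numbers cannot collide with regular partition numbers.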

Therefore, this approach does not solve the time-based log retention and log rolling issues that are the motivation of this KIP. We need to introduce a separate wire protocol to propagate the log segment create time and last modified time among brokers. While this is doable, we feel the additional complexity for replica fetching outweighs the concern about exposing the LogAppendTime to users, considering that the LogAppendTime is still useful to clients in a few use cases. Also, during a log recovery, the LogAppendTime values in the time index will be almost the same, and they will differ from the actual time when the messages arrived at the brokers.