...

Users don't typically need to look up offsets with seconds granularity.

...

Use case discussion

Mirror maker

The broker does not distinguish mirror maker from other producers. The following example shows what the timestamps look like when mirror maker is in the picture. (S - SendTime, R - ReceiveTime)

...

The ReceiveTime of a message in the source cluster and in the target cluster will be different. This is because: 1) log retention/rolling needs to be based on the server clock to provide a clear guarantee. 2) To support searching by timestamp, the ReceiveTime in a log file needs to be monotonically increasing, even when the messages in the target cluster came from different source clusters.
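As a rough illustration of the second point, the sketch below shows one way a broker could hand out ReceiveTime values that never go backwards within a log. The class name `ReceiveTimeAssigner`, the injected `clock_ms` callable, and the choice to clamp to the previous value (non-decreasing rather than strictly increasing) are all assumptions made for this example, not something the proposal specifies.

```python
from typing import Callable


class ReceiveTimeAssigner:
    """Hypothetical helper that stamps messages with a non-decreasing ReceiveTime."""

    def __init__(self, clock_ms: Callable[[], int]) -> None:
        self._clock_ms = clock_ms  # broker's local clock, in milliseconds
        self._last_ms = 0          # last ReceiveTime handed out

    def assign(self) -> int:
        # Assumption: if the local clock steps backwards, reuse the last value
        # so that timestamps in the log stay ordered.
        self._last_ms = max(self._last_ms, self._clock_ms())
        return self._last_ms
```

Because the value comes from the receiving broker's clock, the same message gets a different ReceiveTime in each cluster it passes through, regardless of which source cluster it came from.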

Log Retention

There are both pros and cons to basing log retention on SendTime or on ReceiveTime.

Use SendTime:

  • Good: The log retention will be associated with the message creation time, and in the ideal case it will not be affected by latency in the data pipeline because the SendTime does not change. The assumption here is that messages with similar SendTime will reach a cluster at around the same time.
  • Bad: When messages with different timestamps go into a cluster at around the same time, the retention policy is hard to follow. For example, imagine two mirror makers copying data from two source clusters to the same target cluster. If MirrorMaker1 is copying messages with SendTime around 1:00 PM today and MirrorMaker2 is copying messages with SendTime around 1:00 PM yesterday, those messages can go to the same log segment in the target cluster. It will be difficult for the broker to apply the retention policy to that log segment.
    1) The broker needs to track the latest SendTime of all messages in a log segment and persist the information somewhere.
    2) If there is a message with SendTime set in the future, the log might be kept for a very long time. The broker needs to sanity check the timestamp when it receives the message. It could be tricky to determine which timestamps are not valid.
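The two complications above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `Segment` class, the `is_plausible` and `is_expired` helpers, and the `max_future_skew_ms` threshold are hypothetical names invented for this example, and the proposal does not say how a real broker would implement either check.

```python
from dataclasses import dataclass

HOUR_MS = 60 * 60 * 1000


@dataclass
class Segment:
    """Hypothetical log segment that remembers the latest SendTime it holds."""
    max_send_time_ms: int = -1

    def append(self, send_time_ms: int) -> None:
        # Point 1: the broker has to track (and persist) the latest SendTime.
        self.max_send_time_ms = max(self.max_send_time_ms, send_time_ms)


def is_plausible(send_time_ms: int, broker_now_ms: int, max_future_skew_ms: int) -> bool:
    # Point 2: reject timestamps too far in the future. The threshold is an
    # assumption; choosing it well is exactly the tricky part.
    return send_time_ms <= broker_now_ms + max_future_skew_ms


def is_expired(segment: Segment, broker_now_ms: int, retention_ms: int) -> bool:
    # A segment can only be deleted once its newest message is past retention.
    return broker_now_ms - segment.max_send_time_ms > retention_ms


# "2:00 PM today" on an arbitrary timeline.
now_ms = 48 * HOUR_MS
mixed = Segment()
mixed.append(now_ms - 25 * HOUR_MS)  # MirrorMaker2: about 1:00 PM yesterday
mixed.append(now_ms - 1 * HOUR_MS)   # MirrorMaker1: about 1:00 PM today
```

With a 12-hour retention, `is_expired(mixed, now_ms, 12 * HOUR_MS)` is false even though half of the segment is 25 hours old, which is the mixed-segment problem described in the MirrorMaker example.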

Use ReceiveTime:

  • Good: The log retention policy is easy to enforce and it does not suffer from wrong client timestamps.
  • Bad: The log retention will be independent on each Kafka cluster in the pipeline. Because of the latency difference, some data can be deleted on one cluster but not on another cluster in the pipeline.
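A small worked example of the drawback, assuming a hypothetical `is_retained` helper and made-up numbers (30 minutes of pipeline latency, 60 minutes of retention):

```python
MINUTE_MS = 60_000


def is_retained(receive_time_ms: int, now_ms: int, retention_ms: int) -> bool:
    # Each cluster decides using only its own ReceiveTime for the message.
    return now_ms - receive_time_ms <= retention_ms


retention_ms = 60 * MINUTE_MS
source_receive_ms = 0               # message reaches the source cluster
target_receive_ms = 30 * MINUTE_MS  # mirror maker delivers it 30 minutes later
now_ms = 75 * MINUTE_MS

on_source = is_retained(source_receive_ms, now_ms, retention_ms)
on_target = is_retained(target_receive_ms, now_ms, retention_ms)
```

At the 75-minute mark the source cluster has already passed retention for the message while the target cluster still holds it, so the two clusters disagree on which data exists.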

Comparison:

The key issue here is the latency of messages flowing through the pipeline. The latency can be summarized by the following patterns:

  1. The messages flow through the pipeline with the same small latency.
  2. The messages flow through the pipeline with the same large latency.
  3. The messages flow through the pipeline with a small latency difference.
  4. The messages flow through the pipeline with a large latency difference.
              pattern 1   pattern 2   pattern 3   pattern 4
Preference    S = R       S > R       S = R       S < R

As we can see, it boils down to whether pattern 2 or pattern 4 is more likely. In reality, we rarely see

Leader migration

Suppose we have broker0 and broker1. Broker0 is the current leader of a partition and broker1 is a follower. Consider the following scenario:

...