...

  1. When the producer metadata is removed from the ProducerStateManager on the broker due to retention, the next ProduceRequest from the client will arrive with the existing producer id and with a non-zero sequence. Currently this results in an OutOfOrderSequenceException returned by the broker, since the broker can't find any metadata and gets a non-zero sequence. This isn't strictly correct, and we propose introducing a new UnknownProducerException and returning this instead. 
  2. In most cases, the client can treat the UnknownProducerException as a non-fatal error, reinitialize the producer, and continue on its merry way.
  3. However, the above solution opens a hole: if the first write from the producer is actually lost (maybe due to a simultaneous power outage, multiple disk failures, etc.), we would not detect it. In particular, the first write with sequence=0 is written, but then the records are lost on the broker. The next write with sequence=N would get an UnknownProducerException and, with the protocol above, would simply be retried. Hence the fact that a message was lost would never be raised to the application.
  4. We can solve the situation in (3) by keeping track of the last ack'd offset on the producer, and also returning the log start offset in each ProduceResponse. With these two pieces of information, we can be sure that an UnknownProducerException is valid if the log start offset returned along with the error code is greater than the last ack'd offset. This means that the front of the log has been truncated, causing the producer to become unknown. In this case, there is no unwanted data loss and the last batch can simply be retried. If we get an UnknownProducerException but the log start offset is not greater than the last ack'd offset, then the producer's records were not removed by retention, so the loss is unexpected and should be treated as a fatal error.
  5. With the changes above, an OutOfOrderSequenceException would always mean real data loss. An UnknownProducerException may mean some data loss.
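
The decision rules in the list above can be sketched as follows. This is a minimal illustration, not the actual broker or client code: the names broker_validate, client_handle, Error and Action are hypothetical, sequence checking is reduced to "next sequence must be last + 1" (duplicates and wraparound are ignored), and the client is assumed to already hold a last ack'd offset for the partition.

```python
from enum import Enum
from typing import Optional


class Error(Enum):
    NONE = "NONE"
    OUT_OF_ORDER_SEQUENCE = "OUT_OF_ORDER_SEQUENCE"
    UNKNOWN_PRODUCER = "UNKNOWN_PRODUCER"


class Action(Enum):
    CONTINUE = "continue"
    RETRY = "reinitialize the producer and retry the batch"
    FATAL = "raise a fatal error to the application"


def broker_validate(last_sequence: Optional[int], sequence: int) -> Error:
    """Hypothetical broker-side check (item 1).

    last_sequence is None when the broker holds no metadata for the
    producer id, for example because retention removed it.
    """
    if last_sequence is None:
        # No metadata: only a fresh producer (sequence 0) is acceptable.
        return Error.NONE if sequence == 0 else Error.UNKNOWN_PRODUCER
    # Simplification: accept only the immediately following sequence.
    if sequence == last_sequence + 1:
        return Error.NONE
    return Error.OUT_OF_ORDER_SEQUENCE


def client_handle(error: Error, log_start_offset: int,
                  last_acked_offset: int) -> Action:
    """Hypothetical client-side interpretation (items 2, 4 and 5)."""
    if error is Error.NONE:
        return Action.CONTINUE
    if error is Error.UNKNOWN_PRODUCER and log_start_offset > last_acked_offset:
        # The front of the log was truncated past our last ack'd record,
        # so the producer became unknown through retention: safe to retry.
        return Action.RETRY
    # Either a real sequence gap, or an unknown producer that retention
    # cannot explain: surface it to the application.
    return Action.FATAL
```

For example, with a last ack'd offset of 42, an UNKNOWN_PRODUCER error carrying log start offset 100 leads to a retry, while the same error carrying log start offset 10 is treated as fatal.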

Level of Effort

  1. Client-side changes to track the last ack'd offset and correctly interpret an UnknownProducerException, either retrying it or raising it as an error - 1 day.
  2. Broker-side changes to raise the UnknownProducerException - 0.25 days.
  3. Updates to the protocol to return the logStartOffset per partition (with KIP) - 2 days.
  4. System tests + debugging - 2 days.

...