...
The request purgatory is a holding pen for requests waiting to be satisfied (Delayed). Of all Kafka request types, it is used only for Produce and Fetch requests. The main reasons for keeping requests in purgatory while waiting for them to be satisfied are:
- support for long-polling fetch requests; i.e. it keeps clients from repeatedly issuing fetch requests when no data is readily available (see the sketch after this list)
- avoiding blocking the server network threads and filling the request queue while waiting until the conditions to send produce and fetch responses are met
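For the fetch case, the long-polling decision boils down to whether enough data is already available to answer the request. A minimal sketch of the idea (parameter names are illustrative, not the broker's actual API):

```scala
// Minimal sketch of the long-polling decision for a fetch request; names are
// illustrative, not the broker's actual API. Instead of returning an empty
// response right away, the broker parks the request in purgatory until enough
// bytes accumulate or the wait time elapses.
def shouldDelayFetch(maxWaitMs: Int, minBytes: Int, bytesAvailable: Long): Boolean =
  maxWaitMs > 0 && bytesAvailable < minBytes

// Example: nothing to send yet and the client is willing to wait 500 ms,
// so the request goes to the fetch purgatory:
// shouldDelayFetch(maxWaitMs = 500, minBytes = 1, bytesAvailable = 0)  // true
```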
The produce and fetch request types have different conditions for being added to and removed from their respective purgatories. Accordingly, there is a separate purgatory implementation for Produce (ProducerRequestPurgatory) and Fetch (FetchRequestPurgatory) requests. Both extend the RequestPurgatory abstract class and add the request-type-specific checks for expiration and satisfaction.
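To make the structure concrete, here is a simplified sketch of that hierarchy; method names and signatures are illustrative, not copied from the Kafka source:

```scala
// Simplified sketch of the purgatory class hierarchy described above; names
// and signatures are illustrative, not the actual Kafka code.
class DelayedRequest(val keys: Seq[Any], val delayMs: Long)
class DelayedProduce(keys: Seq[Any], delayMs: Long) extends DelayedRequest(keys, delayMs)
class DelayedFetch(keys: Seq[Any], delayMs: Long) extends DelayedRequest(keys, delayMs)

abstract class RequestPurgatory[T <: DelayedRequest] {
  // Register a delayed request and start watching the keys it depends on
  // (e.g. topic/partitions).
  def watch(delayed: T): Unit = ()

  // Called when something happens for a key (new data, new acknowledgement);
  // returns the delayed requests that can now be answered.
  def update(key: Any): Seq[T] = Seq.empty

  // Request-type-specific check: can this delayed request be satisfied now?
  protected def checkSatisfied(delayed: T): Boolean

  // Request-type-specific handling of a request whose delay has expired.
  protected def expire(delayed: T): Unit
}

// Satisfied once the required number of replicas have acknowledged the produce.
class ProducerRequestPurgatory extends RequestPurgatory[DelayedProduce] {
  protected def checkSatisfied(delayed: DelayedProduce): Boolean = false
  protected def expire(delayed: DelayedProduce): Unit = ()
}

// Satisfied once enough bytes are available for the fetch (or it times out).
class FetchRequestPurgatory extends RequestPurgatory[DelayedFetch] {
  protected def checkSatisfied(delayed: DelayedFetch): Boolean = false
  protected def expire(delayed: DelayedFetch): Unit = ()
}
```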
...
The purge interval configuration (*.purgatory.purge.interval.requests) is mostly an "internal" config that generally doesn't need to be modified by users. The reasons why it was added are as follows:
- We found that for low-volume topics, replica fetch requests were getting expired but sitting around in purgatory
- This was because we were expiring them from the delay queue (used to track when requests should expire), but they were still sitting in the watcherFor map - i.e., they would only get purged when the next producer request to that topic/partition arrived. For low-volume topics this could take a long time (or never happen in the worst case), and we would eventually run into an OOME.
- So we needed to periodically go through the entire watcherFor map and explicitly remove those requests that had expired.
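A minimal sketch of that periodic sweep, with illustrative names (this is not the actual Kafka implementation; note that the purge interval is counted in requests rather than in time):

```scala
import scala.collection.mutable

// Minimal sketch of the periodic purge described above; names are illustrative.
case class Watched(key: String, deadlineMs: Long) {
  def isExpired(nowMs: Long): Boolean = nowMs >= deadlineMs
}

class PurgatorySweep(purgeIntervalRequests: Int) {
  // Delayed requests keyed by what they are waiting on (e.g. topic/partition),
  // i.e. the "watcherFor" map mentioned above.
  private val watcherFor = mutable.Map.empty[String, mutable.ListBuffer[Watched]]
  private var requestsSinceLastPurge = 0

  def watch(w: Watched): Unit = {
    watcherFor.getOrElseUpdate(w.key, mutable.ListBuffer.empty) += w
    requestsSinceLastPurge += 1
    // Without this sweep, an expired request on a low-volume key would stay in
    // the map until the next request for that key arrived -- possibly never.
    if (requestsSinceLastPurge >= purgeIntervalRequests) {
      purgeExpired(System.currentTimeMillis())
      requestsSinceLastPurge = 0
    }
  }

  private def purgeExpired(nowMs: Long): Unit = {
    for ((_, watched) <- watcherFor) {
      val expired = watched.filter(_.isExpired(nowMs)).toList
      watched --= expired
    }
    // Drop keys with no remaining watchers so the map cannot grow without bound.
    watcherFor.keys.toList.foreach(k => if (watcherFor(k).isEmpty) watcherFor.remove(k))
  }
}
```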
Info: More details on this are in KAFKA-664.
...
When is it added to purgatory (delayed)?
- The request uses ack=-1 (actually, anything other than 0 or 1),
- Partitions have more than one replica (otherwise ack=-1 is no different from ack=1 and a delayed request doesn't make much sense),
- Not all partitions are in error (see the sketch below)
Producer config: request.required.acks
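As a rough illustration only (the real logic lives in the broker's produce-request handling and uses different names), the decision could be sketched as:

```scala
// Sketch of the decision described in the list above; names are illustrative
// and not the actual broker code.
case class PartitionStatus(replicaCount: Int, inError: Boolean)

def shouldDelayProduce(requiredAcks: Int, partitions: Seq[PartitionStatus]): Boolean = {
  val waitsForReplicas    = requiredAcks != 0 && requiredAcks != 1   // e.g. request.required.acks = -1
  val hasReplicatedData   = partitions.exists(_.replicaCount > 1)    // otherwise -1 behaves like 1
  val somethingCanSucceed = partitions.exists(!_.inError)            // not all partitions are in error
  waitsForReplicas && hasReplicatedData && somethingCanSucceed
}

// Example: acks = -1 with one healthy partition replicated twice => delayed (true).
// shouldDelayProduce(-1, Seq(PartitionStatus(replicaCount = 2, inError = false)))
```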
...
To be verified
The same metric name is used for both Fetch and Producer. I should probably open a Jira for this, as it causes trouble for the CSV reporter: it tries to create a second file with the same name (for whichever metric comes first).
References
- Reason behind the cleanup interval: discussion in "Kafka server threads die due to OOME during long running test"
- Original user mailing list email that triggered the creation of this page