...

The request purgatory consists of a timeout timer and a hash map of watcher lists for event-driven processing. A request is put into the purgatory when it cannot be satisfied immediately because of unmet conditions. A request in the purgatory is completed later when its conditions are met, or is forced to complete (timeout) when it passes the time specified in the timeout parameter of the request. Currently (0.8.x) it uses Java DelayQueue to implement the timer and Java LinkedList for the watcher lists.
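
As a rough sketch of the structure described above (class and field names here are illustrative, not the actual 0.8.x code), the purgatory pairs a DelayQueue-based timer with a map of watcher lists:

```java
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of the 0.8.x-style purgatory layout described above.
class DelayedRequest implements Delayed {
    final long deadlineMs;              // absolute expiration time
    volatile boolean completed = false;

    DelayedRequest(long timeoutMs) {
        this.deadlineMs = System.currentTimeMillis() + timeoutMs;
    }

    @Override public long getDelay(TimeUnit unit) {
        return unit.convert(deadlineMs - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
    }

    @Override public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.MILLISECONDS), other.getDelay(TimeUnit.MILLISECONDS));
    }
}

class SimplePurgatory {
    // Timeout timer: expired requests are polled from here by a reaper thread.
    private final DelayQueue<DelayedRequest> timer = new DelayQueue<>();
    // Watcher lists keyed by the condition the request is waiting on.
    private final Map<Object, LinkedList<DelayedRequest>> watchers = new HashMap<>();

    void watch(Object key, DelayedRequest request) {
        timer.add(request);
        watchers.computeIfAbsent(key, k -> new LinkedList<>()).add(request);
    }
}
```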

When a request is completed, it is not deleted from the timer or the watcher lists immediately. Instead, completed requests are deleted as they are found during condition checking. When the deletion does not keep up, the server may exhaust the JVM heap and hit an OutOfMemoryError. To alleviate this, the reaper thread purges completed requests from the purgatory when the number of requests in the purgatory (including both pending and completed requests) exceeds a configured number. The purge operation scans the timer queue and all watcher lists to find completed requests and deletes them.

...

The goal of the new design is to allow immediate deletion of a completed request and to reduce the load of the expensive purge process significantly. This requires cross-referencing entries in the timer and the requests. It is also strongly desired to have O(1) insert/delete cost, since an insert/delete operation happens for each request and each completion.

To satisfy these requirements, we propose a new purgatory implementation based on Hierarchical Timing Wheels and a new watcher list implementation based on WeakReference.

Hierarchical Timing Wheel

...

A major drawback of a simple timing wheel is that it assumes that a timer request falls within the interval of n * u from the current time. If a timer request is outside this interval, it is an overflow. A hierarchical timing wheel deals with such overflows. It is a hierarchically organized set of timing wheels. The lowest level has the finest time resolution; moving up the hierarchy, time resolutions become coarser. If the resolution of a wheel at one level is u and its size is n, the resolution of the next level should be n * u. At each level, overflows are delegated to the wheel one level higher. When the wheel at the higher level ticks, it reinserts timer tasks into the lower level. An overflow wheel can be created on demand. When a bucket in an overflow wheel expires, all tasks in it are reinserted into the timer recursively; the tasks are then moved to the finer-grained wheels or executed. The insert (start-timer) cost is O(m), where m is the number of wheels, which is usually very small compared to the number of requests in the system, and the delete (stop-timer) cost is still O(1).
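
The following is a minimal sketch of the hierarchy described above (not the actual implementation; names such as TimingWheel and add are illustrative). Each wheel has a tick size u and n buckets; a task that does not fit in the current wheel's interval is delegated to an on-demand overflow wheel whose tick is n * u:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of a hierarchical timing wheel.
class TimingWheel {
    private final long tickMs;            // u: time resolution of this wheel
    private final int wheelSize;          // n: number of buckets
    private final long interval;          // n * u: total time span of this wheel
    private long currentTime;             // rounded down to a multiple of tickMs
    private final List<List<Runnable>> buckets;
    private TimingWheel overflowWheel;    // created on demand

    TimingWheel(long tickMs, int wheelSize, long startMs) {
        this.tickMs = tickMs;
        this.wheelSize = wheelSize;
        this.interval = tickMs * wheelSize;
        this.currentTime = startMs - (startMs % tickMs);
        this.buckets = new ArrayList<>(wheelSize);
        for (int i = 0; i < wheelSize; i++) buckets.add(new ArrayList<>());
    }

    // Insert a task expiring at expirationMs; returns false if it has already expired.
    boolean add(Runnable task, long expirationMs) {
        if (expirationMs < currentTime + tickMs) {
            return false;                              // already expired: caller runs it instead
        } else if (expirationMs < currentTime + interval) {
            int bucket = (int) ((expirationMs / tickMs) % wheelSize);
            buckets.get(bucket).add(task);             // fits in this wheel
            return true;
        } else {
            if (overflowWheel == null) {               // delegate to the next (coarser) level
                overflowWheel = new TimingWheel(interval, wheelSize, currentTime);
            }
            return overflowWheel.add(task, expirationMs);
        }
    }

    // Advance the clock; a full implementation would also re-insert tasks from
    // expiring overflow buckets back into the lower wheels.
    void advanceClock(long timeMs) {
        if (timeMs >= currentTime + tickMs) {
            currentTime = timeMs - (timeMs % tickMs);
            if (overflowWheel != null) overflowWheel.advanceClock(currentTime);
        }
    }
}
```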

...

A timer task saves a link cell in itself when it is enqueued to a timer queue. When the task is completed or canceled, the list is updated using the link cell saved in the task itself.
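
A minimal sketch of the idea, assuming the buckets are doubly linked lists (TimerTaskEntry and cancel are illustrative names): the task keeps a pointer to its own list cell, so completion or cancellation can unlink the cell in O(1) without scanning the bucket.

```java
// Illustrative: a task holds a reference to its own list cell,
// so it can be unlinked from its bucket in O(1) when completed or canceled.
class TimerTaskEntry {
    TimerTask task;
    TimerTaskEntry prev, next;   // doubly linked within a bucket
}

class TimerTask {
    private TimerTaskEntry entry;   // link cell saved when the task is enqueued

    synchronized void setEntry(TimerTaskEntry e) { this.entry = e; }

    // Called on completion or cancellation: unlink without scanning the list.
    // A real implementation would use a sentinel head so the first element
    // can be unlinked the same way as any other element.
    synchronized void cancel() {
        if (entry != null) {
            if (entry.prev != null) entry.prev.next = entry.next;
            if (entry.next != null) entry.next.prev = entry.prev;
            entry = null;
        }
    }
}
```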

...

A simple implementation may use a thread that wakes up every unit of time and does the ticking, checking whether there is any task in the bucket. This can be wasteful if requests are sparse. We want the thread to wake up only when there is a non-empty bucket to expire. We will do so by using java.util.concurrent.DelayQueue, similar to the current implementation, but we will enqueue task buckets instead of individual tasks. This design has a performance advantage: the number of items in the DelayQueue is capped by the number of buckets, which is usually much smaller than the number of tasks, so the number of offer/poll operations on the priority queue inside the DelayQueue will be significantly smaller.
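
A sketch of the on-demand ticking described above (illustrative names; a full implementation would also handle bucket reuse and re-insertion from overflow wheels). Whole buckets are the elements of the DelayQueue, and the driver thread blocks until the earliest non-empty bucket expires:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Illustrative: a whole bucket, not an individual task, is the unit in the DelayQueue.
class TaskBucket implements Delayed {
    final long expirationMs;
    final List<Runnable> tasks = new ArrayList<>();

    TaskBucket(long expirationMs) { this.expirationMs = expirationMs; }

    @Override public long getDelay(TimeUnit unit) {
        return unit.convert(expirationMs - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
    }

    @Override public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.MILLISECONDS), other.getDelay(TimeUnit.MILLISECONDS));
    }
}

class TimerDriver implements Runnable {
    private final DelayQueue<TaskBucket> delayQueue = new DelayQueue<>();

    void enqueue(TaskBucket bucket) { delayQueue.offer(bucket); }

    @Override public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                // Blocks until the earliest bucket expires: no per-tick wakeups.
                TaskBucket bucket = delayQueue.take();
                for (Runnable task : bucket.tasks) task.run();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```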

Purge of Watcher Lists

In the current implementation, the purge operation on the watcher lists is triggered by the total size of the watcher lists. The problem is that the watcher lists may exceed the threshold even when there are not many requests to purge. When this happens, it increases the CPU load considerably. Ideally, the purge operation should be triggered by the number of completed requests in the watcher lists.

Removing completed requests from watcher lists is a major pain point: completed requests are not garbage collected if we do not remove them from the lists. Thus, we monitor the number of outstanding watchers and trigger the purge operation based on it. Unfortunately, tuning this threshold is not trivial for a large system.

We propose to use a list implementation based on WeakReference, where each list entry points to its element through a weak reference. Since the references are weak, a request pointed to by an entry can be reclaimed by the GC once there are no strong references to it in the system. While a request is waiting for completion, there is a strong reference to it from the timer; there should not be any other strong reference. Once the request completes, the strong reference is cleared and the request will eventually be reclaimed. A cleared reference is then seen as a null element in the list.
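
A minimal sketch of such a list (illustrative class names, not the actual code): each entry holds the request through a WeakReference registered with a shared ReferenceQueue, so a completed request can be reclaimed and its entry later appears as null.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Iterator;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative: a watcher list whose entries only weakly reference the requests.
class WeakWatcherList<T> implements Iterable<T> {
    private final ConcurrentLinkedQueue<WeakReference<T>> entries = new ConcurrentLinkedQueue<>();
    private final ReferenceQueue<T> refQueue;   // shared queue used by the purge trigger below

    WeakWatcherList(ReferenceQueue<T> refQueue) { this.refQueue = refQueue; }

    void add(T request) {
        // The only strong reference to a pending request is held by the timer;
        // once that is cleared, the GC may reclaim the request and clear this entry.
        entries.add(new WeakReference<>(request, refQueue));
    }

    @Override public Iterator<T> iterator() {
        // Callers see null for entries whose referent has been reclaimed.
        Iterator<WeakReference<T>> it = entries.iterator();
        return new Iterator<T>() {
            @Override public boolean hasNext() { return it.hasNext(); }
            @Override public T next() { return it.next().get(); }
        };
    }
}
```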

Although completed requests are garbage collected automatically, their entries remain in the watcher lists. How do we know when to purge the lists? One idea is to use ReferenceQueue: the JVM puts a reference object into its reference queue when the reference is cleared. By polling the reference queue, we know how many cleared weak references are potentially in the lists. We can do the polling and counting in ExpiredOperationReaper and tryCompleteWatched. ExpiredOperationReaper polls weak references from the reference queue and increments a counter until the queue is empty. If the final count is greater than some threshold, ExpiredOperationReaper resets the counter to zero and traverses the lists to remove all null entries. tryCompleteWatched, on the other hand, drains the reference queue at the beginning and sets the counter to zero. Then, just like the current implementation, it traverses the list and removes null entries as well as entries of completed requests. After a traversal, the list is clean except for any entries cleared in the meantime; such newly cleared entries will be detected through the reference queue.
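
A sketch of this trigger logic, under the assumption that cleared references are counted against a threshold (names like PurgeTrigger and cleanupThreshold are hypothetical):

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative: count cleared weak references to decide when to purge null entries.
class PurgeTrigger<T> {
    private final ReferenceQueue<T> refQueue = new ReferenceQueue<>();
    private final AtomicInteger clearedCount = new AtomicInteger(0);
    private final int cleanupThreshold;

    PurgeTrigger(int cleanupThreshold) { this.cleanupThreshold = cleanupThreshold; }

    ReferenceQueue<T> queue() { return refQueue; }

    // Called periodically (e.g. by the reaper): drain the queue and decide whether to purge.
    boolean shouldPurge() {
        Reference<? extends T> ref;
        while ((ref = refQueue.poll()) != null) {
            clearedCount.incrementAndGet();
        }
        if (clearedCount.get() >= cleanupThreshold) {
            clearedCount.set(0);   // the caller now traverses the lists and drops null entries
            return true;
        }
        return false;
    }

    // Called at the beginning of a full traversal (e.g. tryCompleteWatched):
    // the traversal itself removes null entries, so the pending count can be reset.
    void resetBeforeTraversal() {
        while (refQueue.poll() != null) { /* drain */ }
        clearedCount.set(0);
    }
}
```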


In the new design, a completed request is removed from the timer queue immediately with O(1) cost. This means that the number of requests in the timer queue is exactly the number of pending requests at any time. So, if we know the total number of distinct requests in the purgatory, which is the sum of the number of pending requests and the number of completed but still watched requests, we can avoid unnecessary purge operations. It is not trivial to keep track of the exact number of distinct requests in the purgatory because a request may or may not be watched. In the new design, we therefore estimate the total number of requests in the purgatory rather than trying to maintain the exact number.

The estimated number of requests is maintained as follows. The estimated total number of requests, E, is incremented whenever a new request is watched. Before starting a purge operation, we reset E to the size of the timer queue. If no requests are added to the purgatory during the purge, E is the correct number of requests after the purge. If some requests are added to the purgatory during the purge, E is incremented by the number of newly watched requests. This may be an overestimate, because some of the new requests may be completed and removed from the watcher lists during the purge operation; we expect both the chance and the amount of overestimation to be small.
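
A sketch of how the estimate E might be maintained (hypothetical names; timerQueueSize stands in for whatever reports the number of pending requests in the timer, and purgeInterval for the configured slack):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative: maintain an estimated total number of requests in the purgatory.
class PurgatorySizeEstimator {
    private final AtomicInteger estimatedTotal = new AtomicInteger(0);   // E

    // Called whenever a new request is added to a watcher list.
    void onWatch() { estimatedTotal.incrementAndGet(); }

    // Purge is needed only when the estimate exceeds the number of pending
    // requests (the timer queue size) by more than the allowed slack.
    boolean purgeNeeded(int timerQueueSize, int purgeInterval) {
        return estimatedTotal.get() - timerQueueSize > purgeInterval;
    }

    // Before the purge starts, reset E to the number of pending requests.
    // Requests watched while the purge is running increment E again,
    // which at worst overestimates the total slightly.
    void onPurgeStart(int timerQueueSize) { estimatedTotal.set(timerQueueSize); }
}
```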

Parameters

  • the tick size (the minimum time unit)
  • the wheel size (the number of buckets per wheel)
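
For example, following the hierarchy definition above, with a tick size of u = 1ms and a wheel size of n = 20, the lowest wheel covers 20ms, the second wheel covers 20 * 20ms = 400ms, the third covers 8 seconds, and in general m wheels cover n^m * u, so even long timeouts need only a handful of wheel levels.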

...

The tick size is 1ms and the wheel size is 20. The timeout was set to 200ms. The data size of a request was 100 bytes. For the low timeout rate case, we chose the 75th percentile = 60ms and the 50th percentile = 20ms; for the high timeout rate case, we chose the 75th percentile = 400ms and the 50th percentile = 200ms. A total of 1 million requests are enqueued in each run.

Requests are actively completed by a separate thread: requests that are supposed to be completed before their timeout are enqueued to another DelayQueue, and a separate thread keeps polling it and completing them. There is no guarantee of accuracy in terms of the actual completion time.

The JVM heap size is set to 200m to reproduce a memory-tight situation.

The result shows a dramatic difference at high enqueue rates. As the target rate increases, both implementations keep up with the requests initially. However, in the low timeout scenario the old implementation saturated around 40000 RPS (requests per second), whereas the proposed implementation did not show any significant performance degradation; in the high timeout scenario the old implementation saturated around 25000 RPS, whereas the proposed implementation saturated at 105000 RPS in this benchmark.

Another thing worth noting is that the current implementation degrades when timeouts happen more often, possibly because it has to purge the delay queue and the watcher lists more often. The proposed implementation shows a slight degradation only when the enqueue rate is very high.

CPU usage is significantly better in the new implementation.

Finally, we measured the total GC time (in milliseconds) for ParNew collection and CMS collection. There is not much difference between the old implementation and the new implementation in the region of enqueue rates that the old implementation can sustain.

Summary

In the new design, we use Hierarchical Timing Wheels for the timeout timer, a WeakReference-based list for the watcher lists, and a DelayQueue of timer buckets to advance the clock on demand. When a request is completed, we immediately remove it from its timer bucket with O(1) cost, and the strong reference to the request is cleared at that point. All remaining references are weak references from the watcher lists, so completed requests become GCable right away. The buckets remain in the delay queue; however, the number of buckets is bounded, and, in a healthy system, most requests are satisfied before timeout, so many of the buckets become empty before they are pulled out of the delay queue. Thus, the timer should rarely have buckets of the lower interval. A further advantage of this design is that the number of requests in the timer queue is exactly the number of pending requests at any time, which allows us to estimate the number of requests that need to be purged and to avoid unnecessary purge operations on the watcher lists. As a result, we expect an improvement in GC behavior, much lower CPU usage, and higher scalability in terms of request rate.