
Status

Current state: Under discussion

Discussion thread:

JIRA:

PR: https://github.com/apache/lucene-solr/pull/1100

Released: 

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast). Confluence supports inline comments that can also be used.

Motivation

Many common JVM resources such as CPUs, threads, file descriptors, heap, etc. are shared between multiple SolrCore-s within a CoreContainer.

Most of these resources can already be monitored using metrics. However, in most cases Solr doesn't have any control mechanism to actually do something about excessive use (or extreme under-utilization) of a resource by a particular SolrCore or CoreContainer. Furthermore, even when a control mechanism exists it's usually available only as a static configuration parameter (e.g. max cache size), and changing it requires at least a core reload or restarting the JVM.

This issue is especially important for multi-tenant applications where the admin cannot assume voluntary co-operation of users and needs more fine-grained tools to prevent DOS attacks, either accidental or purposeful.

The main scenario that prompted this development was a need to control the aggregated cache sizes across all cores in a CoreContainer in a multi-tenant (uncooperative) situation. However, it seems like a similar approach would be applicable for controlling other runtime usage of resources in a Solr node - hence the attempt to come up with a generic framework.

This SIP proposes the following:

  • adding a generic ResourceManager component to Solr, which will run at the CoreContainer level and will be able to monitor and enforce both global limits and a "fair" division of resources among competing SolrCore-s.
  • extending key existing components so that their resource consumption aspects can be dynamically controlled.
  • adding a number of management plugins that implement specific strategies for managing e.g. cache sizes according to the specified "fairness" policy and global limits.
  • adding an API for pool CRUD and for monitoring and controlling resource management (reading and setting pool and component limits).

High-level design overview

A resource manager instance is created for each CoreContainer and is initialized from a configuration file in ZooKeeper. Initialization includes creating default pools and setting their global limits.

A pool has a unique name and a type (e.g. "cache"), which is defined by the pool implementation. A pool defines total limits for the components it manages - e.g. a "searcherFieldValueCache" pool knows how to handle components of the SolrCache type; it manages all SolrCache instances responsible for field value caching across all SolrCore-s, and defines total limits for all of them. There can be multiple pools of the same type (e.g. "cache") under different names and with different parameters (total limits, schedule, etc.), each managing a different set of components. Pool configuration specifies the initial limits as well as the interval between management runs - the resource manager is responsible for executing each pool's management at the specified intervals.

Limits are expressed as arbitrary name / value pairs that make sense for the specific pool implementation - e.g. for a "cache" pool type the supported limits are "maxRamMB" and "maxSize". By convention, pool limits use the same names as the component limits (controlled parameters - see below).

A pool manages the components registered in it. A component can be registered in at most one pool of a given type - e.g. a SolrCache component can only be registered in one pool of the "cache" type. Pool management runs periodically according to a per-pool schedule. Implementation-specific logic in the pool makes decisions about adjusting the limits of individual components so that the total resource consumption stays within the pool limits.

Typically it's the responsibility of the component's creator to register and unregister it with a specific pool. E.g. SolrIndexSearcher registers all its caches in their respective pools, and unregisters them on close.

Managed components report two different sets of values: controlled parameters (component limits) and monitored values. Monitored values represent the actual current resource usage (e.g. the current cache size), while controlled parameters represent adjustable component limits (e.g. "maxSize"). These two sets are different because usually we can't control the actual resource usage directly (e.g. the current cache size depends on random usage patterns) - we can only control the adjustable limits and then observe the impact of these changes on the actual resource usage.
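To make this distinction concrete, here is a minimal sketch of what a managed component could expose (the names below are illustrative only and do not match the actual interfaces in the PR):

```java
import java.util.Map;

// Illustrative sketch only - the real ManagedComponent API in the PR differs.
public interface ManagedCacheExample {
  // Monitored values: the actual current resource usage, read-only.
  Map<String, Object> getMonitoredValues();  // e.g. "size", "ramBytesUsed", "hitratio"

  // Controlled parameters (component limits): adjustable by the pool.
  Map<String, Object> getResourceLimits();   // e.g. "maxSize", "maxRamMB"

  void setResourceLimit(String limitName, Object value);
}
```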

A public API is provided under /admin/resources. It supports inspecting pools and components as well as setting pool limits and component limits.

Control vs. optimization

The framework should support two aspects of resource management:

  • hard limit control - ensures that all available resources are used and the usage does not exceed total limits. E.g. for all caches managed by the "cache" pools this means setting the "maxSize" (or "maxRamMB") of each cache so that the total value is equal to the total limit specified by the pool configuration.
  • optimization - attempts to optimize resource usage based on a deeper understanding of the component specifics. E.g. for all caches managed by the "cache" pools this means adjusting the cache size based on the amount of available resources and the current hit ratio of the cache, by expanding the caches with low hit ratio (to increase the likelihood of caching useful values) and shrinking the caches with high hit ratio (because a lower hit ratio may still be acceptable while saving memory).

Use case stories

Story 1: controlling global cache RAM usage in a Solr node

SolrIndexSearcher caches are currently configured statically, using either item count limits or maxRamMB limits. We can only specify limits per cache, and then cap the number of cores in a node to arrive at a hard total upper limit.

However, this is not enough, because it keeps the heap at the upper limit even when the actual consumption by caches may be far lower. A more active core should be able to use more heap for caches than a core with less traffic (the optimization aspect), while the total heap usage of caches must never exceed the given threshold, to ensure proper behavior of the Solr node (the control aspect).

In order to do this we need a control mechanism that is able to adjust individual cache sizes per core, based on the total hard limit and the actual current "need" of a core, defined as a combination of hit ratio, QPS, and other arbitrary quality factors / SLA. This control mechanism also needs to be able to forcibly reduce excessive usage (evenly? prioritized by collection's SLA?) when the aggregated heap usage exceeds the threshold.

In terms of the proposed API this scenario would work as follows:

  • global resource pools "searcher*Pool" are created with a hard limit on e.g. total maxRamMB.
  • these pools know how to manage components of the "cache" type - what parameters to monitor and what parameters to use in order to control their resource usage. This logic is encapsulated in the CacheManagerPool implementation.
  • all searcher caches from all cores register themselves in these pools for the purpose of managing their "cache" aspect.
  • the pools are executed periodically to check the current resource usage of all registered caches (monitored values), using e.g. the aggregated value of ramBytesUsed.
  • if this aggregated monitored value exceeds the total maxRamMB limit configured for the pool, the plugin adjusts the maxRamMB setting of each cache in order to reduce the total RAM consumption - currently this uses a simple proportional formula without any history (the P part of PID), with a dead-band in order to avoid thrashing (see the sketch after this list).
  • as a result of this action some of the cache content will be evicted sooner and more aggressively than initially configured, thus freeing more RAM.
  • when the memory pressure decreases, the CacheManagerPool may expand the maxRamMB settings of each cache up to a multiple of the initially configured values. This is the optimization part.
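As a rough illustration of the proportional adjustment with a dead-band, consider the following sketch (the AdjustableCache interface, the 10% dead-band and the exact formula are assumptions for illustration, not the PR's actual code):

```java
import java.util.List;

// Hedged sketch of the proportional ("P") adjustment with a dead-band.
public class ProportionalControlSketch {
  interface AdjustableCache {
    long ramBytesUsed();
    double getMaxRamMB();
    void setMaxRamMB(double maxRamMB);
  }

  private static final double DEAD_BAND = 0.1; // ignore overshoot smaller than 10%

  public static void adjust(List<AdjustableCache> caches, double totalMaxRamMB) {
    double totalUsedMB = caches.stream()
        .mapToLong(AdjustableCache::ramBytesUsed).sum() / (1024.0 * 1024.0);
    double overshoot = (totalUsedMB - totalMaxRamMB) / totalMaxRamMB;
    if (overshoot < DEAD_BAND) {
      return; // under the limit or within the dead-band - do nothing, avoid thrashing
    }
    // Scale every cache's limit by the same factor so that the aggregated
    // usage falls back under the pool limit.
    double scale = totalMaxRamMB / totalUsedMB;
    for (AdjustableCache cache : caches) {
      cache.setMaxRamMB(cache.getMaxRamMB() * scale);
    }
  }
}
```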

Story 2: controlling global IO usage in a Solr node

Similarly to the scenario above, currently we can only statically configure merge throttling (RateLimiter) per core but we can't monitor and control the total IO rates across all cores, which may easily lead to QoS degradation of other cores due to excessive merge rates of a particular core.

Although RateLimiter parameters can be dynamically adjusted, this functionality is not exposed, and there's no global control mechanism to ensure "fairness" of allocation of available IO (which is limited) between competing cores.

In terms of the proposed API this scenario would work as follows:

  • a global resource pool "mergeIOPool" is created with a single hard limit maxMBPerSec, picked as a fraction of the available hardware capabilities that still provides acceptable performance.
  • this pool knows how to manage components of the "mergeIO" type. It monitors their current resource usage (using SolrIndexWriter metrics) and knows how to adjust each core's ioThrottle. This logic is encapsulated in MergeIOManagerPool (which doesn't exist yet).
  • all SolrIndexWriter-s in all cores register themselves in this pool for the purpose of managing their "mergeIO" aspect.

The rest of the scenario is similar to Story 1. As a result of the pool's adjustments, the merge IO rate of some cores may be decreased or increased according to the total available IO in the pool.

Public Interfaces

  • ResourceManager - base class for resource management. Only one instance of a resource manager is created per Solr node (CoreContainer).
    • DefaultResourceManager - default implementation.
  • ResourceManagerPoolFactory - base class for creating type-specific pool instances.
    • DefaultResourceManagerPoolFactory - default implementation, containing default registry of pool implementations (currently just cache → CacheManagerPool).
  • ResourceManagerPool - base class for managing components in a pool.
    • CacheManagerPool - pool implementation specific to cache resource management.
  • ChangeListener - listener interface for component limit changes. Pools report any changes to their managed components' limits via this interface.
  • ManagedComponent - interface for components to be managed
  • ManagedComponentId - hierarchical unique component ID
  • SolrResourceContext - component's context that helps to register and unregister the component from its pool(s)
  • ResourceManagerHandler - public API for pool operations (CRUD) and resource operations (RUD)

CacheManagerPool implementation

This pool implementation manages SolrCache components, and it supports "maxSize" and "maxRamMB" limits.

The control algorithm consists of two phases:

  • hard limit control - applied only when the total monitored resource usage exceeds the pool limit. In this case the controlled parameters of all components are evenly and proportionally reduced by the ratio of actual usage to the total limit.
  • optimization - performed only when the total limit is not exceeded, because optimization may not only shrink but also expand cache sizes, which would make an over-limit situation worse. Optimization uses the hit ratio to decide whether to shrink or expand each cache individually, while still staying within the total resource limits (see the sketch below).
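A minimal sketch of the optimization decision for a single cache, assuming the 0.6 threshold from the text (the expand/shrink factors and all names are invented for illustration):

```java
// Illustrative sketch of the optimization phase for a single cache; the 0.6
// threshold matches the text, the expand/shrink factors are made-up examples.
public class HitRatioOptimizerSketch {
  private static final double TARGET_HIT_RATIO = 0.6;
  private static final double EXPAND_FACTOR = 1.1; // grow by 10%
  private static final double SHRINK_FACTOR = 0.9; // shrink by 10%

  // Returns the proposed new maxSize for one cache (before pool-level clamping).
  public static long proposeMaxSize(long currentMaxSize, double hitRatio) {
    if (hitRatio < TARGET_HIT_RATIO) {
      // Low hit ratio: expand, to increase the likelihood of caching useful values.
      return (long) (currentMaxSize * EXPAND_FACTOR);
    }
    // High hit ratio: shrink - a somewhat lower hit ratio is still acceptable
    // if it saves memory.
    return (long) (currentMaxSize * SHRINK_FACTOR);
  }
}
```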

Some background on hit ratio vs. cache size: the relationship between cache size and hit ratio is positive and monotonic, i.e. a larger cache size leads to a higher hit ratio (an extreme example would be an unlimited cache, which has perfect recall because it keeps all items). On the other hand, there's a point where increasing the size yields diminishing returns in terms of hit ratio, once we also consider the cost of the resources the cache consumes. So there's a sweet spot where the hit ratio is still "good enough" but the resource consumption is minimized. In the proposed PR this hit ratio threshold is 0.6, which is probably too high for realistic loads (should we use something like 0.4?).

Hit ratio, by definition, is an average outcome of several trials of a stochastic process. For this average to have a desired confidence there's a minimum number of trials (samples) needed. The PR estimates the accuracy of the measured hit ratio as 0.5 / sqrt(lookups), so the default minimum of 100 lookups corresponds to roughly 5% accuracy. If there are fewer lookups between adjustments, the current hit ratio cannot be determined with enough confidence and the optimization is skipped.
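The arithmetic behind this check, as a small sketch (class and method names are hypothetical):

```java
// The check implied by the formula above: the accuracy of the measured hit
// ratio is estimated as 0.5 / sqrt(lookups), so requiring 5% accuracy means
// at least (0.5 / 0.05)^2 = 100 lookups between adjustments.
public class HitRatioConfidenceSketch {
  public static boolean hitRatioIsReliable(long lookups, double targetAccuracy) {
    return lookups > 0 && 0.5 / Math.sqrt((double) lookups) <= targetAccuracy;
  }
}
```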

Maximum possible adjustments are bounded by a maxAdjustRatio (by default 2.0). This means that the pool can grow or shrink each managed cache at most by this factor as compared to the initially configured limit. This functionality prevents the algorithm from ballooning or shrinking the cache indefinitely for very busy or very idle caches.
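As a sketch of what this bound means in practice (names are hypothetical):

```java
// Sketch of the bound implied by maxAdjustRatio: with the default 2.0, a cache
// initially configured with maxSize=512 can only be adjusted within [256, 1024].
public class AdjustmentBoundsSketch {
  public static double clamp(double initialLimit, double proposedLimit, double maxAdjustRatio) {
    double upper = initialLimit * maxAdjustRatio;
    double lower = initialLimit / maxAdjustRatio;
    return Math.max(lower, Math.min(upper, proposedLimit));
  }
}
```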

(This algorithm is in fact a very simple PID controller, though without the I and D factors yet.)

Resource management and component life-cycle

Components are created outside the scope of this framework, and their creators may then register them with the framework (using the ManagedComponent.initializeManagedComponent(...) method). From that point on the component is managed by at least one pool. When a component's close() method is called, its SolrResourceContext is responsible for unregistering the component from all pools - for this reason it's important to always call super.close() (or ManagedComponent.super.close()) in component implementations; failure to do so may result in object leaks.
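A hedged sketch of the close/unregister contract, using stand-in names (only the requirement to call ManagedComponent.super.close() comes from the proposal; everything else here is illustrative):

```java
import java.io.IOException;

// Minimal stand-in for the contract described above; names are illustrative.
interface ResourceContextSketch {
  void unregisterFromAllPools();
}

interface ManagedComponentSketch extends AutoCloseable {
  ResourceContextSketch getResourceContext();

  @Override
  default void close() throws IOException {
    // Unregisters this component from all its pools. Implementations that
    // override close() must call ManagedComponentSketch.super.close(),
    // otherwise the pools keep a stale reference and the component leaks.
    getResourceContext().unregisterFromAllPools();
  }
}
```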

Components are always identified by a unique component ID, specific to each instance of a component, because there may be multiple instances of the same component under the same logical path. This is similar to the model that already works well with complex Solr metrics (such as gauges), where the life-cycles of logically identical metrics often overlap. E.g. when re-opening a searcher, a new instance of SolrIndexSearcher is created while the old one remains open for some time. The new instance proceeds to register its caches as managed components (the respective pools then correctly reflect the fact that there's suddenly a spike in resource usage, because the old searcher is not closed yet). After a while the old searcher is closed, at which point it unregisters its old caches from the framework, which again correctly reflects the fact that some resources have been released.

Proposed Changes

Internal changes

Framework and pool bootstraps

CoreContainer creates and initializes a single instance of ResourceManager in its load() method. This instance is configured using a new section in /clusterprops.json/poolConfigs. Several default pools are always created (at the moment they are all related to SolrIndexSearcher caches) but their parameters can be customized using the clusterprops.

SolrIndexSearcher.register() now also registers all its caches in their respective pools and unregisters them on close().

Other changes

  • SolrMetricsContext is now, as a rule, created for each child component, and it also includes the component's metric names and scope. This simplifies managing metrics and obtaining metric snapshots - and it was needed in order to construct fully-qualified component IDs for the resource API.
  • SolrCache.warm(...) also resets the limits (such as maxSize and maxRamMB) using the old cache's limits - this preserves custom limits from the old instance when a new instance replaces it.

User-level APIs

Config files

The only change in configuration is a new optional section in /clusterprops.json/poolConfigs. This section contains a map of pre-defined pools and their initial limits and properties.
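For illustration, the poolConfigs section could look roughly like this (a hypothetical shape - the actual key names and structure are defined by the PR):

```json
{
  "poolConfigs": {
    "searcherFieldValueCache": {
      "type": "cache",
      "limits": { "maxRamMB": 256 },
      "params": { "scheduleDelaySeconds": 10 }
    }
  }
}
```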

Remote API

There's a new handler ResourceManagerHandler accessible at /admin/resources (v1 API only for now - I plan to add v2 API once the functionality stabilizes).

NOTE: Persisting the changes is not yet implemented in the PR. Also, the handler supports only local operations - it needs to be modified to dispatch operations to all live nodes.

Operations that select named items (pools or resources) all treat the name as a prefix, i.e. the selected items are those that match the prefix provided in the name parameter.

The following operations are supported:

  • Pool operations
    • LIST - lists selected pools and their limits and parameters.
    • STATUS - lists selected pools and their components, and current total resource consumption per pool.
    • CREATE - create a new pool, with the provided limits (limit.<name>=<value>) and parameters (param.<name>=<value>).
    • DELETE - delete an existing pool (and unregister its components)
    • SETLIMITS - set, modify or delete existing pool(s) limits
    • SETPARAMS - set, modify or delete existing pool(s) parameters
  • Resource operations
    • LIST - list components in specified pool(s) and their current resource limits
    • STATUS - list components in specified pool(s) and their current limits and their current monitored values
    • GETLIMITS - get the current limits of specified component(s)
    • SETLIMITS - set the current limits of specified component(s)
    • DELETE - unregister specified components from the pool(s) 
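A hypothetical usage illustration (the action and pool parameter names are assumptions based on the operation descriptions above, not a confirmed request format):

```sh
# Hypothetical v1 calls - parameter names are assumptions:
curl "http://localhost:8983/solr/admin/resources?action=LIST"
curl "http://localhost:8983/solr/admin/resources?action=CREATE&pool=myCachePool&type=cache&limit.maxRamMB=512"
curl "http://localhost:8983/solr/admin/resources?action=SETLIMITS&pool=myCachePool&limit.maxRamMB=256"
```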

Compatibility, Deprecation, and Migration Plan

This is a new feature so there's no deprecation involved. Some of the internal Java APIs are modified but this functionality is scheduled to be included in Solr 9.0 so we can break back-compat if needed.

Users can migrate to this framework gradually by specifying concrete resource limits in place of the defaults - the default settings create unlimited pools for searcher caches so the back-compat behavior remains the same.


Test Plan

An integration test TestCacheDynamics has been created to show the behavior of cache resource management under changing resource constraints. Obviously more tests are needed on a real cluster.

Rejected Alternatives

