
Status

Current state: "Under Discussion"

Discussion thread: https://lists.apache.org/thread/n1hyo9wod5mqc02sh388dlzr2k29qmhn

JIRA:

Solr Operator Issue #536

Released: <Solr Version>

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast). Confluence supports inline comments that can also be used.

Motivation

The Solr Autoscaling framework was deprecated in later 8.x versions, and removed in the 9.0 release.

Given that Kubernetes has Autoscaling support built-in via the HorizontalPodAutoscaler, and Solr has official support for running on Kubernetes, it seems like a good fit for the next iteration of Solr Autoscaling.

This autoscaling implementation is not meant to replace every part of the removed Autoscaling framework.

The first goal is to support scaling up/down Solr Nodes and moving replicas after scale up/down to spread load evenly.

Public Interfaces

Solr Interfaces

A few new public interfaces are needed: one API addition, one API change, and one PlacementPlugin method.

API

Utilize new Node:

v1: (If we want a v1 API)

GET /solr/admin/collections?action=UTILIZENODE&node=node-name&sourceNodes=source-node-name-1,source-node-name-2

v2:

POST /api/cluster/nodes/localhost:7574_solr/utilize
{
  "sourceNodes": [],          // optional
  "waitForFinalState": false,
  "async": "async"
}

Replace Node: (A change to an existing option)

v1: (If we want a v1 API)

GET /solr/admin/collections?action=REPLACENODE&sourceNode=source-node&targetNode=target-node&targetNodes=target-node1,target-node2

The change is the addition of the optional targetNodes URL parameter, which replaces targetNode.

v2:

POST /api/cluster/nodes/localhost:7574_solr/replace
{
  "targetNodes": [],          // optional; replaces targetNodeName
  "waitForFinalState": false,
  "async": "async"
}

PlacementPlugin

public interface UtilizeSelectionRequest extends ModificationRequest {}

public interface PlacementPlugin {
  List<UtilizeSelection> computeUtilizeSelection(
      Collection<UtilizeSelectionRequest> utilizeSelectionRequests,
      PlacementContext placementContext)
      throws PlacementException, InterruptedException;
}

This method will compute a list of replicas to be moved from the sourceNodes to the targetNodes.

The UtilizeNode command can then take this list, create the new replicas using logic similar to the ReplaceNode command, and delete the old replicas afterwards.
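As a rough sketch of the selection idea, the following uses simplified stand-in types (a plain replica-count map and a hypothetical Move record) rather than Solr's real placement classes; an actual PlacementPlugin implementation would operate on UtilizeSelectionRequest and consult PlacementContext metrics instead of raw counts:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UtilizeSelectionSketch {

  /** Simplified stand-in for a replica movement (not Solr's real classes). */
  record Move(String replica, String fromNode, String toNode) {}

  /**
   * Hypothetical selection logic: pull replicas off the most-loaded source
   * nodes onto the new (empty) target node until the target holds roughly
   * its fair share. A real plugin would weigh disk usage and load, not
   * just replica counts.
   */
  static List<Move> computeUtilizeSelection(Map<String, List<String>> replicasByNode,
                                            String targetNode) {
    int totalReplicas = replicasByNode.values().stream().mapToInt(List::size).sum();
    int fairShare = totalReplicas / (replicasByNode.size() + 1); // + the new node

    // Mutable copy so we can track remaining replicas while selecting.
    Map<String, Deque<String>> remaining = new HashMap<>();
    replicasByNode.forEach((node, reps) -> remaining.put(node, new ArrayDeque<>(reps)));

    List<Move> moves = new ArrayList<>();
    while (moves.size() < fairShare) {
      // Pick the currently most-loaded source node.
      String busiest = remaining.entrySet().stream()
          .max(Comparator.comparingInt(
              (Map.Entry<String, Deque<String>> e) -> e.getValue().size()))
          .map(Map.Entry::getKey)
          .orElseThrow();
      if (remaining.get(busiest).isEmpty()) {
        break; // nothing left to move
      }
      moves.add(new Move(remaining.get(busiest).pop(), busiest, targetNode));
    }
    return moves;
  }

  public static void main(String[] args) {
    Map<String, List<String>> cluster = Map.of(
        "node1", List.of("r1", "r2", "r3", "r4"),
        "node2", List.of("r5", "r6"));
    // 6 replicas over 3 nodes -> the new node should receive 2 of them.
    computeUtilizeSelection(cluster, "node3")
        .forEach(m -> System.out.println(m.replica() + ": " + m.fromNode() + " -> " + m.toNode()));
  }
}
```

The greedy "take from the busiest node" loop mirrors how the existing ReplaceNode improvements spread replicas across the cluster, just in the opposite direction.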

Solr Operator Interfaces

SolrCloud CRD:

There are two options: the HPA is created either by the Solr Operator or by the user. We can support both.

spec:
  ...
  autoscaleReplicas:
    utilizeNodesOnScaleUp: true
    vacateNodesOnScaleDown: true
    hpa:
      create: true
      minimumNodes: 2
      maximumNodes: 10
      metrics:
        ...
  customSolrKubeOptions:
    horizontalPodAutoscalerOptions:
      behavior: ...

If the user wants the Solr Operator to create the HPA, they set "autoscaleReplicas.hpa.create" to true; if they want to manage the HPA themselves, they set it to false.
Managing the HPA does add some burden to the Operator, but it allows the Solr Operator to spin up an autoscaling cluster with very little user intervention.

It's also beneficial for the Solr Operator to know about the HPA, so that it can disable it during rolling restarts and other maintenance operations.

Proposed Changes

This feature will require changes to both Solr and the Solr Operator. Since the Solr Operator supports a range of Solr versions, this will not be available for Solr Operator users until they upgrade to a version of Solr that implements this SIP.

Solr Changes

The two main APIs that the Solr Operator would need to call to Solr to implement this functionality are:

  • Move replicas off of node, because it will no longer be in use
    • This already exists, and has been improved in SOLR-15803 to optimally place all replicas from a node across the cluster.
    • This needs one more parameter that lets you set multiple targetNodes instead of a single targetNode. (In v2 we can replace the old parameter, since the v2 API is still experimental.)
  • Move replicas onto node, because it is now a part of the cluster
    • This will be a NEW API, and needs to be implemented. Ideally it will work similarly to the above command, but opposite.
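Assuming the v2 path shown earlier in this proposal (which is a draft, not a released Solr API), the Operator-side call for the second case could be a simple HTTP POST. A hedged Java sketch that only builds the request:

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class UtilizeNodeRequestSketch {

  /**
   * Builds the proposed v2 "utilize node" request. The path and body
   * fields follow this SIP's draft and may change before release.
   */
  static HttpRequest buildUtilizeRequest(String solrBaseUrl, String nodeName) {
    // sourceNodes is omitted here; per the proposal it is optional.
    String body = "{\"waitForFinalState\": true}";
    return HttpRequest.newBuilder()
        .uri(URI.create(solrBaseUrl + "/api/cluster/nodes/" + nodeName + "/utilize"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
  }

  public static void main(String[] args) {
    HttpRequest req = buildUtilizeRequest("http://localhost:8983", "localhost:7574_solr");
    System.out.println(req.method() + " " + req.uri());
  }
}
```

Sending the request (e.g. with java.net.http.HttpClient) and polling for completion are left out; the point is only the shape of the proposed endpoint.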

In order to implement this logic, we would need new interfaces and methods in the placement package, as described above. Since we have 4 different built-in PlacementPlugins, we would need to implement this feature for each of those built-in plugins.

Solr Operator Changes

The Solr Operator would need four changes:

  • If enabled, on scale-down of the StatefulSet, first move replicas off of the pods that will be deleted.
  • If enabled, on scale-up of the StatefulSet, afterwards move replicas onto the pods that have been created.
  • If the user requests it, create and maintain the HorizontalPodAutoscaler that will do autoscaling for the SolrCloud.
  • During SolrCloud maintenance, disable the HorizontalPodAutoscaler if the Operator is managing it for the user.


Compatibility, Deprecation, and Migration Plan

  • This feature will require changes to both Solr and the Solr Operator. Since the Solr Operator supports a range of Solr versions, this will not be available for Solr Operator users until they upgrade to a version of Solr that implements this SIP.
  • Existing users of the Solr Operator will see these new features used only when they enable the "autoscaleReplicas" option.
    • This option requires a new Solr version, so it cannot be enabled by default until that Solr version is the minimum version supported by the Operator.
  • The Replace Node v2 command will have an API change, but v2 is still experimental, so there should be no concern there.

Security considerations

No Security Concerns

Test Plan

For the Solr APIs, we can use unit tests to verify that UTILIZENODE disperses replicas correctly, like the current ReplaceNode tests.

For the Solr Operator, we will use the e2e testing framework to test that replicas are moved on node scale-up and scale-down.
This is similar to how the tests for Solr Clouds with ephemeral data are done (Replicas have to be moved on replica deletion, since the data is ephemeral).

Rejected Alternatives

No alternatives have been rejected yet.
