Status

Current state: Discussion

Motivation

So far, Solr only has a single type of node, one that is capable of assuming all kinds of tasks. There are use cases where one would like dedicated nodes for specific types of workloads: for example, a dedicated overseer node, separate data and query nodes, or a node with no data hosted on it that can be used for administrative tasks or running plugins. Elasticsearch, Vespa, and others have first-class support for node roles.

Going forward, once SOLR-15715 is introduced, there will be a distinct role for coordinator nodes. These nodes can be used as query aggregators for distributed requests or streaming expressions, and possibly also (later) for distributed indexing. This gives users a clean mechanism to specify which nodes are data nodes (stateful) and which are coordinator nodes (stateless), and hence to employ heterogeneous deployment strategies.

Scope of this SIP

Concept of roles
Defining a "data" role
Role API and config
Not in scope, but tangentially related: SOLR-15715 and SIP-14, two upcoming features that can leverage node roles.

Proposal

Every node in Solr has one or more “roles”. The following roles are proposed:

“data” role: A node with this role can host replicas (data). By default, this is the case for all nodes.
“overseer” role: This role marks a node as a preferred overseer. When one or more such nodes are live, Solr guarantees that one of them becomes the overseer.
“coordinator” role [UPCOMING FEATURE]: A node with this role receives requests, sends out the necessary remote calls to data hosting nodes, aggregates the results, and sends them back to the user. This will be useful for distributed query requests, bulk indexing, and streaming-expression-based queries. As noted under Motivation, SOLR-15715 introduces this role. This is very similar in concept to Elasticsearch's coordinating nodes. A coordinator node is assumed to have no data hosted on it.
“zookeeper” role [UPCOMING FEATURE]: This role can be associated with nodes that can have embedded ZK nodes. See: https://cwiki.apache.org/confluence/display/SOLR/SIP-14+Embedded+Zookeeper
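The "preferred overseer" guarantee above can be illustrated with a small sketch. It is not Solr's election code: the function name `overseer_candidates` and the fallback to all live nodes when no preferred node is live are assumptions made for illustration.

```python
from typing import Dict, List


def overseer_candidates(live_nodes: List[str],
                        roles: Dict[str, List[str]]) -> List[str]:
    """Return the nodes from which the overseer should be chosen.

    If at least one live node carries the "overseer" role, only those
    nodes are candidates. Otherwise (an assumption of this sketch) any
    live node may become the overseer.
    """
    preferred = [n for n in live_nodes if "overseer" in roles.get(n, [])]
    return preferred if preferred else list(live_nodes)
```

With one live node started as `overseer` and two started as `data`, only the first is returned; if that node goes down, the sketch falls back to the remaining live nodes.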

Notes:

If the "-Dsolr.node.roles" parameter is not passed, it is implicitly assumed to be "-Dsolr.node.roles=data" (for backcompat reasons, and so that those who don't use the role feature don't need any extra parameters).
Roles are static and immutable for the entire life cycle of a node. Once a node starts up with a role, it registers the role in ZK and that sticks around until the node is stopped/restarted.
The bar for adding new roles in future should be high so it is not abused as any other tag or label for any tiny feature. It should be reserved for functionality that may benefit from a dedicated set of nodes.

Public Interfaces

There will be just one supported way to use the roles functionality:

Startup parameters

-Dsolr.node.roles=<comma separated list of roles>

Examples:

Preferred overseer node with no data (dedicated overseer):
-Dsolr.node.roles=overseer
Preferred overseer with data:
-Dsolr.node.roles=overseer,data
Coordinator node (preview for upcoming feature):
-Dsolr.node.roles=coordinator
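As a rough sketch of how the value of this parameter could be interpreted, the function below splits the comma-separated list and applies the implicit "data" default described in the Notes. The function name, the rejection of unknown role names, and the treatment of a blank value as "not passed" are assumptions of this sketch, not part of the proposal.

```python
from typing import List, Optional

# Role names listed in this SIP; validating against them is an assumption.
KNOWN_ROLES = {"data", "overseer", "coordinator", "zookeeper"}
DEFAULT_ROLES = ["data"]


def parse_node_roles(sysprop: Optional[str]) -> List[str]:
    """Turn the value of -Dsolr.node.roles into an ordered list of roles."""
    if sysprop is None or not sysprop.strip():
        # Parameter not passed (or blank): fall back to the default role.
        return list(DEFAULT_ROLES)
    roles: List[str] = []
    for raw in sysprop.split(","):
        role = raw.strip()
        if not role:
            continue  # tolerate stray commas
        if role not in KNOWN_ROLES:
            raise ValueError(f"unknown role: {role!r}")
        if role not in roles:
            roles.append(role)  # drop duplicates, keep first-seen order
    return roles or list(DEFAULT_ROLES)
```

For the examples above, `parse_node_roles("overseer,data")` yields the overseer and data roles, and `parse_node_roles(None)` yields only the data role.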

Cluster API

As of today, there are ADDROLE and REMOVEROLE APIs to add and remove roles on nodes at run time. They support only OVERSEERROLE. We propose to deprecate these APIs and recommend that users rely on startup params to achieve the same result. Supporting both ways is tricky and will lead to a lot of confusion among users.

Example scenario

There's a Solr cluster with the following:

* Layer1: There are about 100 nodes, each node has many data replicas.
* Layer2: To manage such a large cluster reliably, they keep aside 4-5 dedicated overseer nodes.
* Layer3: Since query aggregations/coordination can potentially be expensive, they keep aside 5-10 query nodes.

Proposing the roles as:
* Layer1 nodes are the "data nodes" and hence get either no role defined for them or -Dsolr.node.roles=data.
* Layer2 nodes are "overseer nodes" (though, only one of them can be an overseer at a time). They get -Dsolr.node.roles=overseer
* Layer3 nodes are "coordinator nodes", no data must be hosted on these nodes and they are started with -Dsolr.node.roles=coordinator

How to Retrieve Roles?

Public API

To read the values, use HTTP GET:

GET /api/cluster/roles

{
  "node1": ["overseer"],
  "node2": ["overseer", "data"],
  "node3": ["data"]
}

GET /api/cluster/roles/nodes/node1

["overseer"]

GET /api/cluster/roles/data

["node2", "node3"]
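The three response shapes above are different views of one node-to-roles mapping. The sketch below derives the per-node and per-role views from that mapping; the function names are hypothetical and it does not call any Solr API.

```python
from typing import Dict, List


def roles_of(cluster: Dict[str, List[str]], node: str) -> List[str]:
    """Per-node view, as in GET /api/cluster/roles/nodes/<node>."""
    return list(cluster.get(node, []))


def nodes_with_role(cluster: Dict[str, List[str]], role: str) -> List[str]:
    """Per-role view, as in GET /api/cluster/roles/<role>."""
    return sorted(node for node, roles in cluster.items() if role in roles)


# The full mapping, as in GET /api/cluster/roles.
cluster = {
    "node1": ["overseer"],
    "node2": ["overseer", "data"],
    "node3": ["data"],
}
```

Applied to the mapping above, `nodes_with_role(cluster, "data")` gives node2 and node3, matching the last response shown.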

Internal representation in ZK

All nodes join live_nodes, as is the case today
ZK structure for roles:
- /node_roles
    - overseer
      znode data: { .. /* some configs for overseer role */ ..}
      - solr1_8983
      - solr2_8983
      - solr3_8983
    - data
      znode data: { .. /* some configs for data role */ ..}
      - solr4_8983
      - solr5_8983
      - solr6_8983
      - solr7_8983
      - ...
    - coordinator (example of a future role)
      znode data: {.. /* configs.. */}
      - solrcoord1_8983
      - ...
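To make the layout concrete, the sketch below computes the znode paths a node would occupy under /node_roles and groups nodes by role the way the tree above does. It only builds strings and dictionaries; the helper names are hypothetical and no ZooKeeper client is involved.

```python
from typing import Dict, List

ROLES_ROOT = "/node_roles"


def role_znode_paths(node_name: str, roles: List[str]) -> List[str]:
    """One child znode per role: /node_roles/<role>/<node_name>."""
    return [f"{ROLES_ROOT}/{role}/{node_name}" for role in roles]


def tree_from_assignments(assignments: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Group node names under each role, mirroring the tree layout."""
    tree: Dict[str, List[str]] = {}
    for node_name, roles in assignments.items():
        for role in roles:
            tree.setdefault(role, []).append(node_name)
    return {role: sorted(nodes) for role, nodes in tree.items()}
```

A node named solr1_8983 started with the overseer role would therefore appear at /node_roles/overseer/solr1_8983.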

Roles During Application Lifecycle:

1) Roles are configured for a node when the node is started (via sysprops).

2) At startup, the node checks whether sysprops are present:

a) If sysprops are present: if configured roles are found in ZK, overwrite them with the roles specified via sysprops. If no configured roles are present, just add the roles in ZK.
b) If no sysprops are present: the node is configured with the default set of roles (at the time of this SIP, that’s [“data”]).

3) Node completes any other necessary startup and publishes itself in live_nodes.
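The startup steps above can be sketched against an in-memory stand-in for ZK. Here a plain dictionary maps each role to the set of node names registered under it; the function name and this data structure are assumptions for illustration, not the proposed implementation.

```python
from typing import Dict, List, Optional, Set


def register_roles_on_startup(zk_roles: Dict[str, Set[str]],
                              node: str,
                              sysprop: Optional[str]) -> List[str]:
    """Resolve the node's roles and record them, overwriting old entries."""
    if sysprop:
        roles = [r.strip() for r in sysprop.split(",") if r.strip()]
    else:
        roles = ["data"]  # default set of roles at the time of this SIP
    # Overwrite: drop whatever was registered for this node before.
    for members in zk_roles.values():
        members.discard(node)
    for role in roles:
        zk_roles.setdefault(role, set()).add(node)
    return roles
```

A node that was previously registered as an overseer and is restarted with `-Dsolr.node.roles=data` ends up only under the data role, since the sysprop value replaces what was stored.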

Usage of roles in code:

1) Roles will be checked in publicly published configuration (i.e. the roles API or ZK), and a watch can be set to detect any change, if required.

2) Roles will not be checked by loading config from disk. ZK is the only source of truth.

Other notes

Every time a node starts up with specified roles, the node assumes these are the correct roles for that node and publishes them in ZK after successful startup.
If a node is started with a -Dsolr.node.roles parameter that doesn't include the data role, but it already has data hosting replicas on it, the startup fails with an error (and a hint indicating how to move replicas away from this node).
If a coordinator node is also started with the "data" role, it fails to start up with a message indicating that a node cannot be both a coordinator and a data node.
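The two failure conditions above could be checked along these lines. This is a minimal sketch: the exception type, the function name, and the use of a replica count as input are hypothetical, and the error messages are illustrative wording.

```python
from typing import List


class RoleConfigError(Exception):
    """Raised when a node's roles make startup invalid (hypothetical type)."""


def validate_startup(roles: List[str], hosted_replicas: int) -> None:
    """Fail startup for the role combinations this SIP rules out."""
    if "coordinator" in roles and "data" in roles:
        raise RoleConfigError(
            "a node cannot be both a coordinator and a data node")
    if "data" not in roles and hosted_replicas > 0:
        raise RoleConfigError(
            f"node hosts {hosted_replicas} replica(s) but has no data role; "
            "move the replicas to another node before restarting")
```

A dedicated overseer with no replicas passes the check, while the same node still holding replicas does not.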

Compatibility, Deprecation, and Migration Plan

Deprecate the ADDROLE and REMOVEROLE APIs (so that the ability to change node roles at runtime is removed).
New V2 API for GET /api/cluster/roles to have nodes as key (deprecating/replacing the current one).

Security considerations

None

Test Plan

Testing should mainly focus on how the nodes behave when roles are added to and removed from the nodes. Also, the API would be tested.

Discussions

Here's the mail thread, including a summary at the end. Gmail - First class support for node roles.pdf.

Rejected Alternatives

There is no proper alternative today. There are awkward ways to achieve similar functionality:

Use autoscaling to stop data (replicas) from being placed on nodes. However, that framework has been rewritten between Solr 8x and 9x, so we don’t have a recommendation for users on a consistent way to achieve this. Also, the 9x autoscaling framework doesn't support placement plugin chaining, and hence placement plugins shouldn't be used for first-class support of node roles.
The OVERSEER role is already available today; it indicates a "preferred" overseer.

Discussions (summary)

No negative roles

There shouldn’t be a concept of “not data” or “not overseer” etc.

Everyone agrees.

Roles on/off by default?

Jason, Ilan, Houston, Jan: All roles should be on by default. Having all roles on by default is less complicated for users, instead of “treating data role differently from other roles”.

Ishan, Noble, ?Gus?: Only the roles needed for backcompat should be on by default, so that we don’t make a premature decision for roles introduced later. When a new role is introduced, whether it should be enabled by default can be decided then.

Which branch to target?

Jan, Ishan, Noble: New feature to be added to 9x branch

Need for roles?

Tim, Ilan: A new concept of node roles is unnecessary, since everything that's proposed can be achieved using changes to the new autoscaling framework and replica placement plugins. “This proposal in its current form (data and overseer roles) doesn't offer much that can't be reasonably achieved by other means” -- Ilan

Ishan, Noble: A first class concept of roles is important so that this functionality is expected to work, irrespective of whatever custom placement plugins users deploy (since placement plugins don't support chaining).

Roles for collections?

Ilan: Role aware collections. “If we make collections role-aware for example (replicas of that collection can only be placed on nodes with a specific role, in addition to the other role based constraints), the set of roles should be user extensible and not fixed.”

Ishan: Role aware collections can be implemented separately later using node roles and placement plugins. As for user extensible roles, a separate concept of user defined node labels (as a separate feature) makes more sense. This SIP is more about first class roles (that come pre-defined with Solr).

Configuration

Sysprops vs solr.xml+sysprops vs envvars:

Shawn: Solr.xml and/or envvars

Houston,Ilan: Sysprops and/or envvars

Ishan,Noble: Sysprops

Jan: SIP-11

SIP-15 Node roles

Status

Motivation

Scope of this SIP

Proposal

Public Interfaces

Startup parameters

Cluster API

Example scenario

How to Retrieve Roles?

Other notes

Compatibility, Deprecation, and Migration Plan

Security considerations

Test Plan

Discussions

Rejected Alternatives

Discussions (summary)