Status

Current state: Discussion

Motivation

So far, Solr has had only a single type of node, one capable of taking on every kind of task. There are use cases where one would like dedicated nodes for specific types of workloads: for example, a dedicated overseer node, separate data and query nodes, or a node that hosts no data and is used for administrative tasks or for running plugins. Elasticsearch, Vespa, and others have first-class support for node roles.

Going forward, once SOLR-15715 is introduced, there will be a distinct role for coordinator nodes. These nodes can be used as query aggregators for distributed requests or streaming expressions, and possibly (later) for distributed indexing as well. Roles give users a clean mechanism to specify which nodes are data nodes (stateful) and which are coordinator nodes (stateless), and hence to employ heterogeneous deployment strategies.

Scope of this SIP

  • Concept of roles
  • Defining a "data" role
  • Role API and config
  • Not in scope, but tangentially related: SOLR-15715 and SIP-14, two upcoming features that can leverage node roles.

Proposal

Every node in Solr will have one or more “roles”.

What is a role?

A role is a designation of a node that indicates that the node may perform a certain functionality that is governed by the role. A node that doesn't have a role may not perform the functionality associated with the role.

For example:
- Nodes with "data" role MAY host replicas (i.e. nodes without the role MAY NOT)
- Nodes with (FUTURE ROLE) "zk" role MAY run zk (i.e. nodes without the role MAY NOT)
- Nodes with (IMAGINARY EXAMPLE) "worker" role MAY execute streaming map/reduce work
- Nodes with (IMAGINARY EXAMPLE) "ingest" role MAY run Tika parsing, OCR, data prepping etc


Modes:

  • Every role also has a list of modes that a node can be in. For certain roles (e.g. overseer), modes allow finer-grained control over how strictly or loosely the role applies to a node.
  • Most roles would just have two modes (on, off)
  • In special cases a role might have more modes, e.g. "overseer" role to have (allowed, disallowed, preferred) modes.
  • For every role, one of the modes is designated the defaultIfAbsent (see the supported GET call in the roles API section below), i.e. the mode assumed for that role on a node that doesn't specify it.
    • For example, if a node starts with "-Dsolr.node.roles=data:on", the overseer role is assumed to have the mode "disallowed" on that node (i.e. the defaultIfAbsent mode of the overseer role).
    • Note: Users don't need to bother about this concept much. This is for tighter representation of the roles and modes in our system for implementation purposes, and for developers implementing new roles.


The following roles are proposed (based on existing functionality):

  1. “data” role: A node with this role can host replicas. By default, this is the case for all nodes. There are two modes (on, off), i.e. a node with role "data:on" can host replicas, whereas a node with "data:off" cannot.
  2. “overseer” role: A node with this role can act as an overseer. The supported modes are (allowed, disallowed, preferred). (1) Nodes with "overseer:preferred" will be favoured to function as the overseer leader, (2) nodes with "overseer:allowed" can become the overseer leader if no "overseer:preferred" node is live, and (3) nodes with "overseer:disallowed" will never run overseer functionality.
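
The interaction between declared modes and defaultIfAbsent can be sketched as follows. This is a minimal illustration under stated assumptions, not Solr's implementation: the `SUPPORTED_ROLES` table and the `effective_mode` helper are hypothetical names, and the modes and defaultIfAbsent values are the ones given in this proposal.

```python
# Hypothetical model of the two proposed roles; names are illustrative only.
SUPPORTED_ROLES = {
    "data": {"modes": ("on", "off"), "defaultIfAbsent": "off"},
    "overseer": {
        "modes": ("allowed", "disallowed", "preferred"),
        "defaultIfAbsent": "disallowed",
    },
}


def effective_mode(role, declared):
    """Return the mode of `role` on a node that declared `declared` ({role: mode})."""
    spec = SUPPORTED_ROLES[role]
    # A role the node did not mention falls back to that role's defaultIfAbsent.
    mode = declared.get(role, spec["defaultIfAbsent"])
    if mode not in spec["modes"]:
        raise ValueError(f"unsupported mode {mode!r} for role {role!r}")
    return mode


# A node started with -Dsolr.node.roles=data:on declares no overseer mode.
declared = {"data": "on"}
print(effective_mode("data", declared))      # on
print(effective_mode("overseer", declared))  # disallowed
```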


Roles that might be introduced in future (specifics are outside the scope of this SIP, except for examples):

  1. “coordinator” role [UPCOMING FEATURE]: This role (modes: on/off) can be associated with a node to which requests can be sent; the node sends out remote calls to data hosting nodes, aggregates the results, and sends them back to the user. This will be useful for distributed query requests, bulk indexing, and streaming-expression-based queries. See SOLR-15715. This is very similar in concept to Elasticsearch's coordinating nodes. A coordinator node would be assumed to have no data hosted on it.
  2. “zk” role [UPCOMING FEATURE]: This role can be associated with nodes that can have embedded ZK nodes. See: https://cwiki.apache.org/confluence/display/SOLR/SIP-14+Embedded+Zookeeper


Notes:

  1. If the "-Dsolr.node.roles" parameter is not passed, it is implicitly assumed to be "-Dsolr.node.roles=data:on,overseer:allowed" (for backward compatibility, and so that those who don't use the roles feature don't need any extra parameters).
  2. Roles are static and immutable for the entire life cycle of a node. Once a node starts up with a role, it registers the role in ZK and that sticks around until the node is stopped/restarted.
  3. The bar for adding new roles in future should be high, so that roles are not abused as just another tag or label for every tiny feature. A role should be reserved for functionality that may benefit from a dedicated set of nodes.

Public Interfaces

There will be just one supported way to use the roles functionality:

Startup parameter (sysprop)

| Parameter | Value | Required? | Default |
| solr.node.roles | Comma-separated list of roles (in the format <role>:<mode>) for this node, e.g. "data:on,overseer:allowed" or "overseer:preferred" | No | data:on,overseer:allowed (assumed when the parameter is not specified; a subsequent Solr release might add a new role here that's turned on by default) |


Examples:

  1. Preferred overseer node with no data (dedicated overseer):
    -Dsolr.node.roles=overseer:preferred or -Dsolr.node.roles=overseer:preferred,data:off
  2. Preferred overseer with data:
    -Dsolr.node.roles=overseer:preferred,data:on
  3. Regular data node that can also act as an overseer:
    Either specify no solr.node.roles param or explicitly specify "-Dsolr.node.roles=data:on,overseer:allowed".
  4. Coordinator node (preview for upcoming feature) that neither hosts data nor performs any overseer duty:
    -Dsolr.node.roles=coordinator:on
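
To make the parameter format concrete, here is a minimal parsing sketch. It is an illustration only, not Solr's parser: `parse_node_roles` and `DEFAULT_ROLES` are hypothetical names, and rejecting a role that appears twice is an assumption this proposal does not spell out.

```python
# Default taken from this proposal: assumed when solr.node.roles is not passed.
DEFAULT_ROLES = "data:on,overseer:allowed"


def parse_node_roles(value=None):
    """Parse a solr.node.roles value into a {role: mode} dict."""
    if value is None:
        value = DEFAULT_ROLES
    roles = {}
    for item in value.split(","):
        role, sep, mode = item.strip().partition(":")
        if not sep or not role or not mode:
            raise ValueError(f"expected <role>:<mode>, got {item!r}")
        if role in roles:
            # Assumption: a role listed twice is treated as a configuration error.
            raise ValueError(f"role {role!r} specified more than once")
        roles[role] = mode
    return roles


print(parse_node_roles())                      # {'data': 'on', 'overseer': 'allowed'}
print(parse_node_roles("overseer:preferred"))  # {'overseer': 'preferred'}
print(parse_node_roles("coordinator:on"))      # {'coordinator': 'on'}
```

Roles missing from the parsed result (for example "data" in the last two calls) would then take their defaultIfAbsent mode.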

Cluster API

As of today, there are ADDROLE and REMOVEROLE APIs to add/remove roles on nodes at run time. They support only OVERSEERROLE, which designates a preferred overseer. We propose to deprecate these APIs and recommend that users use startup params to achieve the same. Supporting both ways (API and startup params) is tricky and would lead to a lot of confusion among users.

Example scenario

There's a Solr cluster with the following:

* Layer1: There are about 100 nodes, each node has many data replicas.
* Layer2: To manage such a large cluster reliably, 4-5 dedicated overseer nodes are set aside.
* Layer3: Since query aggregation/coordination can potentially be expensive, 5-10 query nodes are set aside.

Proposing the roles as:
* Layer1 nodes are the "data nodes" and hence get either no role defined for them or -Dsolr.node.roles=data:on,overseer:allowed.
* Layer2 nodes are "overseer nodes" (though, only one of them can be an overseer at a time). They get -Dsolr.node.roles=overseer:preferred
* Layer3 nodes are "coordinator nodes"; no data must be hosted on these nodes, and they are started with -Dsolr.node.roles=coordinator:on

Note: In this configuration, the actual overseer leader will be one of the nodes in layer2. However, if all nodes in layer2 are down, then one of the layer1 nodes (with overseer:allowed) will become the overseer (until a layer2 node is back up).
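
The fallback behaviour in the note above can be sketched as a candidate filter. This is a simplified illustration, not the actual overseer election: `overseer_candidates` and the node names are hypothetical, and how one leader is chosen among the candidates is outside what this sketch shows.

```python
def overseer_candidates(live_nodes):
    """Return the live nodes eligible to become overseer leader.

    live_nodes maps node name -> overseer mode
    ("preferred", "allowed" or "disallowed").
    """
    preferred = sorted(n for n, m in live_nodes.items() if m == "preferred")
    if preferred:
        return preferred
    # No preferred node is live: fall back to nodes that merely allow the role.
    return sorted(n for n, m in live_nodes.items() if m == "allowed")


layer1 = {"data1": "allowed", "data2": "allowed", "data3": "allowed"}
layer2 = {"ovr1": "preferred", "ovr2": "preferred"}
layer3 = {"coord1": "disallowed"}  # overseer defaultIfAbsent on coordinator nodes

print(overseer_candidates({**layer1, **layer2, **layer3}))  # ['ovr1', 'ovr2']
print(overseer_candidates({**layer1, **layer3}))  # ['data1', 'data2', 'data3']
```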

How to Retrieve Roles?

Public API

To read the values, use HTTP GET:

GET /api/cluster/roles

Sample output:
{
  "node1": ["overseer:preferred"],
  "node2": ["overseer:allowed", "data:on"],
  "node3": ["data:on"]
}


GET /api/cluster/roles/supported

Sample output:
{
"overseer": {"modes": ["preferred", "allowed", "disallowed"], "defaultIfAbsent": "disallowed"},
"data": {"modes": ["on", "off"], "defaultIfAbsent": "off"}
}

Description: Which roles (and their corresponding modes) does this Solr cluster support?


GET /api/cluster/roles/nodes/${nodename}

Sample output: ["overseer:preferred"]


GET /api/cluster/roles/${rolename}

Sample output: {"node2": "preferred", "node3": "allowed"}
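
The per-role view is essentially the node-keyed listing turned around. The sketch below derives it from the sample output of GET /api/cluster/roles; `nodes_by_role` is a hypothetical helper, and listing only explicitly declared modes (rather than also listing nodes at their defaultIfAbsent mode) is an assumption, since the proposal does not specify that.

```python
def nodes_by_role(cluster_roles, rolename):
    """Map node -> mode for `rolename`, given the node-keyed roles listing."""
    result = {}
    for node, entries in cluster_roles.items():
        for entry in entries:
            role, _, mode = entry.partition(":")
            if role == rolename:
                result[node] = mode
    return result


# Node-keyed listing, as in the sample output of GET /api/cluster/roles.
cluster_roles = {
    "node1": ["overseer:preferred"],
    "node2": ["overseer:allowed", "data:on"],
    "node3": ["data:on"],
}
print(nodes_by_role(cluster_roles, "overseer"))
# {'node1': 'preferred', 'node2': 'allowed'}
print(nodes_by_role(cluster_roles, "data"))
# {'node2': 'on', 'node3': 'on'}
```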


Internal representation in ZK

  • All nodes join live_nodes, as is the case today
  • ZK structure for roles:
      • /node_roles
        • overseer
          • preferred 
            • nodes
              •  solr1_8983 (ephemeral node) 
              •  solr2_8983 (ephemeral node)
          • allowed
            • nodes
              •  solr3_8983 (ephemeral node)
          • disallowed
            • nodes
              • solr4_8983 (ephemeral node)
              • solr5_8983 (ephemeral node)
              • solrcoord1_8983 (ephemeral node)
        • data
          • on 
            • nodes
              •  solr4_8983 (ephemeral node) 
              •  solr5_8983 (ephemeral node) 
          • off
            • nodes
              • solr1_8983 (ephemeral node)
              • solr2_8983 (ephemeral node)
              • solr3_8983 (ephemeral node)
              • solrcoord1_8983 (ephemeral node)
        • coordinator (example of a future role)
          • on
            • nodes
              • solrcoord1_8983 (ephemeral node)
          • off
            • nodes
              • solr1_8983 (ephemeral node)
              • solr2_8983 (ephemeral node)
              • solr3_8983 (ephemeral node)
              • solr4_8983 (ephemeral node)
              • solr5_8983 (ephemeral node)
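
The layout above implies that every node appears once under each role, at either its declared mode or the role's defaultIfAbsent mode. A minimal sketch of the resulting znode paths follows; `role_znode_paths` is a hypothetical helper, and the "off" default for the future "coordinator" role is inferred from the tree above rather than stated by the proposal.

```python
# role -> defaultIfAbsent mode. "coordinator" is a future role; its default
# here is inferred from the example tree, not defined by this SIP.
DEFAULT_IF_ABSENT = {
    "overseer": "disallowed",
    "data": "off",
    "coordinator": "off",
}


def role_znode_paths(node_name, declared):
    """Return the ephemeral znode paths a node would create under /node_roles."""
    paths = []
    for role, default_mode in DEFAULT_IF_ABSENT.items():
        mode = declared.get(role, default_mode)
        paths.append(f"/node_roles/{role}/{mode}/nodes/{node_name}")
    return paths


for path in role_znode_paths("solrcoord1_8983", {"coordinator": "on"}):
    print(path)
# /node_roles/overseer/disallowed/nodes/solrcoord1_8983
# /node_roles/data/off/nodes/solrcoord1_8983
# /node_roles/coordinator/on/nodes/solrcoord1_8983
```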

Roles During Application Lifecycle:

1) Roles are configured for a node when the node is started (via sysprops).

2) At startup, are the sysprops present?

  • Yes: The specified roles are published as ephemeral nodes in ZK.
  • No: The node publishes the default set of roles (at the time of this SIP, that’s [data:on,overseer:allowed]).

3) The node completes any other necessary startup and publishes itself in live_nodes.

Usage of roles in code:

1) Roles will be checked in publicly published configuration (i.e. roles API, zk)

2) Roles will not be checked by loading config from disk (ZK is the only source of truth).

Guidance on adding a new role

  • Do you have a new functionality or existing functionality that you want the users to be able to turn on/off on certain nodes, esp from the point of view of functional (role based) isolation of nodes? Yes: good candidate, No: you might not need a separate role
  • Do you want the functionality associated with the role to be turned on for any user (not already using roles functionality) upgrading to this new Solr version (without having to explicitly turn it on)?
    • Yes: Change the current default value for "solr.node.roles" from "data:on,overseer:allowed" to "data:on,overseer:allowed,myrole:on"
    • No: Either don't change the default of "solr.node.roles" or change it from "data:on,overseer:allowed" to "data:on,overseer:allowed,myrole:off"
  • How do you tell users who are already using some roles on their nodes how to turn on this functionality?
    • In upgrade notes and/or in ref guide, instruct the users with language similar to this: "If you're already explicitly using roles (i.e. you are using "solr.node.roles" for your nodes), then you should append ",myrole:on" to all nodes where you wish to enable this functionality (introduced by myrole)"
  • Designate one of the modes as the defaultIfAbsent. Most likely that's going to be "off", "disallowed", or similar. This affects only those nodes where some roles are explicitly or implicitly configured, but this new role is not present.

Other notes

  • Every time a node starts up with specified roles, the node assumes those are the correct roles for that node and publishes them in ZK after successful startup.
  • If a node is started with a -Dsolr.node.roles parameter that doesn't have a data role (or with data:off), but it already has data hosting replicas on it, the startup fails with an error (and a hint indicating how to move replicas away from this node).
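
The startup check in the last bullet could look roughly like the sketch below. It is an illustration only: `RoleStartupError`, `check_data_role`, and the replica names are hypothetical, and the wording of the hint is invented for the example.

```python
class RoleStartupError(RuntimeError):
    """Raised when a node's declared roles conflict with its local state."""


def check_data_role(declared, hosted_replicas):
    """Fail startup if a node without data:on already hosts replicas.

    declared: {role: mode} as given via solr.node.roles.
    hosted_replicas: names of replicas found on this node (hypothetical input).
    """
    data_mode = declared.get("data", "off")  # defaultIfAbsent for "data" is "off"
    if data_mode != "on" and hosted_replicas:
        raise RoleStartupError(
            f"node hosts {len(hosted_replicas)} replica(s) but its data role is "
            f"{data_mode!r}; move the replicas to another node before "
            "starting this node without data:on"
        )


check_data_role({"overseer": "preferred"}, [])        # passes: nothing hosted
check_data_role({"data": "on"}, ["coll1_shard1_r1"])  # passes: data:on
```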

Compatibility, Deprecation, and Migration Plan

  • Deprecate the ADDROLE and REMOVEROLE APIs (so that the ability to change node roles at runtime is removed).
  • New V2 API for GET /api/cluster/roles to have nodes as key (deprecating/replacing the current one). 

Security considerations

None

Test Plan

Testing should mainly focus on how the nodes behave when roles are added to and removed from the nodes. The roles API will also be tested.

Discussions

Here's the mail thread. roles discussion - 1.pdf (first 100 mails in the thread) and roles discussion - 2.pdf (next 29 mails in the thread).

Rejected Alternatives

There is no proper alternative today. There are awkward ways to achieve similar functionality:

  • Use autoscaling to stop data (replicas) from being placed on nodes. Autoscaling placement rules may help avoid replicas getting placed on a certain node. But that does not mean other nodes can discover who is performing what functionality, or that a node can be told to start with some feature enabled/disabled.
  • The OVERSEER role is already available today; it indicates a "preferred" overseer.

