Status

Current state: Implementing

Discussion thread: here

JIRA: CASSANDRA-14395 – C* Management process

Shepherd: Jason Brown

Contributors: Vinay Chella, Dinesh Joshi, Joseph Lynch

Released: Unreleased

Please keep the discussion on the mailing list or in the JIRA ticket rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Cassandra should work out of the box with near zero configuration for a typical operator. Examples of industry leaders in this area are Consul and Elasticsearch: you download the database, it automatically clusters, and it is fully administered through an industry-standard HTTP API. Operators do not need to learn new tools or programming languages to administer these database products.

Simply put, Cassandra is hard to operate. It requires complex configuration management, automation tools, and operator attention to handle the most basic of cluster operations. We believe this is because Cassandra requires:

  • Custom, non-standard tools to operate the database (e.g. nodetool) and complex distributed execution tools to run commands on the whole cluster (as nodetool is only local)

  • Complex automation to perform everyday maintenance tasks such as restarting the database without performance degradation.

  • Configuration of multiple files of different types (yaml, property, xml)

Due to this complexity, many major Cassandra adopters internally build some type of “sidecar” process or processes that run alongside the Cassandra server daemon. These processes perform various tasks that are essential to running a cluster but aren’t included in the database itself, such as health checks, safe shutdown, safe restart, cleanups, configuration, bulk commands, etc. To get a better understanding of what it takes to reliably operate a Cassandra cluster, please refer to Netflix’s talk at the 2018 Distributed Data Summit: Looking towards an Official Cassandra Sidecar - Netflix

This significant hole in Cassandra’s usability drives us to begin building a solution as part of the official project to make it easier to operate the database at any scale. This proposal represents the first step towards making Cassandra truly easy to operate. We fully expect this work to pair with necessary changes in the Cassandra server, but we aim to minimize required daemon changes in this iteration.

Audience

The primary audience of this tool is Cassandra DevOps / operators (people who deploy, run, and maintain C* clusters). Developers and system integrators are also a target audience.

Goals

We target two main goals for the first version of the sidecar; both work towards an easy-to-use control plane for managing Cassandra’s data plane.

  1. Provide an extensible and pluggable architecture that lets developers and operators easily operate Cassandra and integrate it with their existing infrastructure. One major sub-goal is:

    1. The proposal should pass the “curl test”: it must be accessible to standard tooling and the out-of-the-box libraries available for practically every environment or programming language (including Python, Ruby, and Bash). This means that, as a public interface, we cannot choose Java-specific (JMX) or Cassandra-specific (CQL) APIs.

  2. Provide basic but essential and useful functionality. Some proposed scope in this document:

    1. Run health checks on replicas and the cluster

    2. Run diagnostic commands on individual nodes as well as all nodes in the cluster (bulk commands)

    3. Export metrics via pluggable agents rather than polling JMX

    4. Schedule periodic management activities such as running clean ups

    5. (As a stretch goal) Safely restart all nodes in the cluster.

We have intentionally and aggressively limited scope to maximize this CIP’s chance of success and minimize bikeshedding. We are very interested in making a good, extensible interface, but not in adding significant additional scope to the implementation. Follow-up CIPs can add further scope and features, and we will strive to make the interfaces powerful and extensible.

Non-Goals

In the first version we are not aiming to provide:

  1. A project that works on all previous versions of Cassandra. We are explicitly targeting 4.0 and above only. We will strive to architect the system with changing backend interfaces (in particular the move from JMX to CQL virtual tables) in mind.

  2. Support for non-Linux operating systems. For the first version of the sidecar we strive to have excellent support on Linux; in the future we will, of course, add additional OS support.

  3. A comprehensive solution that solves every problem. We instead strive to provide the basic framework & APIs that we, as a community, can start building upon.

  4. The final implementation of the initial features. We will strive to provide sensible default implementations for functionality such as health checks. However, this can be enhanced in the future or by the operators if they want to tailor it to their specific platform.

  5. A web based UI. We believe that, as with Elasticsearch and its industry-standard HTTP management APIs, building web interfaces on top of these APIs will be straightforward. We do not endeavour to build those at this time; users may choose to build fancier UIs and further tooling around the APIs if they wish.

Proposed Changes

We propose the following changes to the Cassandra ecosystem:

  1. A new JVM process with separate lifecycle from the server process (aka `sidecar` or `management process`).

  2. A separate installable artifact and start/stop scripts for the new JVM process

  3. RESTful HTTP API with the initial features:

    1. Health Checks 

    2. Bulk execution of pluggable commands

    3. Lifecycle control (start and stop the server instance)

    4. Coordinated management actions

    5. Scheduled management actions

  4. Pluggable metrics agent sidecars for exporting Cassandra metrics in a performant way.

A detailed description of each feature follows, broken down into background, proposed scope, and deferred scope, along with concrete examples of what the APIs might look like (obviously the resulting HTTP API may differ slightly based on implementation learnings).

Deferred scope is included to acknowledge what kinds of features we can do but explicitly choose not to do in this iteration. We hope that this explicit acknowledgement will avoid bikeshedding. 

1. Health Checks

Health checks are essential in any system. They can be as simple as checking the existence of a process or as detailed as validating several aspects of the system to determine the true health of a process. Healthchecks are used by load balancers, service discovery systems, and monitoring, and are most frequently implemented as HTTP endpoints that respond with 2xx codes to indicate health or 5xx codes to indicate errors.
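
For illustration only, a monitoring script or load balancer check could gate purely on the HTTP status code; this minimal sketch assumes the sidecar listens on port 5000, as in the examples later in this section (curl's -f flag makes the call fail on 5xx responses):
# Gate on the HTTP status alone (hypothetical port, endpoint as proposed below)
$ curl -sf -o /dev/null 'localhost:5000/v1/health/coordinator' && echo healthy || echo unhealthy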

In the case of Cassandra there are three distinct types of health:

  1. Is the Cassandra process able to serve as a replica?

  2. Is the Cassandra process able to serve as a coordinator?

  3. Is the Cassandra cluster available at particular consistency levels for a given keyspace?

Proposed Scope

As a v1 feature, we can check multiple aspects of the C* daemon and separate them logically into a RESTful HTTP interface. For example: is the process up and running, is it responding to commands on the native interface, is the in-memory configuration in sync with what is on disk? We believe the following is sufficient for v1:

  1. GET /v1/health/coordinator: For determining if a C* node can act as a coordinator, e.g. does it have active native transport, does it have healthy TCP/gossip connections with peers, is Cassandra running …

  2. GET /v1/health/replica: For determining if a C* node can safely act as a replica, e.g. is Cassandra running, is the in-memory configuration in sync with what is on disk (yaml), are there corrupt sstables present, etc …?

Healthcheck Example
# First a healthy response
$ curl -i -XGET 'localhost:5000/v1/health/coordinator'  
HTTP/1.0 200 OK
Content-Type: application/json
{
  "LOCAL_ONE_underreplicated_keyspaces": [], 
  "alive_peers": [
    "127.0.0.1:7000", 
    "127.0.0.2:7000", 
    "127.0.0.3:7000"
  ], 
  "cassandra_running": true, 
  "dead_peers": []
}
# Example of unhealthy coordinator
$ curl -i -XGET 'localhost:5000/v1/health/coordinator/'         
HTTP/1.0 503 SERVICE UNAVAILABLE
Content-Type: application/json
{
  "LOCAL_ONE_underreplicated_keyspaces": [
    "ks1"
  ], 
  "alive_peers": [
    "127.0.0.1:7000"
  ], 
  "cassandra_running": true, 
  "dead_peers": [
    "127.0.0.2:7000", 
    "127.0.0.3:7000"
  ], 
  "native_active": false
}
# Example of replica health
$ curl -i -XGET 'localhost:5000/v1/health/replica/'              
HTTP/1.0 200 OK
Content-Type: application/json
{
  "cassandra_running": true, 
  "config_sync": "GOOD", 
  "corrupt_sstables": 0, 
  "gossip_active": true, 
  "id": "eff7c7de-c3c4-4d3a-8bd7-7d4dd67b4262", 
  "tokens": [
    "-9223372036854775808", 
    "-7686143364045646507"
  ]
}

Deferred Scope

  1. GET /v1/health/cluster (stretch): For determining if the Cassandra cluster is partition free, e.g. by asking every node for its coordinator status and combining that into a global health view. Adding cluster-state monitoring is not that hard, but we think it can be added incrementally (a hypothetical response is sketched after this list).

  2. In the future coordinator healthchecks could do more thorough checking like running CQL commands against the local replica to ensure that the storage engine is functional.
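
As a purely hypothetical sketch of the deferred cluster endpoint (the field names below are illustrative, not part of this proposal):
# Hypothetical example of a future cluster-wide health view
$ curl -i -XGET 'localhost:5000/v1/health/cluster'
HTTP/1.0 200 OK
Content-Type: application/json
{
  "QUORUM_underreplicated_keyspaces": [],
  "partitioned_nodes": [],
  "nodes_reporting": 3
}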

2. Bulk Commands


A bulk command is a command that can be issued to any of the sidecar processes running alongside any of the C* daemons. The sidecar then executes the command on a configurable subset of sidecars in the cluster and presents the output to the client.

There are many tools in existence that have this functionality, and pretty much all of them rely on SSH access. They SSH into each host / container, run the command, aggregate the output, and return the result to the caller. This isn't good practice for multiple reasons. Using SSH to run a command that queries `nodetool` on each host in the cluster is inefficient and hard to set up and operate. It has the additional overhead of provisioning `bot` accounts that become complicated to manage. Most importantly, it also opens up vulnerabilities in your system: a bad actor gaining access to a single node suddenly has shell access to all nodes in the cluster. The reason everyone chooses to roll their own is the lack of a better alternative.

The sidecar will be that better alternative. It will have a REST API that allows users and other sidecars to communicate with it. It will be a completely separate `control plane` from the main C* daemon. The sidecar will act as a `proxy` to execute commands locally, but not arbitrary commands: only specific, whitelisted commands that affect the database. This also means operators can turn off remote JMX access, which is a big source of security issues.

Proposed Scope

The sidecar provides a well-defined REST API that should be stable for executing frequent operator actions. We don’t envision replacing every currently accessible JMX command via this interface, only the most common operator actions. For “hot config” we can offer something like the following:

  1. PUT /v1/settings: Accepts a JSON object with keys and values defined as they are in cassandra.yaml; nested keys are expressed with dots. This endpoint will update the config option on all nodes or just the local node.
  2. GET /v1/settings: Retrieves the current runtime configuration of the provided node id (or all nodes).
  3. POST /v1/bulkcmd/<cmd>: Accepts a JSON object that contains the parameters for the given cmd. The cmds will run plugins that execute various actions against one or more nodes.
  4. Authentication of the REST API. We expect to support TLS and some type of authorization (e.g. JWT).

# Set a current runtime configuration on a node
$ curl -i -XPUT 'localhost:5000/v1/settings/' -d '{"compaction_throughput_mb_per_sec": 12}' 
HTTP/1.0 200 OK
Content-Type: application/json
{
  "runtime": {
    "compaction_throughput_mb_per_sec": 12
  }
}
# View the current runtime properties of the node
$ curl -i -XGET 'localhost:5000/v1/settings/'                                          
HTTP/1.0 200 OK
Content-Type: application/json
{
  "runtime": {
    "compaction_throughput_mb_per_sec": 12
  }
}
       
# Increase the compaction_throughput using a bulkcmd instead 
$ curl -i -XPOST 'localhost:5000/v1/bulkcmd/setcompactionthroughput' -d '{"rate_in_mb": 12}'
HTTP/1.0 200 OK
{
  "command": "setcompactionthroughput", 
  "failure": [], 
  "params": {
    "rate_in_mb": 12
  }, 
  "success": [
    "127.0.0.1:7000", 
    "127.0.0.3:7000", 
    "127.0.0.3:7000"
  ]
}
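
As a hedged illustration of item 4 above, an authenticated call might eventually look something like the following; the certificate path, scheme, and token are purely hypothetical since the exact TLS and authorization mechanisms are not yet decided:
# Hypothetical authenticated call once TLS and e.g. JWT support are in place
$ curl -i --cacert /etc/cassandra-sidecar/ca.pem \
       -H 'Authorization: Bearer <token>' \
       -XGET 'https://localhost:5000/v1/settings/'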

Deferred Scope

  1. POST /v1/settings/?refresh=true: Forces Cassandra to reload the supported hot properties it can from the cassandra.yaml configuration file. This allows the source of truth to remain with configuration management. While we think this is the right way to build dynamic configuration (let configuration management drive it), we think it will require non-trivial changes to the daemon.
# Refresh the running configuration from cassandra.yaml changes on disk
$ curl -i -XPOST 'localhost:5000/v1/settings/?refresh=true'
{
  "cassandra.yaml": {
    "9e6b2547-bbc1-44cc-bd33-73a12d8e92f4": {
      "compaction_throughput_mb_per_sec": 12...
    }, 
    "eff7c7de-c3c4-4d3a-8bd7-7d4dd67b4262": {
      "compaction_throughput_mb_per_sec": 12...
    }, 
    "f7965e32-c264-497c-a774-1d04159720a8": {
      "compaction_throughput_mb_per_sec": 12...
    }
  }
}

3. Lifecycle (safely start, stop, restart C*)


Safely starting, stopping, and restarting Cassandra is a tricky affair. Currently you have to somehow get clients to stop sending traffic and then run a drain procedure (which announces to the rest of the cluster that the node is going down, kills ongoing requests, flushes, and shuts down). Furthermore, many users want to manage the daemon with standard service managers such as systemd.

Being able to easily start and stop a database is crucial. It is basic functionality that has to work. Users don’t know the difference between flush, drain, disablegossip, disablebinary, etc.; they just want to stop the database.

Furthermore most automation using Cassandra needs to block until Cassandra is actually down or actually up and taking traffic before proceeding. Therefore the start and stop actions should allow users to optionally block on those lifecycle changes.

Proposed Scope

We propose that we use the Healthcheck functionality plus pluggable command execution to provide the following easy to use lifecycle:

To stop the daemon:

[running] -> [fail pluggable healthcheck (configurable duration := 30s)]
          -> [drain (configurable timeout := 30s)]
          -> [execute pluggable stop script]

To start the daemon:

[stopped] -> [execute pluggable start script]
          -> [waiting for healthy (configurable timeout := 120s)]
          -> [pluggable healthcheck passing]


We propose a desire-based API which allows users to describe their desires for the state of the cluster, in the spirit of infrastructure as code. This type of solution is significantly more robust and is the correct way to build management planes (as opposed to imperative ones).

  1. GET /v1/desires/node/status: Returns a JSON object with either a 200 code if Cassandra is running or a 5xx code if it is not. The response will contain useful information such as whether the node is starting or stopping, what PID it is running as, etc ...
  2. POST /v1/desires/node/stop?block=<bool>: Initiates the orderly shutdown routine and returns a JSON object indicating the state of the stop. If block is supplied this call will block until the full shutdown procedure has completed, otherwise it returns immediately after executing the stop script.
  3. POST /v1/desires/node/start?block=<bool>: Initiates the orderly startup routine and returns a JSON object indicating the state of the start. If block is supplied this call will block until the full startup procedure has been followed, otherwise it returns immediately after executing the startup script. 
# Start Cassandra
$ curl -i -XPOST 'localhost:5000/v1/desires/node/start'
HTTP/1.0 200 OK
{
  "cassandra_pid": 1234, 
  "executed": "sudo systemctl start cassandra.service", 
  "healthchecks_passed": [
    "replica", 
    "coordinator", 
    "custom"
  ], 
  "time_to_start_in_s": 0
}

# Safely Stop Cassandra
$ curl -i -XPOST 'localhost:5000/v1/desires/node/stop?block'
HTTP/1.0 200 OK
{
  "cassandra_pid": null, 
  "executed": "sudo systemctl stop cassandra.service", 
  "healthchecks_failed": [
    "coordinator"
  ], 
  "time_to_stop_in_s": 5
}

# View status of Cassandra
curl -i -XGET 'localhost:5000/v1/desires/node/status'    
HTTP/1.0 503 SERVICE UNAVAILABLE
{
  "cassandra_pid": null, 
  "state": "STOPPED"
}

Deferred Scope

  1. This API should not be available remotely; once we achieve the Coordinated Management goal, remote operations can be accomplished with only local state transitions. Initially this interface will only be available on localhost for security purposes.

4. Coordinated Management


There are certain sidecar activities that require tasks to be executed in a well-defined order. Failure to follow this strict coordination might leave Cassandra in an unpredictable or unwanted state. Examples include restarting a Cassandra cluster with tunable speed (allowing users to do a single node at a time all the way up to entire racks at once) or upgrading the Cassandra daemon itself.

We believe that Cassandra should provide a simple yet pluggable desire based orchestration engine, where users can declare the desire for Cassandra to do some maintenance task and then the database goes and does it. For example users could ask “any Cassandra process that is older than 2018-10-01 at 12pm should be restarted, respecting availability requirements”. There are various existing (mostly SSH or SQS based) tools that do this kind of orchestration, but they are typically imperative (as opposed to declarative) and are therefore quite brittle, meaning that the maintenance task can get stuck, or worse yet not happen.

For version 1 we choose to target process restarts, as they need to be coordinated across datacenters/racks to guarantee high availability of Cassandra to its clients. Restarting a cluster takes a very long time because a single Cassandra process restart may take minutes. Furthermore, there has to be some way to tell Cassandra clients to go away (e.g. a healthcheck) before taking the node out of service, as in-flight requests are dropped during an “nt drain”. On hundred-node clusters, doing a single node at a time is also not practical. Instead, you need topology-aware restarts which take out a tunable number of nodes between 1 and N, where N depends on the user’s setup. For example, if you have five racks you can restart entire racks without worrying about losing quorum, but you may not want to lose that much capacity at once.

Proposed Scope

  1. POST /v1/desires/cluster: Accepts a JSON document describing the desired state, and initiates e.g. a restart as appropriate. The actions are explicitly pluggable to allow future extensions. We plan to implement just the restart desire for version 1.
  2. GET /v1/desires/cluster: Returns the current cluster desire state.
# Instruct the cluster to restart by Nov  5 13:58:05 PST 2018
$ curl -i -XPOST 'localhost:5000/v1/desires/cluster' -d '{"type": "restart", "by": 1541455085}'
HTTP/1.0 200 OK
{
  "desires": [
    {
      "by": 1541455085, 
      "type": "restart"
    }
  ], 
  "node_states": {
    "RUNNING": [
      [
        "9e6b2547-bbc1-44cc-bd33-73a12d8e92f4", 
        "f7965e32-c264-497c-a774-1d04159720a8"
      ]
    ], 
    "STOPPED": [
      "eff7c7de-c3c4-4d3a-8bd7-7d4dd67b4262"
    ]
  }
}

# Inspect state
curl -i -XGET 'localhost:5000/v1/desires/cluster'
HTTP/1.0 200 OK
{
  "desires": [
    {
      "by": 1541455085, 
      "type": "restart"
    }
  ], 
  "node_states": {
    "RUNNING": [
      [
        "9e6b2547-bbc1-44cc-bd33-73a12d8e92f4", 
        "f7965e32-c264-497c-a774-1d04159720a8"
      ]
    ], 
    "STOPPED": [
      "eff7c7de-c3c4-4d3a-8bd7-7d4dd67b4262"
    ]
  }
}

Deferred Scope

  1. POST /v1/desires/cluster with the upgrade type: Accepts an “upgrade” action type. Upgrades require coordination in a similar fashion to restarts, but the activities involved might be more disruptive and longer-running than restarts. This demands well-agreed coordination with state persisted for longer durations (a lease agreement over a longer duration).
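
A purely hypothetical sketch of what such a request could look like (the upgrade type and its parameters, such as target_version, are illustrative and not part of this proposal's scope):
# Hypothetical future upgrade desire, coordinated like a restart
$ curl -i -XPOST 'localhost:5000/v1/desires/cluster' \
       -d '{"type": "upgrade", "target_version": "4.1.0", "by": 1544047085}'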

5. Scheduled management


Scheduled tasks in the sidecar run some operation on a periodic or scheduled basis (e.g. periodic cleanups, compactions, flushes, etc.). We propose pluggable scheduled jobs which allow users to achieve simple yet powerful operational activities that are frequently required in Cassandra. Essentially these are cron jobs.

Proposed Scope

  1. GET /v1/scheduled/node: Shows the scheduled tasks that the sidecar runs on the local host. In v1 these are determined via the sidecar’s configuration.
  2. Cleanup of tables (nodetool cleanup) will be implemented in version 1. This maintenance activity is important when your environment loses and moves nodes all the time. Having `cleanup` scheduled on a regular basis helps maintain the fidelity of the database.
# View scheduled tasks on a node
curl -i -XGET 'localhost:5000/v1/scheduled/node'   
HTTP/1.0 200 OK
{
  "tasks": {
    "cleanup": {
      "exclude_kfs": "*system*", 
      "cron_schedule": "1 12 * * *"
    }
  }
}

Deferred Scope

  1. POST /v1/scheduled/node: Updates the scheduled tasks that the sidecar runs on the local host. We believe this should be straightforward as an extension of the desires infrastructure, but we do not plan to do it in the first version (a hypothetical request is sketched after this list).
  2. Compactions: Compaction is an essential part of operating Cassandra. Organic/internal compactions in Cassandra are usually self-sustaining, but in cases where an anti-entropy (repair) operation either overstreams a lot of data or sends thousands of SSTables, running external compactions becomes necessary to avoid latency spikes and availability issues.
  3. Backups: The Cassandra daemon provides snapshot capability and incremental backups. It does not provide typically expected database backup functionality such as point-in-time streaming backups that use minimal bandwidth and support arbitrary downstream stores. In the future we could add that to the sidecar, but not in the first version.
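
To illustrate the first deferred item, updating the schedule might eventually look something like this (the request shape mirrors the GET example above and is purely hypothetical):
# Hypothetical future update of the local schedule
$ curl -i -XPOST 'localhost:5000/v1/scheduled/node' \
       -d '{"tasks": {"cleanup": {"cron_schedule": "1 12 * * 7", "exclude_kfs": "*system*"}}}'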

6. Metrics


Cassandra does provide metrics covering latency, QPS, implementation-detail metrics (like sstable reads per read), etc. However, exporting these metrics in a performant way (i.e. not over the JMX transport; to be performant you have to use an agent) is quite difficult. Typically everyone invents their own agent for exporting metrics to their particular metrics system (e.g. Prometheus, Spectator, SignalFx, etc.).

Proposed Scope

Provide a well-known location where first-class metrics collection agents, which follow best practices and are known to be performant, can be shipped with the sidecar packaging. Furthermore, we propose to make linking these into the daemon as easy as dropping a symlink into a well-known directory in the Cassandra home directory (this will require a very minimal amount of Cassandra server change in cassandra-env.sh). As part of v1 we are targeting a Prometheus agent to be delivered and configured for Cassandra. For example, the following changes to cassandra-env.sh would be sufficient to support pluggable metrics agents (or indeed any JVM agent).

# Pull in any agents present in CASSANDRA_HOME
for agent_file in "${CASSANDRA_HOME}"/agents/*.jar; do
  if [ -e "${agent_file}" ]; then
    base_file="${agent_file%.jar}"
    # An optional .options file next to the jar supplies agent arguments
    if [ -s "${base_file}.options" ]; then
      options=$(cat "${base_file}.options")
      agent_file="${agent_file}=${options}"
    fi
    JVM_OPTS="$JVM_OPTS -javaagent:${agent_file}"
  fi
done

for agent_file in "${CASSANDRA_HOME}"/agents/*.so; do
  if [ -e "${agent_file}" ]; then
    base_file="${agent_file%.so}"
    if [ -s "${base_file}.options" ]; then
      options=$(cat "${base_file}.options")
      agent_file="${agent_file}=${options}"
    fi
    JVM_OPTS="$JVM_OPTS -agentpath:${agent_file}"
  fi
done
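
With that hook in place, wiring in a bundled agent could be as simple as the sketch below; the jar name, install path, and options file contents are hypothetical and depend on how the agent is ultimately packaged:
# Hypothetical: link a packaged Prometheus JVM agent into Cassandra's agents directory
ln -s /usr/share/cassandra-sidecar/agents/prometheus-agent.jar \
      "${CASSANDRA_HOME}/agents/prometheus-agent.jar"
# Optional agent arguments picked up by the cassandra-env.sh snippet above
echo "port=9500" > "${CASSANDRA_HOME}/agents/prometheus-agent.options"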


Deferred Scope

  • Long term we can provide an HTTP interface in the sidecar to the JMX metrics, which we think would be very useful, but at this time we don’t plan to implement it.


New or Changed Public Interfaces

For this change, we’re striving to keep changes to the server itself minimal. For pluggable metrics agent support we may need to add about 10 lines of very safe bash to the cassandra-env.sh script to inject the metrics agents, but other than that we don’t think any changes will be required to C*’s public or private interfaces. Furthermore, as we do not plan to support this interface in this version, merely prototype and release an experimental feature, we do not consider the HTTP interface added here to be part of the "public" interface at this time.

Compatibility, Deprecation, and Migration Plan

Existing users of C* will not be impacted as this is a new feature. If we can prove its value, users will need to find a way to migrate to this sidecar instead of running their own implementation.

We do not plan on removing existing behavior, just integrating with it and making it easier for users to use.

Test Plan

The changes will be merged to the sidecar repo with considerable unit tests and we will contribute e2e dtests (that are optional to run) for the HTTP API. The dtests will serve as integration tests.

Rejected Alternatives

The well-known alternative is Priam. Multiple engineers (Jason Brown, Joseph Lynch, Vinay Chella) who work for Netflix and have worked on Priam in the past, or do currently, don’t believe it is the correct foundation for the general open source community. They believe this because it has a lot of baggage that is unnecessary to bring in and often solves Netflix’s problems specifically rather than those of the general community.

There are many tools which fulfil part of the specification here:

  • Jolokia. This would provide an HTTP interface to the Cassandra mbeans, but it doesn’t translate them into a stable maintenance API that tool developers can code against with confidence.

  • Various “ssh in a for loop” management solutions for running commands. Some of these tools are even topology aware, but they are all based on a fundamentally flawed control-plane methodology (push based via SSH), which simply does not scale. Instead we need a distributed control plane that coordinates through state transitions rather than imperative actions.

  • Reaper solves the problem of scheduling repair, but it is not extensible to other types of maintenance, and we believe that long term, tools like Reaper can complement this sidecar.

The proposed sidecar aims to support Cassandra 4.0 and beyond. There are various tools in the community, but none of them provide the unified functionality and advanced feature set we propose here.
