
Status

Current state: Implementing

...

Please keep the discussion on the mailing list or in the JIRA ticket rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Cassandra should work out of the box with near zero configuration for a typical operator. Examples of industry leaders in this area are Consul and Elasticsearch, where you download the software, it automatically clusters, and it is fully administered through an industry-standard HTTP API. Operators do not need to learn new tools or programming languages to administer these database products.

...

This significant hole in Cassandra’s usability drives us to begin building a solution, as part of the official project, that makes it easier to operate the database at any scale. This proposal represents the first step towards making Cassandra truly easy to operate. We fully expect these changes to pair with necessary changes in the Cassandra server, but we aim to minimize required daemon changes in this iteration.

Audience

The primary audience of this tool is Cassandra DevOps / Operators (people who deploy, run & maintain C* clusters). Developers and system integrators are also a target audience.

Goals

We target two main goals for the first version of the sidecar, both of which work towards an easy-to-use control plane for managing Cassandra’s data plane.

  1. Provide an extensible and pluggable architecture that lets developers and operators easily operate Cassandra and integrate it with their existing infrastructure. One major sub-goal is:

    1. The proposal should pass the “curl test”, meaning it is accessible to standard tooling and out-of-the-box libraries available for practically every environment or programming language (including python, ruby, and bash). As a consequence, we cannot choose Java-specific (JMX) or Cassandra-specific (CQL) APIs as the public interface.

  2. Provide basic but essential and useful functionality. Some proposed scope in this document:

    1. Run health checks on replicas and the cluster

    2. Run diagnostic commands on individual nodes as well as all nodes in the cluster (bulk commands)

    3. Export metrics via pluggable agents rather than polling JMX

    4. Schedule periodic management activities such as running clean ups

    5. Safely restart all nodes in the cluster (as a stretch goal).

We have intentionally and aggressively limited scope to maximize the ability of this CIP to succeed and to minimize bikeshedding. We are very interested in making a good, extensible interface, but not in adding significant additional scope to the implementation. Follow-up CIPs can add scope and features, and we will strive as hard as possible to make the interfaces powerful and extensible.

Non-Goals

In the first version we are not aiming to provide:

  1. A project that works on all previous versions of Cassandra. We are explicitly targeting 4.0 and above only, though we will strive to architect the system with changing backend interfaces (in particular the move from JMX to CQL virtual tables) in mind.

  2. Support for non-Linux operating systems. For the first version of the sidecar we strive for excellent support on Linux; additional OS support will, of course, come in the future.

  3. A comprehensive solution that solves every problem. Instead, we strive to provide the basic framework and APIs that the community can start building upon.

  4. The final implementation of the initial features. We will strive to provide sensible default implementations for functionality such as health checks. However, this can be enhanced in the future or by the operators if they want to tailor it to their specific platform.

  5. A web based UI. We believe that, as with Elasticsearch and its industry-standard HTTP management APIs, building web interfaces on top of these APIs will be straightforward. We do not endeavour to build those at this time; users may build fancier UIs and further tooling around the APIs if they so choose.

Proposed Changes


We propose the following changes to the Cassandra ecosystem:

  1. A new JVM process with a separate lifecycle from the server process (aka `sidecar` or `management process`).

  2. A separate installable artifact and start/stop scripts for the new JVM process

  3. RESTful HTTP API with the initial features:

    1. Health Checks 

    2. Bulk execution of pluggable commands

    3. Lifecycle control (start and stop the server instance)

    4. Coordinated management actions

    5. Scheduled management actions

  4. Pluggable metrics agent sidecars for exporting Cassandra metrics in a performant way.

...

Deferred scope is included to acknowledge what kinds of features we can do but explicitly choose not to do in this iteration. We hope that this explicit acknowledgement will avoid bikeshedding. 

1. Health Checks


...

  1. Is the Cassandra process able to serve as a replica

  2. Is the Cassandra process able to serve as a coordinator

  3. Is the Cassandra cluster available at particular consistency levels for a given keyspace

Proposed Scope

As a v1 feature, we can check multiple aspects of the C* daemon and separate them logically into a RESTful HTTP interface. For example: is the process up and running, is it responding to commands on the native interface, and is the in-memory configuration in sync with what is on disk? We believe the following is sufficient for v1:

...

Code Block: Healthcheck Example (bash)
# First a healthy response
$ curl -i -XGET 'localhost:5000/v1/health/coordinator'  
HTTP/1.0 200 OK
Content-Type: application/json
{
  "LOCAL_ONE_underreplicated_keyspaces": [], 
  "alive_peers": [
    "127.0.0.1:7000", 
    "127.0.0.2:7000", 
    "127.0.0.3:7000"
  ], 
  "cassandra_running": true, 
  "dead_peers": []
}
# Example of unhealthy coordinator
$ curl -i -XGET 'localhost:5000/v1/health/coordinator/'         
HTTP/1.0 503 SERVICE UNAVAILABLE
Content-Type: application/json
{
  "LOCAL_ONE_underreplicated_keyspaces": [
    "ks1"
  ], 
  "alive_peers": [
    "127.0.0.1:7000"
  ], 
  "cassandra_running": true, 
  "dead_peers": [
    "127.0.0.2:7000", 
    "127.0.0.3:7000"
  ], 
  "native_active": false
}
# Example of replica health
$ curl -i -XGET 'localhost:5000/v1/health/replica/'              
HTTP/1.0 200 OK
Content-Type: application/json
{
  "cassandra_running": true, 
  "config_sync": "GOOD", 
  "corrupt_sstables": 0, 
  "gossip_active": true, 
  "id": "eff7c7de-c3c4-4d3a-8bd7-7d4dd67b4262", 
  "tokens": [
    "-9223372036854775808", 
    "-7686143364045646507"
  ]
}

Deferred Scope

  1. GET /v1/health/cluster (stretch): For determining if the Cassandra cluster is partition free, e.g. by asking every node for its coordinator status and combining that into a global health view. Adding cluster state monitoring is not that hard, but we think we can add it incrementally; a purely hypothetical sketch of such a response follows this list.

  2. In the future coordinator healthchecks could do more thorough checking like running CQL commands against the local replica to ensure that the storage engine is functional.
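
To make the deferred cluster check concrete, the following is a purely illustrative sketch of what such a response could look like; the endpoint shape matches the v1 style above, but every field shown here is hypothetical and not part of the proposed v1 API.

Code Block
# Hypothetical sketch only: a cluster-wide health rollup (deferred, not part of v1)
$ curl -i -XGET 'localhost:5000/v1/health/cluster'
HTTP/1.0 200 OK
Content-Type: application/json
{
  "partitioned": false,
  "nodes_reporting": 3,
  "nodes_total": 3,
  "unhealthy_coordinators": []
}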

2. Bulk Commands


A bulk command is a command that can be issued to any of the sidecar processes running alongside any of the C* daemons. The sidecar then executes the command on a configurable subset of sidecars in the cluster and presents the output to the client.

...

The sidecar will be the better alternative. It will have a REST API that allows users and other sidecars to communicate with it, and it will be a completely separate `control plane` from the main C* daemon. The sidecar will act as a `proxy` that executes commands locally, but not arbitrary commands: only specific, whitelisted commands that affect the database. This also means operators can turn off remote JMX access, which is a big source of security issues.

Proposed Scope

The sidecar provides a well defined REST API that should be stable for executing frequent operator actions. We don’t envision replacing all currently accessible JMX commands via this interface, only the most frequently needed ones. For “hot config” we can offer something like the following:

...

Code Block
# Set a current runtime configuration on a node
$ curl -i -XPUT 'localhost:5000/v1/settings/' -d '{"compaction_throughput_mb_per_sec": 12}' 
HTTP/1.0 200 OK
Content-Type: application/json
{
  "runtime": {
    "compaction_throughput_mb_per_sec": 12
  }
}
# View the current runtime properties of the node
$ curl -i -XGET 'localhost:5000/v1/settings/'                                          
HTTP/1.0 200 OK
Content-Type: application/json
{
  "runtime": {
    "compaction_throughput_mb_per_sec": 12
  }
}
       
# Increase the compaction_throughput using a bulkcmd instead 
$ curl -i -XPOST 'localhost:5000/v1/bulkcmd/setcompactionthroughput' -d '{"rate_in_mb": 12}'
HTTP/1.0 200 OK
{
  "command": "setcompactionthroughput", 
  "failure": [], 
  "params": {
    "rate_in_mb": 12
  }, 
  "success": [
    "127.0.0.1:7000", 
    "127.0.0.3:7000", 
    "127.0.0.3:7000"
  ]
}
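
The example above runs the command against every node in the cluster. As a purely hypothetical sketch of how the “configurable subset” described earlier might be expressed (the `flush` command and the `nodes` parameter below are illustrative only and not a committed part of the v1 API):

Code Block
# Hypothetical sketch only: limiting a bulk command to an explicit subset of nodes
# (the "nodes" parameter is illustrative and not part of the proposed v1 API)
$ curl -i -XPOST 'localhost:5000/v1/bulkcmd/flush' -d '{"nodes": ["127.0.0.1:7000", "127.0.0.2:7000"]}'
HTTP/1.0 200 OK
{
  "command": "flush", 
  "failure": [], 
  "success": [
    "127.0.0.1:7000", 
    "127.0.0.2:7000"
  ]
}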

Deferred Scope

  1. POST /v1/settings/?refresh=true: Forces Cassandra to reload the hot properties it supports from the cassandra.yaml configuration file. This allows the source of truth to remain with configuration management. While we think this is the right way to build dynamic configuration (let configuration management drive it), we think it will require non-trivial changes to the daemon.
Code Block
# Refresh the running configuration from cassandra.yaml changes on disk
$ curl -i -XPOST 'localhost:5000/v1/settings/?refresh=true'
{
  "cassandra.yaml": {
    "9e6b2547-bbc1-44cc-bd33-73a12d8e92f4": {
      "compaction_throughput_mb_per_sec": 12...
    }, 
    "eff7c7de-c3c4-4d3a-8bd7-7d4dd67b4262": {
      "compaction_throughput_mb_per_sec": 12...
    }, 
    "f7965e32-c264-497c-a774-1d04159720a8": {
      "compaction_throughput_mb_per_sec": 12...
    }
  }
}

3. Lifecycle (safely start, stop, restart C*)


Safely starting, stopping, and restarting Cassandra is a tricky affair. Currently you have to somehow get clients to stop sending traffic, and then run a drain procedure (which announces to the rest of the cluster that the node is going down, kills ongoing requests, flushes, and shuts down). Furthermore, many users want to use better startup tooling such as systemd.

...

Furthermore most automation using Cassandra needs to block until Cassandra is actually down or actually up and taking traffic before proceeding. Therefore the start and stop actions should allow users to optionally block on those lifecycle changes.
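
For context, the sketch below shows roughly what operators script by hand today to achieve a safe restart; it assumes a systemd-managed, package-installed node, so the unit name, ports, and timings are illustrative and vary by environment.

Code Block
# Rough sketch of today's manual safe restart (illustrative; assumes a systemd-managed install)
nodetool drain                           # announce shutdown, stop accepting requests, flush to disk
sudo systemctl stop cassandra.service    # stop the daemon
sudo systemctl start cassandra.service   # start it again
# block until the native transport is serving again before moving on to the next node
until nodetool statusbinary 2>/dev/null | grep -qx 'running'; do sleep 5; done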

Proposed Scope

We propose that we use the Healthcheck functionality plus pluggable command execution to provide the following easy to use lifecycle:

...

Code Block
# Start Cassandra
$ curl -i -XPOST 'localhost:5000/v1/desires/node/start'
HTTP/1.0 200 OK
{
  "cassandra_pid": 1234, 
  "executed": "sudo systemctl start cassandra.service", 
  "healthchecks_passed": [
    "replica", 
    "coordinator", 
    "custom"
  ], 
  "time_to_start_in_s": 0
}

# Safely Stop Cassandra
$ curl -i -XPOST 'localhost:5000/v1/desires/node/stop?block'
HTTP/1.0 200 OK
{
  "cassandra_pid": null, 
  "executed": "sudo systemctl stop cassandra.service", 
  "healthchecks_failed": [
    "coordinator"
  ], 
  "time_to_stop_in_s": 5
}

# View status of Cassandra
curl -i -XGET 'localhost:5000/v1/desires/node/status'    
HTTP/1.0 503 SERVICE UNAVAILABLE
{
  "cassandra_pid": null, 
  "state": "STOPPED"
}

Deferred Scope

  1. This API should not be available remotely, and once we achieve the Coordinated Management goal this can be achieved with only local state transitions. Initially this interface will only be available at localhost for security purposes.

4. Coordinated Management


...

For version 1 we choose to target process restarts, as they need to be coordinated across datacenters/racks to guarantee high availability of Cassandra to its clients. Restarting a cluster takes a very long time, as a single Cassandra process restart may take minutes. Furthermore, there has to be some way to tell Cassandra clients to go away (e.g. via a healthcheck) before taking the node out of service, since in-flight requests are dropped during a “nodetool drain”. On hundred-node clusters, restarting a single node at a time is not practical; instead, you need topology-aware restarts which take out a tunable number of nodes between 1 and N, where N depends on the user’s setup. For example, if you have five racks you can restart entire racks without worrying about losing quorum, but you may not want to lose that much capacity at once.
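
As a purely hypothetical illustration of how such topology-aware tuning might eventually be expressed on top of the desires API proposed below (the `max_concurrent` and `grouping` fields are invented for illustration and are not part of the proposed v1 payload):

Code Block
# Hypothetical sketch only: a restart desire with tunable, topology-aware parallelism
# ("max_concurrent" and "grouping" are illustrative fields, not part of the proposed v1 payload)
$ curl -i -XPOST 'localhost:5000/v1/desires/cluster' -d '{"type": "restart", "by": 1541455085, "max_concurrent": 2, "grouping": "rack"}'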

Proposed Scope

  1. POST /v1/desires/cluster: Accepts a JSON document describing the state desire, and initiates e.g. a restart as appropriate. The actions are explicitly pluggable to allow future extensions. We plan to implement just the restart desire for version 1.
  2. GET /v1/desires/cluster: Returns the current cluster desires state
Code Block
# Instruct the cluster to restart by Nov  5 13:58:05 PST 2018
$ curl -i -XPOST 'localhost:5000/v1/desires/cluster' -d '{"type": "restart", "by": 1541455085}'
HTTP/1.0 200 OK
{
  "desires": [
    {
      "by": 1541455085, 
      "type": "restart"
    }
  ], 
  "node_states": {
    "RUNNING": [
      [
        "9e6b2547-bbc1-44cc-bd33-73a12d8e92f4", 
        "f7965e32-c264-497c-a774-1d04159720a8"
      ]
    ], 
    "STOPPED": [
      "eff7c7de-c3c4-4d3a-8bd7-7d4dd67b4262"
    ]
  }
}

# Inspect state
curl -i -XGET 'localhost:5000/v1/desires/cluster'
HTTP/1.0 200 OK
{
  "desires": [
    {
      "by": 1541455085, 
      "type": "restart"
    }
  ], 
  "node_states": {
    "RUNNING": [
      [
        "9e6b2547-bbc1-44cc-bd33-73a12d8e92f4", 
        "f7965e32-c264-497c-a774-1d04159720a8"
      ]
    ], 
    "STOPPED": [
      "eff7c7de-c3c4-4d3a-8bd7-7d4dd67b4262"
    ]
  }
}

Deferred Scope

  1. POST /v1/desires/cluster with the upgrade type: Accepts an “upgrade” action type. Upgrades require coordination in a similar fashion to restarts, but the activities involved might be more disruptive and longer-running than restarts. This demands well-agreed coordination with state persisted for longer durations (a lease agreement held for a longer duration).
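
As a purely hypothetical illustration of what such an upgrade desire might look like, reusing the desires payload shape from above (the `to_version` field below is invented for this sketch and is not a committed API):

Code Block
# Hypothetical sketch only: an upgrade desire (deferred, not part of v1)
# (the "to_version" field is invented for illustration and is not a committed API)
$ curl -i -XPOST 'localhost:5000/v1/desires/cluster' -d '{"type": "upgrade", "to_version": "4.1.0", "by": 1541455085}'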

5. Scheduled management


Scheduled tasks in the sidecar run some task on a periodic or scheduled basis (e.g. periodic cleanups, compactions, flushes, etc.). We propose pluggable scheduled jobs which allow users to perform simple yet powerful operational activities that are frequently required in Cassandra. Basically, these are cron jobs.

Proposed Scope

  1. GET /v1/scheduled/node: Shows the scheduled tasks that the sidecar runs on the local host. In v1 these are determined via the sidecar’s configuration.
  2. Cleanup of tables (nodetool cleanup) in Cassandra will be implemented in version 1. This maintenance activity is important when your environment is prone to losing and moving nodes all the time. Having `cleanup` scheduled on a regular basis helps maintain the fidelity of the database.
Code Block
# View scheduled tasks on a node
curl -i -XGET 'localhost:5000/v1/scheduled/node'   
HTTP/1.0 200 OK
{
  "tasks": {
    "cleanup": {
      "exclude_kfs": "*system*", 
      "cron_schedule": "1 12 * * *"
    }
  }
}

Deferred Scope

  1. POST /v1/scheduled/node: Updates the scheduled tasks that the sidecar runs on the local host. We believe this should be straightforward to do as an extension of the desires infrastructure, but do not plan to do it in the first version.
  2. Compactions: Compaction is an essential part of operating Cassandra. Organic/internal compactions in Cassandra are usually self-sustaining, but in cases where an anti-entropy (repair) operation either overstreams a lot of data or sends thousands of SSTables, external compactions become necessary to avoid latency spikes and availability issues.
  3. Backups: The Cassandra daemon provides snapshot capability and incremental backups. It does not provide typically expected database backup functionality, such as point-in-time streaming backups that use minimal bandwidth and support any downstream store. In the future we could add that to the sidecar, but not in the first version.

6. Metrics


Cassandra does provide metrics covering latency, QPS, implementation details (like SSTable reads per read), etc. However, exporting these metrics in a performant way (i.e. not over the JMX transport; to be performant you have to use an agent) is quite difficult. Typically everyone invents their own agents for exporting metrics to their particular metrics system (e.g. Prometheus, Spectator, SignalFx, etc.).

Proposed Scope

Provide a well known location where first class metrics collection agents that follow best practices and are known to be performant can be shipped with the sidecar packaging. Furthermore, we propose to make linking these into the daemon as easy as dropping a symlink into a well known directory in the Cassandra home directory (this will require a very minimal amount of Cassandra server change in cassandra-env.sh). As part of v1, we are targeting a Prometheus agent to be delivered and configured for Cassandra. For example, the following cassandra-env.sh changes would be sufficient to support pluggable metrics agents (or indeed any JVM agent).

Code Block
# Pull in any agents present in CASSANDRA_HOME                                                                                                                                    
for agent_file in ${CASSANDRA_HOME}/agents/*.jar; do                             
  if [ -e "${agent_file}" ]; then                                                
    base_file="${agent_file%.jar}"                                               
    if [ -s "${base_file}.options" ]; then                                       
      options=`cat ${base_file}.options`                                         
      agent_file="${agent_file}=${options}"                                      
    fi                                                                           
    JVM_OPTS="$JVM_OPTS -javaagent:${agent_file}"                                
  fi                                                                             
done                                                                             
                                                                                 
for agent_file in ${CASSANDRA_HOME}/agents/*.so; do                              
  if [ -e "${agent_file}" ]; then                                                
    base_file="${agent_file%.so}"                                                
    if [ -s "${base_file}.options" ]; then                                       
      options=`cat ${base_file}.options`                                         
      agent_file="${agent_file}=${options}"                                      
    fi                                                                           
    JVM_OPTS="$JVM_OPTS -agentpath:${agent_file}"                                
  fi                                                                             
done  
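
To illustrate how an operator might plug an agent in under this scheme, the sketch below wires up the Prometheus JMX exporter java agent by dropping it into the agents directory; the jar name, port, and exporter config path are examples only, and any JVM agent following the same jar/.options naming convention would be picked up the same way.

Code Block
# Illustrative example: enable the Prometheus JMX exporter agent
# (jar name, port, and config path are examples; adjust for your environment)
ln -s /opt/jmx_prometheus_javaagent.jar "${CASSANDRA_HOME}/agents/jmx_prometheus_javaagent.jar"
# the contents of the matching .options file are appended after "=" on the -javaagent flag;
# for the Prometheus agent the convention is "<port>:<path to exporter config>"
echo "9404:/etc/cassandra/jmx_exporter.yaml" > "${CASSANDRA_HOME}/agents/jmx_prometheus_javaagent.options"
# on the next start, cassandra-env.sh would then add:
#   -javaagent:${CASSANDRA_HOME}/agents/jmx_prometheus_javaagent.jar=9404:/etc/cassandra/jmx_exporter.yaml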


Deferred Scope

  • Long term we can provide an HTTP interface in the sidecar to the JMX metrics, which we think would be very useful, but at this time we don’t plan to implement it.

...

For this change, we’re striving to keep changes minimal in the server itself. For pluggable metrics agent support we may need to add about 10 lines of very safe bash to the cassandra-env.sh script to inject the metrics agents, but other than that we don’t think any changes will be required to C*’s public or private interfaces. Furthermore, as we do not plan to support this interface in this version, merely prototype and release an experimental feature, we do not consider the HTTP interface added here to be part of the "public" interface at this time.

Compatibility, Deprecation, and Migration Plan

Existing users of C* will not be impacted, as this is a new feature. If we can prove its value, users will need to find a way to migrate to this sidecar instead of running their own implementations.

We do not plan on removing existing behavior, just integrating with it and making it easier for users to use.

Test Plan

The changes will be merged to the sidecar repo with considerable unit test coverage, and we will contribute e2e dtests (optional to run) for the HTTP API. The dtests will serve as integration tests.

Rejected Alternatives

The well known alternative is Priam. Multiple engineers (Jason Brown, Joseph Lynch, Vinay Chella) who work for Netflix and have worked on Priam in the past or work on it currently don’t believe it is the correct foundation for the general open source community. They believe this because it has a lot of baggage that is unnecessary to bring in and often solves Netflix’s problems specifically, as opposed to those of the general community.

...