DRAFT - Not yet in discussion

Status

Current state: Under Discussion

Discussion thread: TBD

JIRA: TBD

Released: <Cassandra Version>

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Scope

todo

Goals

Provide an easy way to enforce system-wide soft and hard limits that prevent usage anti-patterns. In the long run, it should become impossible for user actions (too many MVs or secondary indexes per table, ...) to severely degrade the performance of a node or cluster, which increases stability and availability.
As a C* developer it should be easy to add new Guardrails.
Guardrails are disabled by default and there should be no overhead when Guardrails are disabled.

Non-Goals

enforcing limits on a per-user basis
setting limits dynamically while nodes are running

Timeline

todo

Mailing list / Slack channels

Mailing list:

TBD

Slack channel:

TBD

Discussion threads:

TBD

Related JIRA tickets

JIRA(s):

here

Motivation

Operators of C* want to provide uptime SLAs and prevent users from applying anti-patterns that could potentially bring down a node or severely degrade performance.

Guardrails are a tool to achieve this by setting soft and hard limits that stop users from employing bad practices. For example, when running C* in a cloud environment, as an operator you want to guarantee that certain SLAs can be met by guarding the system from users that would perform CL=ALL writes in a multi-dc cluster.

Audience

Operators of C*

Proposed Changes

The goal of this feature is to have an easy way for operators to enforce system-wide soft and hard limits that ensure good practices, foster availability, and guard the system from wrong usage patterns.

Operators are generally more interested in the overall health of a node/cluster, so enforcing soft/hard limits on a per-user basis is not a goal.

The specific guardrails proposed in this spec are intended to be a starting point with the expectation that more guardrails will be defined over time. All guardrails should be of a form that is enforceable when an operation takes place without introducing significant latency.

Guardrail Classes and Configuration

Guardrail: Interface defining a guardrail that guards against a particular usage/condition.
DefaultGuardrail: Abstract class implementing Guardrail. It implements the default behaviour when the guardrail is triggered, which consists of issuing warnings or throwing errors.
GuardrailsFactory: Interface defining a factory for building instances of Guardrail.
DefaultGuardrailsFactory: Class implementing GuardrailsFactory; it builds instances of DefaultGuardrail.
CustomGuardrailsFactory: Abstract class instantiating a custom GuardrailsFactory, so users can provide their own implementations of guardrails through a system property named cassandra.custom_guardrails_factory_class.
GuardrailsConfig: Configuration settings for Guardrails, which are populated from cassandra.yaml. This contains a main setting, enabled, controlling whether Guardrails are globally active, and individual settings to control each Guardrail.
cassandra.yaml: Allows configuring individual Guardrails, which are globally disabled by default.
Guardrails: Entry point for guardrails, storing all the defined guardrail instances and additional helper methods. These Guardrail instances are built at startup with the provided GuardrailsFactory and GuardrailsConfig.
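
To make the factory lookup concrete, here is a minimal sketch of how a custom GuardrailsFactory might be resolved from the cassandra.custom_guardrails_factory_class system property. The type names and the property name come from the list above; the name(), create(...) and load() methods are assumptions made only for illustration, and DefaultGuardrailsFactory is reduced to a stand-in.

```java
/** Guards against one particular usage or condition. The method is hypothetical. */
interface Guardrail {
    String name();
}

/** Builds Guardrail instances. The create(...) signature is an assumption. */
interface GuardrailsFactory {
    Guardrail create(String name);
}

/** Stand-in for the default factory, which would build DefaultGuardrail instances. */
class DefaultGuardrailsFactory implements GuardrailsFactory {
    @Override
    public Guardrail create(String name) {
        return () -> name; // trivial guardrail that only knows its name
    }
}

/** Resolves the factory named by the system property, else falls back to the default. */
abstract class CustomGuardrailsFactory {
    static final String PROPERTY = "cassandra.custom_guardrails_factory_class";

    static GuardrailsFactory load() {
        String className = System.getProperty(PROPERTY);
        if (className == null || className.isEmpty()) {
            return new DefaultGuardrailsFactory();
        }
        try {
            // The custom class needs an accessible no-argument constructor.
            return Class.forName(className)
                        .asSubclass(GuardrailsFactory.class)
                        .getDeclaredConstructor()
                        .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate guardrails factory " + className, e);
        }
    }
}
```

Under these assumptions, an operator would start the node with -Dcassandra.custom_guardrails_factory_class=<fully qualified class name>, and a missing or broken class would fail fast at startup instead of silently running without guardrails.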

Overview of proposed Guardrails

Guardrails can be in the form of:

numeric threshold with soft/hard limits that trigger a warning or a failure
boolean enabled/disabled flag that triggers a failure
list of disallowed values that trigger a failure
list of ignored values that trigger a warning
pair of boolean predicates that trigger a warning or a failure
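
As an illustration of the two list-based forms, the following sketch rejects disallowed values and records a warning for ignored ones. ValuesGuardrail, guard(...) and warnings() are hypothetical names, the local InvalidRequestException is a stand-in for Cassandra's exception so that the snippet compiles on its own, and exact string matching is an assumption.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Stand-in for Cassandra's InvalidRequestException, so the sketch is self-contained. */
class InvalidRequestException extends RuntimeException {
    InvalidRequestException(String message) {
        super(message);
    }
}

/** Hypothetical guardrail over a list of disallowed and a list of ignored values. */
final class ValuesGuardrail {
    private final String what;
    private final Set<String> disallowed;
    private final Set<String> ignored;
    private final List<String> warnings = new ArrayList<>();

    ValuesGuardrail(String what, Set<String> disallowed, Set<String> ignored) {
        this.what = what;
        this.disallowed = new HashSet<>(disallowed);
        this.ignored = new HashSet<>(ignored);
    }

    /** Fails on a disallowed value, warns on an ignored one, otherwise does nothing. */
    void guard(String value) {
        if (disallowed.contains(value)) {
            throw new InvalidRequestException(what + " " + value + " is not allowed");
        }
        if (ignored.contains(value)) {
            warnings.add(what + " " + value + " is ignored");
        }
    }

    /** Warnings collected so far; a real implementation would log and notify the client. */
    List<String> warnings() {
        return warnings;
    }
}
```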

Reaching a soft limit should issue a warning, whereas reaching a hard limit issues a failure (InvalidRequestException).
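
The soft/hard limit behaviour can be sketched as a small threshold check. This is only an illustration: ThresholdGuardrail and guard(...) are hypothetical names, the local InvalidRequestException stands in for Cassandra's exception, and the sketch assumes that a limit triggers when the value exceeds it and that -1 disables a limit.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for Cassandra's InvalidRequestException, so the sketch is self-contained. */
class InvalidRequestException extends RuntimeException {
    InvalidRequestException(String message) {
        super(message);
    }
}

/** Hypothetical numeric guardrail with a soft (warn) and a hard (fail) limit. */
final class ThresholdGuardrail {
    static final long DISABLED = -1;

    private final String what;
    private final long warnThreshold;
    private final long failThreshold;
    private final List<String> warnings = new ArrayList<>();

    ThresholdGuardrail(String what, long warnThreshold, long failThreshold) {
        this.what = what;
        this.warnThreshold = warnThreshold;
        this.failThreshold = failThreshold;
    }

    /** Checks the hard limit first so that a failing value never produces only a warning. */
    void guard(long value) {
        if (failThreshold != DISABLED && value > failThreshold) {
            throw new InvalidRequestException(
                what + " " + value + " exceeds failure threshold " + failThreshold);
        }
        if (warnThreshold != DISABLED && value > warnThreshold) {
            warnings.add(what + " " + value + " exceeds warning threshold " + warnThreshold);
        }
    }

    /** Warnings collected so far; a real implementation would log and notify the client. */
    List<String> warnings() {
        return warnings;
    }
}
```

Whether a value equal to the limit should already trigger is a design decision the proposal leaves open; the strict comparison here is just one possible reading.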

Below is an overview of a set of proposed guardrails with some example limits (which are subject to change):

Parameter	Example Limit	Notes
Single column size	5 MB	Hard limit to prevent writing a large column value
Number of columns per table	50	Hard limit to prevent creating too many columns per table
Number of fields per UDT	10	Hard limit to prevent creating large UDTs
Number of items per collection	20	Soft limit to prevent creating collections with too many items
Size of a collection	5 MB	Soft limit that warns when encountering large collections

Enable user-provided timestamps	true	Whether to allow user-provided timestamps in write requests (USING TIMESTAMP...)
Enable read-before-write list operations	true	Whether to allow read-before-write list operations (setting/removing an item by index)
Enable logged batch	true	Whether to allow LOGGED batches
Enable truncate table	true	Whether to allow the truncation of tables

Disallowed table properties	compression, compaction	List of table properties that are disallowed to be set by users
Disallowed write consistency levels	ANY, ONE, LOCAL_ONE, ALL	List of Consistency Levels that are disallowed to be used during writes

Ignored table properties	default_time_to_live	List of table properties that trigger a warning


Number of secondary indexes per table	1	Hard limit to prevent having lots of secondary indexes per table
Number of SASI indexes per table	1	Hard limit to prevent having lots of SASI indexes per table
Number of MVs per Table	2	Hard limit to prevent having lots of MVs
Number of user-created Tables	100 (soft) / 200 (hard)	Soft limit issues a warning when exceeded and hard limit issues a failure

Large partition size	100 MB	Soft limit that issues a warning when large partitions are being compacted
Number of partition keys in SELECT	20	Hard limit
Cartesian Product of values in IN condition	25	Hard limit. For example, "a IN (1, 2, ..., 10) AND b IN (1, 2, ..., 10)" results in a cartesian product of 100

Disk usage	70% (soft) / 80% (hard)	Local and Replica Disk usage are monitored to issue warnings/failures when the soft/hard limit is reached
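
The cartesian product row is the least obvious entry in the table, so here is a short worked sketch of the arithmetic. It assumes the guardrail multiplies the number of values in each IN restriction and compares the result with the hard limit; InCartesianProduct and its methods are hypothetical names.

```java
import java.util.List;

/** Hypothetical helper for the "cartesian product of values in IN condition" guardrail. */
final class InCartesianProduct {
    private InCartesianProduct() {
    }

    /** Multiplies the number of values of every IN restriction in the query. */
    static long of(List<Integer> valuesPerInRestriction) {
        long product = 1;
        for (int count : valuesPerInRestriction) {
            // multiplyExact throws ArithmeticException instead of silently overflowing
            product = Math.multiplyExact(product, (long) count);
        }
        return product;
    }

    /** True when the product is above the hard limit, i.e. the query should fail. */
    static boolean exceeds(List<Integer> valuesPerInRestriction, long failThreshold) {
        return of(valuesPerInRestriction) > failThreshold;
    }
}
```

With the example from the table, two IN restrictions of 10 values each give 10 x 10 = 100, which is above the example limit of 25, while two restrictions of 5 values each give exactly 25 and would pass under this strict comparison.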

Configuration of Guardrails

Guardrails will be configured via cassandra.yaml settings as shown below (-1 means disabled):

cassandra.yaml settings

# guardrails
  # enabled: false
  # column_value_size_failure_threshold_in_kb: -1
  # columns_per_table_failure_threshold: -1
  # secondary_index_per_table_failure_threshold: -1
  # materialized_view_per_table_failure_threshold: -1
  # tables_warn_threshold: -1
  # tables_failure_threshold: -1
  # table_properties_disallowed: 
  # write_consistency_levels_disallowed:
  # partition_size_warn_threshold_in_mb: -1
  # partition_keys_in_select_failure_threshold: -1
  # disk_usage_percentage_warn_threshold: -1
  # disk_usage_percentage_failure_threshold: -1
  # in_select_cartesian_product_failure_threshold: -1
  # user_timestamps_enabled: true
  # read_before_write_list_operations_enabled: true
  # fields_per_udt_failure_threshold: -1
  # collection_size_warn_threshold_in_kb: -1
  # items_per_collection_warn_threshold: -1
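
A minimal sketch of how GuardrailsConfig might mirror a few of these settings follows. The field names copy the yaml keys and the defaults copy the commented-out values above; isActive(...) and validate() are hypothetical helpers, and rejecting a warn threshold above the failure threshold is an assumption rather than something this proposal specifies.

```java
/** Hypothetical holder for a subset of the guardrails settings in cassandra.yaml. */
class GuardrailsConfig {
    static final long DISABLED = -1;

    // Defaults match the commented-out yaml values: everything off.
    boolean enabled = false;
    long tables_warn_threshold = DISABLED;
    long tables_failure_threshold = DISABLED;
    boolean user_timestamps_enabled = true;

    /** A threshold is enforced only if guardrails are globally enabled and it is not -1. */
    boolean isActive(long threshold) {
        return enabled && threshold != DISABLED;
    }

    /** Rejects a soft limit that sits above the hard limit (assumed validation rule). */
    void validate() {
        if (tables_warn_threshold != DISABLED
            && tables_failure_threshold != DISABLED
            && tables_warn_threshold > tables_failure_threshold) {
            throw new IllegalArgumentException(
                "tables_warn_threshold (" + tables_warn_threshold
                + ") must not exceed tables_failure_threshold (" + tables_failure_threshold + ")");
        }
    }
}
```

Checking the global enabled flag before anything else keeps the disabled path down to a single boolean test, in line with the goal of no overhead when Guardrails are disabled.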

Migrating existing cassandra.yaml warn/fail thresholds

Once Guardrails are implemented, it would make sense to migrate existing warn/fail thresholds to guardrails. A few that come to mind are:

tombstone_warn_threshold
tombstone_failure_threshold
batch_size_warn_threshold_in_kb
batch_size_fail_threshold_in_kb
unlogged_batch_across_partitions_warn_threshold
compaction_large_partition_warning_threshold_mb

Distinction from Capability Restrictions

Guardrails allow C* operators to impose system-wide restrictions that are configured through cassandra.yaml. Capability restrictions focus on imposing restrictions on particular users and offer a new CQL API to do so. The two concepts are complementary rather than mutually exclusive.

Event logging

In their initial form, Guardrails would issue warnings/failures to the server log file and, when applicable, to the client connection. It would also make sense to emit such events as Diagnostic Events to help troubleshoot these issues, but that is an idea for the future and not part of this CEP.

Test Plan

Unit and integration tests will be added for every guardrail (handling erroneous input, proper notification of warnings/failures, guardrail boundary settings, no warnings/failures when disabled, ...).

(DRAFT) - CEP-3: Guardrails
