Status

Current state:  Adopted

Discussion thread: dev discussion

JIRA: CASSANDRA-16052

Released: Unreleased

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).



Scope

Storage Attached Indexing (SAI) is a new secondary index for the Apache Cassandra® distributed database system.

Goals

SAI is a new index implementation that builds on the advancements made with SASI. Its goals are roughly the following:

  • Provide a solution that enables users to index multiple columns on the same table without suffering scaling problems, especially at write time.
  • Achieve feature parity with existing 2i, then eventually extend to SASI features and beyond.
  • Perform well enough to replace entirely both legacy 2i and SASI. (This would entail deprecation of and then the complete removal of SASI, which is currently experimental.)

Non-Goals

  • Replace a search engine like Elastic or Solr


Approach

Add a new index implementation StorageAttachedIndex (SAI).

SAI builds on many of the techniques that were used in SASI:

  • Being SSTable-attached, SAI benefits from operational symmetry with the Cassandra architecture, allowing zero-copy streaming of indexes.
  • Upon resolving partition keys, rows are loaded using Cassandra's internal partition read command across SSTables and are post-filtered. Query results are delivered in token and clustering key order.
  • Queries are executed via a skip-based, merge-sorted result set across SSTable and memtable indexes using RangeIterator. Query clauses define the column indexes involved in a query (see the sketch after this list).
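
To illustrate that execution model, the sketch below merges two sorted row ID posting streams with a skipTo operation, the way an AND over two indexed clauses can leap-frog matches without scanning either posting list in full. The class and method names are hypothetical stand-ins and do not reflect Cassandra's actual RangeIterator API.

    import java.util.ArrayList;
    import java.util.List;

    // A minimal sketch of skip-based iteration over sorted row ID postings.
    // RowIdIterator is a hypothetical stand-in, not Cassandra's RangeIterator API.
    interface RowIdIterator
    {
        boolean hasNext();
        long peek();              // next row ID without consuming it
        long next();              // consume and return the next row ID
        void skipTo(long target); // advance to the first row ID >= target
    }

    class SortedArrayIterator implements RowIdIterator
    {
        private final long[] rowIds; // sorted ascending
        private int pos;

        SortedArrayIterator(long... rowIds) { this.rowIds = rowIds; }

        public boolean hasNext() { return pos < rowIds.length; }
        public long peek()       { return rowIds[pos]; }
        public long next()       { return rowIds[pos++]; }
        public void skipTo(long target) { while (hasNext() && rowIds[pos] < target) pos++; }
    }

    public class SkipMergeSketch
    {
        // Intersection of two query clauses (an AND), leap-frogging via skipTo so that
        // neither posting list is fully scanned when matches are sparse.
        static List<Long> intersect(RowIdIterator a, RowIdIterator b)
        {
            List<Long> out = new ArrayList<>();
            while (a.hasNext() && b.hasNext())
            {
                long x = a.peek(), y = b.peek();
                if (x == y)     { out.add(a.next()); b.next(); }
                else if (x < y) a.skipTo(y);
                else            b.skipTo(x);
            }
            return out;
        }

        public static void main(String[] args)
        {
            // Postings for two hypothetical clauses; the values are illustrative row IDs.
            RowIdIterator clause1 = new SortedArrayIterator(2, 5, 9, 14, 21);
            RowIdIterator clause2 = new SortedArrayIterator(5, 9, 10, 21, 40);
            System.out.println(intersect(clause1, clause2)); // [5, 9, 21]
        }
    }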

In addition, SAI introduces:

  • Sharing of common data between multiple column indexes on the same table.
  • Compression of posting lists.
  • A new numeric index implementation that yields significant improvements in range query performance.


The SASI codebase was used as inspiration during the development of SAI. Architecturally, SASI provides many benefits over the native 2i implementation.

Timeline

SAI will target 5.0, with development proceeding in a trunk-based feature branch under the CASSANDRA-16052 Jira epic.

Step 1: Merge Index and related interface changes to feature branch (see CASSANDRA-16092).

Step 2: Merge full SAI implementation into feature branch.

Step 3: Merge SAI to trunk as an experimental feature once testing/validation is complete. (It may not require a feature flag, given all secondary index usage is optional.)

Step 4: Deprecate SASI in 5.0 (given it is already experimental), and remove it in 6.0.

Step 5: Make SAI the default secondary index implementation in DDL space in 5.0, deprecate legacy 2i as well, and remove it in 6.0.




Motivation

To offer modern and expected indexing features on the Cassandra database platform.

The main advantages of SAI over existing indexes are:

  • significantly reduced disk usage
  • significantly better numeric range query performance


In particular, SAI shares common index data across multiple indexes on the same table. This unique feature gives users the ability to create many more indexes without running into scalability issues.

The following charts give an indication of the space-saving advantage of SAI over the alternatives, using a financial time series data model based on a real user workload. Tests were performed on an unreleased codebase; they will be updated as soon as an OSS branch is available.


Audience

The Cassandra development and user communities. 

Proposed Changes

Add a new index type "StorageAttachedIndex":

    CREATE CUSTOM INDEX index_name ON person (column_name) USING 'StorageAttachedIndex' WITH OPTIONS = { }

Version 1 Features

  • AND queries
  • Numeric range queries
  • Non-variable length numeric types
  • Text type indexes and equality queries
  • SSTable attached
  • Optional case sensitivity
  • Optional unicode normalization
  • Tokenization
  • Index versioning
  • Row-aware query path

Version 2 Features

  • Prefix LIKE
  • OR Queries
  • Range queries on all types
  • Global sorting

Write path

The write path is mostly the same as SASI's: multiple column indexes are attached to a single memtable.

During flush, SAI uses the in-memory index to generate the on-disk index files, avoiding a second indexing pass over the flushed SSTable (see the sketch below).
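
As a rough illustration of that idea, the sketch below models a per-column memtable index as a sorted map from indexed value to row IDs that is streamed straight into an on-disk segment at flush time. The class names and the text output format are invented for this sketch and do not reflect SAI's actual classes or on-disk format.

    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.Collections;
    import java.util.Map;
    import java.util.SortedSet;
    import java.util.TreeSet;
    import java.util.concurrent.ConcurrentSkipListMap;

    // Hypothetical sketch of a per-column memtable index: indexed value -> sorted row IDs.
    // At flush, the in-memory structure is streamed straight into an on-disk segment,
    // so the newly flushed SSTable never has to be read back and re-indexed.
    public class MemtableIndexSketch
    {
        private final ConcurrentSkipListMap<String, SortedSet<Long>> terms = new ConcurrentSkipListMap<>();

        // Called on the write path when a row containing the indexed column is inserted.
        public void index(String value, long rowId)
        {
            terms.computeIfAbsent(value, v -> Collections.synchronizedSortedSet(new TreeSet<>())).add(rowId);
        }

        // Called during flush: terms and postings are already sorted, so they can be
        // written sequentially. The text format here is purely illustrative.
        public void flush(File segmentFile) throws IOException
        {
            try (PrintWriter out = new PrintWriter(new FileWriter(segmentFile)))
            {
                for (Map.Entry<String, SortedSet<Long>> e : terms.entrySet())
                    out.println(e.getKey() + " -> " + e.getValue());
            }
        }

        public static void main(String[] args) throws IOException
        {
            MemtableIndexSketch index = new MemtableIndexSketch();
            index.index("london", 0);
            index.index("paris", 1);
            index.index("london", 2);
            index.flush(new File("segment-demo.txt"));
        }
    }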

On-disk versioning of indexes is supported by embedding version information in the on-disk file format.

Read path

The read path in SAI is similar to the SASI read path, with a merge of postings from the in-memory and SSTable indexes using the RangeIterator framework.

New or Changed Public Interfaces

Secondary index group API

SAI uses an extension of the Cassandra secondary index API to

  • allow indexes on the same table to receive centralized lifecycle events, via a mechanism called secondary index groups. Sharing data between multiple column indexes on the same table allows SAI to realise significant space savings over other index implementations.
  • allow an index group to analyze a user query and provide a query plan that leverages all available indexes within the group (a sketch of this idea follows below).
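
The sketch below illustrates the index group idea: member indexes register with a single group, the group receives table-level lifecycle events once, and it selects which member indexes can serve a query. All names are hypothetical and do not reflect the actual interfaces added to Cassandra's secondary index API.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Illustrative sketch only — not the actual secondary index group API.
    public class IndexGroupSketch
    {
        static final class ColumnIndex
        {
            final String column;
            final Set<String> operators; // operators this index can serve, e.g. "=", "<", ">"

            ColumnIndex(String column, Set<String> operators)
            {
                this.column = column;
                this.operators = operators;
            }
        }

        static final class IndexGroup
        {
            private final List<ColumnIndex> indexes = new ArrayList<>();

            void addIndex(ColumnIndex index) { indexes.add(index); }

            // One flush event for the whole group: shared per-SSTable components are written
            // once, however many column indexes exist on the table.
            void onSSTableFlushed(String sstable) { System.out.println("wrote shared components for " + sstable); }

            // Select the member indexes that can serve the query's restrictions (column -> operator).
            List<ColumnIndex> planQuery(Map<String, String> restrictions)
            {
                List<ColumnIndex> plan = new ArrayList<>();
                for (ColumnIndex index : indexes)
                    if (index.operators.contains(restrictions.getOrDefault(index.column, "")))
                        plan.add(index);
                return plan;
            }
        }

        public static void main(String[] args)
        {
            IndexGroup group = new IndexGroup();
            group.addIndex(new ColumnIndex("age", Set.of("=", "<", ">")));
            group.addIndex(new ColumnIndex("name", Set.of("=")));
            group.onSSTableFlushed("nb-1-big-Data.db");
            System.out.println(group.planQuery(Map.of("age", ">", "name", "=")).size()); // 2
        }
    }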


Secondary index groups are a breaking change, and other index implementations will need modification for compatibility.

Compatibility, Deprecation, and Migration Plan

SAI is an optional feature. Users will not be impacted unless they explicitly use SAI indexes.

Migrating from existing indexes to SAI is out of scope.

Potentially there is scope to bring concepts from SAI to SASI; there are pros and cons to this. While it is desirable to avoid more index implementations, SAI significantly differs in the way that data is stored and queried. Trying to maintain backward compatibility with existing SASI indexes would be difficult within the same index implementation.

At some point the C* community may want to consider making SAI the primary secondary index implementation. However, it is out of scope for this proposal.

Test Plan

SAI will have a combination of unit tests, multi-node distributed tests, fuzz tests, and large-scale / heavy workload performance benchmarks.

Unit tests

JUnit tests that cover all aspects of the system including:

  • CQL 
  • Data definition language
  • Index format classes
  • Feature and functional tests

Distributed tests

  • Cassandra in-JVM dtests that test SAI on multiple cluster sizes.

Fuzz Tests

  • Harry has matured to the point where we should be able to fuzz any new indexing functionality with minimal (if any) improvements.

Long-Running Stress Test

  • A long running distributed test that stresses SAI in various ways.

Performance

Publicly available performance tests that:

  • Measure performance vs 2i.
  • Measure performance vs SASI.


Using a range of cluster sizes the tests will measure:

  • Max indexing throughput.
  • Max query throughput.
  • Mixed concurrent read write stability and latencies.

Rejected Alternatives

There have been four secondary index concepts thus far:

  • Native 2i
  • Lucene
  • Materialized views
  • SASI

These solutions have the following drawbacks:

  • 2i
    • The 2i architecture is known to have performance issues.
    • As indexes are not storage-attached, it is possible for a secondary index to become out of sync with its primary table.
    • Advanced indexing features like tokenisation, regex, and geo queries will never be possible using the 2i architecture.
  • Read-before-write based solutions (Lucene/MV)
    • Cassandra is not fast enough to load an entire row and then run an indexing routine on the write path. The read step drags performance far below that of an index-less workload, and users find the disparity extreme.
    • Read-before-write makes real-time indexing impossible due to the resolution wait time.
    • The indexes may not be SSTable-attached and therefore cannot travel with the SSTable during operations like zero-copy streaming.


The SASI architecture was the best out of many that were evaluated, and was used as a starting point for SAI.

Challenges

Having outlined the potential benefits of SAI, there are two considerable architectural challenges it will face.

The first is that because SAI index components are SSTable-attached, its performance will be highly dependent on the compaction strategy in use. The more SSTables SAI must search, the slower local queries will be. This is similar to the general problem in C* around hitting too many SSTables for individual reads, but can be much worse for SAI, unless queries are restricted to partitions or small token ranges. A particularly painful case would be a query without token or partition restrictions on a table that uses LCS, which, by design, maintains hundreds or thousands of small SSTables.

Second, all C* secondary index implementations (and SAI would be no different) make distributed queries on top of the range read mechanism. This is problematic for two reasons. The first is that even at a read CL of ONE/LOCAL_ONE, N (# of nodes) / RF (replication factor) nodes/replicas must be contacted for any non partition/token restricted query. At higher CLs, we can effectively contact the entire cluster. The second is that C* range queries are made in a single pass per replica, with all replicas returning responses containing full rows over the wire rather than row identifiers and sorting information (as systems like SolrCloud/ES do, for instance). Historically, attempts have been made to mitigate this by only incrementally querying the token ring, but making multiple passes when the entire ring must be queried introduces considerable latency. The only way to avoid sending large amounts of data over the wire while simultaneously avoiding multiple query rounds is for the query to turn up a small number of results across the entire ring. This is explained in more detail in this blog post.

The only "easy" way around these two challenges is to focus our efforts on queries that are restricted to either partitions or small token ranges. These queries behave well locally even on LCS (given levels contain token-disjoint SSTables, and assuming a low number of unleveled SSTables), avoid fan-out and all of its secondary pitfalls, and allow us to make queries at varying CLs with reasonable performance. Attempting to fix the local problems around compaction strategy could mean either restricted strategy usage or partially abandoning SSTable-attachment. Attempting to fix distributed read path problems by pushing the design towards IR systems like ES could compromise our ability to use higher read CLs.

Addendum

Index Versioning

Index versioning is supported by including the index version in the on-disk file name.

The index version is mapped to a codec internally to write and read the on-disk index files.
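
A minimal sketch of that mechanism follows; the file name layout, version identifiers, and codec classes are all invented for illustration and do not reflect SAI's real naming scheme or codecs.

    import java.util.Locale;

    // Hypothetical sketch: derive the index version from an on-disk file name and map it to a codec.
    public class IndexVersionSketch
    {
        interface Codec { void read(); }

        static final class V1Codec implements Codec
        {
            public void read() { System.out.println("reading version 1 on-disk format"); }
        }

        enum Version
        {
            AA(new V1Codec()); // one entry per supported on-disk version (identifier is illustrative)

            private final Codec codec;
            Version(Codec codec) { this.codec = codec; }
            Codec codec() { return codec; }
        }

        // e.g. "SAI+AA+age_idx+TermsData.db" -> Version.AA (the naming scheme is invented here)
        static Version parse(String fileName)
        {
            return Version.valueOf(fileName.split("\\+")[1].toUpperCase(Locale.ROOT));
        }

        public static void main(String[] args)
        {
            parse("SAI+AA+age_idx+TermsData.db").codec().read();
        }
    }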

Index Format Version 1

The following applies to the version 1 index format.

JMX metrics

Users will have access to extensive JMX metrics to help them isolate performance issues on their clusters.

There are enough that listing them here would be counterproductive; however, they fall into the following groups:

  • Index - Metrics specific to a single index.
  • Column query - Metrics specific to an indexed column.
    • Trie index - Low level metrics for trie indexes.
    • KD-Tree - Low level metrics for kd-tree indexes.
  • Index group - Metrics for shared resources when multiple columns are indexed on the same table.
  • Table query - Metrics for all queries on a single table.
    • Per query - Metrics for queries.

Metrics include items such as: disk usage, memory usage, query latencies, compaction statistics, chunk cache hits/misses/lookups, open files.

Virtual Tables

Sets of the above metrics are also available through virtual tables. These metrics have been grouped into:

  • Index - global index metrics about the index build and query status and on-disk sizing.
  • SSTable - per-SSTable metrics
  • Segment - per-index metrics 

The per-SSTable and per-index metrics relate to the on-disk structures described below.

Terminology

  • Row ID - A monotonically increasing integer associated with every row in an SSTable. It is stored in the index structures instead of the key token or key offset because it compresses better.
  • Postings/posting list - Sorted row IDs that match a given indexed value.
  • Primary key - A partition key and clustering representing a row in an SSTable.
  • Primary key store - A bi-directional block store allowing Row ID → Primary Key and Primary Key → Row ID lookups.
  • Segment - The smallest unit of on-disk indexing structure, flushed during compaction to reduce memory pressure. Multiple segments of an index are written to the same physical file.

SAI is optimised for storage. The primary key store is written once per SSTable, and column indexes access it using row IDs. The primary key store uses an on-disk trie containing primary keys for primary key → row ID lookups and a prefix-compressed block store for row ID → primary key lookups.

Index implementations need only store an integer row ID in their postings list. Row IDs are translated to a primary key via the primary key store.

As multiple indexes share the primary key store files, it becomes feasible to index many columns on the same table without significantly increasing the index size.
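
The sketch below illustrates the bi-directional contract of the primary key store using plain in-memory maps; the real store is an on-disk trie plus a prefix-compressed block store, and all names here are hypothetical.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Conceptual sketch of the per-SSTable primary key store shared by all column indexes.
    public class PrimaryKeyStoreSketch
    {
        record PrimaryKey(String partitionKey, String clustering) {}

        private final List<PrimaryKey> byRowId = new ArrayList<>();         // row ID -> primary key
        private final Map<PrimaryKey, Long> byPrimaryKey = new HashMap<>(); // primary key -> row ID

        // Row IDs are assigned monotonically as rows are written to the SSTable.
        long add(PrimaryKey key)
        {
            long rowId = byRowId.size();
            byRowId.add(key);
            byPrimaryKey.put(key, rowId);
            return rowId;
        }

        PrimaryKey primaryKey(long rowId) { return byRowId.get((int) rowId); }
        long rowId(PrimaryKey key)        { return byPrimaryKey.get(key); }

        public static void main(String[] args)
        {
            PrimaryKeyStoreSketch store = new PrimaryKeyStoreSketch();
            long id = store.add(new PrimaryKey("user-42", "2020-09-01"));
            // A column index stores only the integer row ID in its postings; the shared
            // store resolves it back to a primary key at query time.
            System.out.println(store.primaryKey(id));
        }
    }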


Indexes may store their data in segments, allowing constrained memory usage and increasing the efficiency of queries where segments can be excluded early.


Index implementation differs depending on the column type being indexed. Currently there are two implementations:

  • Numeric - Based on a kd-tree
  • Text - Based on tries.


Numeric indexes

Numeric types, such as timestamp, smallint, and double, are indexed on disk as a balanced binary search tree with postings at both internal and leaf nodes. A modified version of the one-dimensional block kd-tree from Lucene is used.



Row ID postings are kept separate from the tree to enable efficient retrieval of results in token order, an ordering required by C*. Leaf and (some) internal nodes have row ID postings lists.

When querying, if a node is covered by the entire range of the query then the entire row ID postings for that node can be added to the result. 

If a node is only partially covered by the range then the row IDs are filtered and sorted. To filter the row IDs each leaf node maintains a mapping of value to index within the row ID postings list. At most two leaf nodes will need filtering for a bounded query.

Once the set of postings lists is obtained, a sorted merge can be used to stream the results in token order.
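
The sketch below shows a much-simplified version of this query shape, modelling only leaf blocks (no internal tree nodes): fully covered leaves contribute their postings wholesale, boundary leaves are filtered by value, and the collected row IDs are then sorted for streaming. Names and structures are illustrative only.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Simplified sketch of a numeric range query over leaf blocks with attached row ID postings.
    public class RangeQuerySketch
    {
        static final class Leaf
        {
            final long minValue, maxValue;
            final long[] values; // sorted by value
            final long[] rowIds; // rowIds[i] is the row ID whose indexed value is values[i]

            Leaf(long[] values, long[] rowIds)
            {
                this.values = values;
                this.rowIds = rowIds;
                this.minValue = values[0];
                this.maxValue = values[values.length - 1];
            }
        }

        static List<Long> rangeQuery(List<Leaf> leaves, long lower, long upper)
        {
            List<Long> result = new ArrayList<>();
            for (Leaf leaf : leaves)
            {
                if (leaf.maxValue < lower || leaf.minValue > upper)
                    continue; // leaf entirely outside the range
                if (lower <= leaf.minValue && leaf.maxValue <= upper)
                {
                    for (long rowId : leaf.rowIds) // fully covered: take the whole postings list
                        result.add(rowId);
                }
                else
                {
                    for (int i = 0; i < leaf.values.length; i++) // boundary leaf: filter by value
                        if (leaf.values[i] >= lower && leaf.values[i] <= upper)
                            result.add(leaf.rowIds[i]);
                }
            }
            Collections.sort(result); // merge step: stream results in row ID order
            return result;
        }

        public static void main(String[] args)
        {
            List<Leaf> leaves = List.of(new Leaf(new long[]{ 1, 3, 5 }, new long[]{ 10, 2, 7 }),
                                        new Leaf(new long[]{ 6, 8, 12 }, new long[]{ 1, 9, 4 }));
            System.out.println(rangeQuery(leaves, 3, 8)); // [1, 2, 7, 9]
        }
    }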

Numeric Range Performance Query Results

The kd-tree postings design yields constant-time query performance across all match sizes with the same query limit.

The following table gives an indication of algorithm performance when running queries against a 4 billion row dataset at 200 queries per second, where the number of range matches is given in the left-hand column. Tests were performed on an unreleased codebase; they will be updated as soon as an OSS branch is available, and the numbers are expected to change.

Text indexes

text, varchar, and ascii types use an inverted index consisting of a dictionary of terms and posting lists for the terms.

The terms dictionary is implemented using a trie data structure, which provides excellent term prefix compression; the postings file offset is stored at the leaf nodes of the trie.

Below is a diagram that illustrates the basic structure of the trie. (Note the dotted lines that denote single pages on disk, which allow transitions to be made without additional reads.)

Postings are delta-encoded and then laid out into either frame-of-reference (FoR) encoded blocks (when a maximum block size is reached) or blocks of variable-width integers (when a complete block, usually the last one, cannot be created).
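
To illustrate the layout, the sketch below delta-encodes a posting list and shows how a full block would be bit-packed at the width of its largest delta, while a trailing partial block would fall back to variable-width integers. The block size and packing details are invented for this sketch.

    import java.util.Arrays;

    // Sketch of posting list compression: delta encoding plus frame-of-reference (FoR) style blocks.
    public class PostingsEncodingSketch
    {
        static final int BLOCK_SIZE = 4; // real implementations use much larger blocks

        // Convert absolute, sorted row IDs into gaps, which are small and compress well.
        static long[] deltas(long[] rowIds)
        {
            long[] out = new long[rowIds.length];
            long previous = 0;
            for (int i = 0; i < rowIds.length; i++)
            {
                out[i] = rowIds[i] - previous;
                previous = rowIds[i];
            }
            return out;
        }

        // For a full block, FoR packing writes every delta using just enough bits for the largest one.
        static int bitsRequired(long maxDelta)
        {
            return Math.max(1, 64 - Long.numberOfLeadingZeros(maxDelta));
        }

        public static void main(String[] args)
        {
            long[] postings = { 3, 7, 8, 15, 100, 103, 104 };
            long[] gaps = deltas(postings);
            System.out.println(Arrays.toString(gaps)); // [3, 4, 1, 7, 85, 3, 1]

            // The first BLOCK_SIZE gaps form a full block and are bit-packed (FoR); the trailing
            // 3 gaps cannot fill a block and would be written as variable-width integers instead.
            long[] firstBlock = Arrays.copyOfRange(gaps, 0, BLOCK_SIZE);
            long max = Arrays.stream(firstBlock).max().getAsLong();
            System.out.println("first block packed at " + bitsRequired(max) + " bits per delta"); // 3
        }
    }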
