Status

Current state: Under Discussion

Discussion thread: dev discussion

JIRA: CASSANDRA-16052

Released: Unreleased

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Scope

Storage Attached Indexing (SAI) is a new secondary index for the Apache Cassandra® distributed database system.

Goals

A new index implementation that builds on the advancements made with SASI.

To provide a solution that enables users to index multiple columns on the same table without suffering scaling problems.

Feature parity with 2i, but will eventually extend to SASI features and beyond.

Not in current CEP scope:

Replace a search engine like Elastic or Solr
Tokenization
Prefix/wildcard/regex
Geo
Sorting

Approach

Add a new index implementation StorageAttachedIndex (SAI).

SAI builds on many of the techniques that were used in SASI:

Being SSTable attached, SAI benefits from operational symmetry with the Cassandra architecture allowing zero copy streaming of indexes.
Upon resolving partition keys, rows are loaded using Cassandra’s internal partition read command across SSTables and are post filtered. Query results are delivered in token clustering key order.
Queries are executed via a skip based merge sorted result set across SSTable and memtable indexes using RangeIterator. Query clauses define the column indexes involved in a query.

In addition SAI introduces:

Sharing of common data between multiple column indexes on the same table.
Compression of posting lists.
A unique implementation that yields significant improvements in numeric range query performance.

The SASI codebase was used as inspiration during the development of SAI. Architecturally SASI provides many benefits over the native 2i implementation.

Timeline

Post 4.0, however, a proposed implementation could be open-sourced earlier for review.

Mailing list / Slack channels

Mailing list:

here

Slack channel:

cassandra-sai

Discussion threads:

save above

Related JIRA tickets

JIRA(s):

CASSANDRA-16052

Motivation

To offer modern and expected indexing features on the Cassandra database platform.

The main advantages of SAI over existing indexes are:

significantly reduced disk usage
great numeric range performance.

In particular, SAI shares common index data across multiple indexes on the same table. This unique feature gives users the ability to create many more indexes without running into scalability issues.

The following charts give an indication of the space saving advantage of using SAI vs alternatives using a financial time series data model based on a relevant user. Tests were performed on an unreleased codebase, however they will be updated as soon as an OSS branch is available.

Audience

The Cassandra development and user communities.

Proposed Changes

Add a new index type "StorageAttachedIndex":

CREATE CUSTOM INDEX ON person (index_name) USING 'StorageAttachedIndex' WITH OPTIONS = { }

Initial features

AND Queries
Numeric range queries
Non-variable length numeric types
Text type indexes and equality queries
SSTable attached
Optional case sensitivity
Optional unicode normalization
Partition based iteration

Indexing options

case_sensitive - If the index is case sensitive or not.
normalize - If the index uses unicode normalization or not.

Terminology

Row ID - A monotonic increasing integer associated with every row in a sstable. It’s stored in an index structure instead of key token or key offset, because it compresses better.
Postings/posting-list - Sorted row ids that match a given indexed value.
Token file - An index of Row ID -> partition key token for every row in the sstable.
Offset file - An index of Row ID -> partition key offset on the data/primary-index file for every row in the sstable.
Segment - A smallest unit of on-disk indexing structure that is flushed during compaction to reduce memory pressure. Multiple segments of an index are written to the same physical file.

Storage

SAI is optimised for storage. Tokens and offsets are stored once per SSTable. Column indexes access the token and offset files using a row ID. Offsets are compressed using Frame of Reference (FoR) encoding while tokens are not because tokens consume the full 8 bytes and therefore cannot be compressed.

Index implementations need only store an integer row ID in their postings list. Row IDs are translated to decorated key via the token/offset files and SSTableReader#keyAt.

As multiple indexes share the token/offset files, it becomes feasible to index many columns on the same table without significantly increasing the index size.

Indexes may store their data in segments, allowing constrained memory usage and increasing the efficiency of queries where segments can be excluded early.

Index implementation differs depending on the column type being indexed. Currently there are two implementations:

Numeric - Based on a kd-tree
Text - Based on tries.

Numeric indexes

Numeric types, like timestamp, smallint, and double are indexed on disk as a balanced binary search tree with postings at both internal and leaf nodes. A modified version of the one-dimensional block kd-tree from Lucene is used.

Row ID postings are kept separate to the tree to enable efficient retrieval of results in token order. This ordering is required by C*. Leaf and (some) internal nodes have row ID postings lists.

When querying, if a node is covered by the entire range of the query then the entire row ID postings for that node can be added to the result.

If a node is only partially covered by the range then the row IDs are filtered and sorted. To filter the row IDs each leaf node maintains a mapping of value to index within the row ID postings list. At most two leaf nodes will need filtering for a bounded query.

Once the set of postings lists are obtained a sorted merge can be used to stream the results in token order.

Numeric Range Performance Query Results

The kdtree postings design yields constant time query performance across all match sizes with the same query limit.

The following table gives an indication of algorithm performance when running queries against a 4 billion row dataset at 200 queries per second, where the number of range matches is given in the left hand column. Tests were performed on an unreleased codebase, however they will be updated as soon as an OSS branch is available, and it is expected that the numbers will change.

Text indexes

text, varchar, and ascii types use an inverted index consisting of a dictionary of terms and posting lists for the terms.

The terms dictionary is implemented using a trie data structure which provides excellent term prefix compression and posting file offset is being stored at the leaf node of trie.

Below is a diagram that illustrates the basic structure of the trie. (Note the dotted lines that denote single pages on disk, which allow transitions to be made without additional reads.)

Postings are delta encoded and then laid out into either frame of reference (FoR) encoded blocks (when a maximum block size is reached) or blocks of variable width integers (when a complete block, usually the last one, cannot be created).

Tokenization is out of scope; however, this design does not hinder its implementation.

Write path

Write path is mostly the same as SASI where multiple column indexes are attached to a single memtable.

During flush, SAI will make use of the index memtable to generate an on-disk index file to avoid re-indexing the flushed sstable twice.

Read path

The read path in SAI is similar to the SASI read path with a merge of postings from the in-memory and SSTable indexes using the RangeIterator framework.

New or Changed Public Interfaces

Secondary index group API

SAI uses an extension of the Cassandra secondary index API to allow indexes on the same table to receive centralized lifecycle events called secondary index groups. Sharing of data between multiple column indexes on the same table allows SAI disk usage to realise significant space savings over other index implementations.

Secondary index groups are a breaking change, and other index implementations will need modification for compatibility.

New JMX metrics

Users will have access to extensive JMX metrics to help them isolate performance issues on their clusters.

There are enough that listing them here would be counterproductive; however they fall into the following groups:

Index - Metrics specific to a single index.
Column query - Metrics specific to an indexed column.

Trie index - Low level metrics for trie indexes.
KD-Tree - Low level metrics for kd-tree indexes.

Index group - Metrics for shared resources when multiple columns are indexes on the same table.
Table query - Metrics for all queries on a single table.

Per query - Metrics for queries.

Metrics include items such as: disk usage, memory usage, query latencies, compaction statistics, chunk cache hits/misses/lookups, open files.

Compatibility, Deprecation, and Migration Plan

SAI is an optional feature. Users will not be impacted unless they explicitly use SAI indexes.

Migrating from existing indexes to SAI is out of scope.

Potentially there is scope to bring concepts from SAI to SASI; there are pros and cons to this. While it is desirable to avoid more index implementations, SAI significantly differs in the way that data is stored and queried. Trying to maintain backward compatibility with existing SASI indexes would be difficult within the same index implementation.

At some point the C* community may want to consider making SAI the primary secondary index implementation. However, it is out of scope for this proposal.

Test Plan

SAI will have a combination of unit tests, multi-node distributed tests, and large-scale / heavy workload performance benchmarks.

Unit tests

JUnit tests that cover all aspects of the system including:

CQL
Data definition language
Index format classes
Feature and functional tests

Distributed tests

Cassandra in-jvm dTests that test SAI on multiple cluster sizes.

Long running stress test

A long running distributed test that stresses SAI in various ways.

Performance

Publicly available performance tests that:

Measure performance vs 2i.
Measure performance vs SASI.

Using a range of cluster sizes the tests will measure:

Max indexing throughput.
Max query throughput.
Mixed concurrent read write stability and latencies.

Rejected Alternatives

There have been 4 secondary index concepts thus far:

Native 2i
Lucene
Materialized views

These solutions have the following drawbacks:

2i

The 2i architecture is known to have performance issues.
As indexes are not storage attached it is possible for a secondary index to become out of sync with its primary table.
Advanced indexing like tokenisation regex and geo will never be possible using the 2i architecture.

Read before write based solutions (Lucene/MV)

Cassandra is not fast enough to load an entire row, then do an indexing routine. The read part makes performance much less than an index-less workload. Users find the disparity to be extremely high.
RbW makes realtime impossible due to the resolution wait time.
The indexes may not be sstable attached and therefore cannot travel with the sstable during things like zero copy streaming.

The SASI architecture was the best out of many that were evaluated, and was used as a starting point for SAI.

Space shortcuts

Page tree

Status

Scope

Goals

Timeline

Mailing list / Slack channels

Related JIRA tickets

Motivation

Audience

Proposed Changes

Initial features

Indexing options

Terminology

Storage

Numeric indexes

Numeric Range Performance Query Results

Text indexes

Write path

Read path

New or Changed Public Interfaces

Secondary index group API

New JMX metrics

Compatibility, Deprecation, and Migration Plan

Test Plan

Unit tests

Distributed tests

Long running stress test

Performance

Rejected Alternatives

Space shortcuts

Page tree

CEP-7: Storage Attached Index

Status

Scope

Goals

Timeline

Mailing list / Slack channels

Related JIRA tickets

Motivation

Audience

Proposed Changes

Initial features

Indexing options

Terminology

Storage

Numeric indexes

Numeric Range Performance Query Results

Text indexes

Write path

Read path

New or Changed Public Interfaces

Secondary index group API

New JMX metrics

Compatibility, Deprecation, and Migration Plan

Test Plan

Unit tests

Distributed tests

Long running stress test

Performance

Rejected Alternatives