Table of Contents


...

Status

Current state: Adopted

Discussion thread: dev discussion

JIRA: CASSANDRA-16052

Released: Unreleased

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

...

Storage Attached Indexing (SAI) is a new secondary index for the Apache Cassandra® distributed database system.

Goals

SAI is a new index implementation that builds on the advancements made with SASI. Its goals are roughly the following:

  • Provide a solution that enables users to index multiple columns on the same table without suffering scaling problems, especially at write time.

...

  • Achieve feature parity with existing 2i,

...

  • then eventually extend to SASI features and beyond.

...

  • Perform well enough to entirely replace both legacy 2i and SASI. (This would entail the deprecation and then complete removal of SASI, which is currently experimental.)

Non-Goals

  • Replace a search engine like Elastic or Solr
  • Tokenization
  • Prefix/wildcard/regex
  • Geo
  • Sorting


Approach

Add a new index implementation StorageAttachedIndex (SAI).

...

The SASI codebase was used as inspiration during the development of SAI. Architecturally SASI provides many benefits over the native 2i implementation.

Timeline

SAI will target 5.0, with development proceeding in a trunk-based feature branch under the CASSANDRA-16052 Jira epic.

Step 1: Merge Index and related interface changes to the feature branch (see CASSANDRA-16092).

Step 2: Merge the full SAI implementation into the feature branch.

Step 3: Merge SAI to trunk as an experimental feature once testing/validation is complete. (It may not require a feature flag, given that all secondary index usage is optional.)

Step 4: Deprecate SASI in 5.0 (given it is already experimental), and remove it in 6.0.

Step 5: Make SAI the default secondary index implementation in DDL space in 5.0, deprecate legacy 2i as well, and remove it in 6.0.

Mailing list / Slack channels

...

    CREATE CUSTOM INDEX ON person (index_name) USING 'StorageAttachedIndex' WITH OPTIONS = { }

...

Version 1 Features

  • AND queries
  • Numeric range queries
  • Non-variable length numeric types
  • Text type indexes and equality queries
  • SSTable attached
  • Optional case sensitivity
  • Optional unicode normalization
  • Partition based iteration

Indexing options

  • case_sensitive - Whether the index is case-sensitive.
  • normalize - Whether the index applies Unicode normalization (see the sketch below).
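
For illustration, the following minimal sketch shows how these two options might be applied to a term before it is indexed, assuming standard Java case folding and Unicode NFC normalization; the class name, option defaults, and method names are illustrative rather than the actual SAI implementation.

    import java.text.Normalizer;
    import java.util.Locale;
    import java.util.Map;

    // Hypothetical helper showing how the case_sensitive and normalize options could be
    // applied to a term before it is written to the index.
    public final class TermTransformer
    {
        private final boolean caseSensitive;
        private final boolean normalize;

        public TermTransformer(Map<String, String> options)
        {
            // Option values arrive as strings from WITH OPTIONS = { ... }; the defaults here are assumptions.
            this.caseSensitive = Boolean.parseBoolean(options.getOrDefault("case_sensitive", "true"));
            this.normalize = Boolean.parseBoolean(options.getOrDefault("normalize", "false"));
        }

        public String transform(String term)
        {
            String result = term;
            if (normalize)
                result = Normalizer.normalize(result, Normalizer.Form.NFC); // Unicode normalization
            if (!caseSensitive)
                result = result.toLowerCase(Locale.ROOT);                   // case folding
            return result;
        }
    }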


...

  • Tokenization
  • Index versioning
  • Row-aware query path

Version 2 Features

  • Prefix LIKE
  • OR Queries
  • Range queries on all types
  • Global sorting

Write path

The write path is mostly the same as SASI's, with multiple column indexes attached to a single memtable.

During flush, SAI uses the index memtable to generate an on-disk index file, avoiding the need to re-index the flushed SSTable.

On-disk versioning of indexes is supported by embedding version information in the on-disk file format.

Read path

The read path in SAI is similar to the SASI read path with a merge of postings from the in-memory and SSTable indexes using the RangeIterator framework.

...


Secondary index groups are a breaking change, and other index implementations will need modification for compatibility.


Compatibility, Deprecation, and Migration Plan

...

  • Native 2i
  • Lucene
  • Materialized views
  • SASI

These solutions have the following drawbacks:

...

The SASI architecture was the best out of many that were evaluated, and was used as a starting point for SAI.

Challenges

Having outlined the potential benefits of SAI, there are two considerable architectural challenges it will face.

The first is that because SAI index components are SSTable-attached, its performance will be highly dependent on the compaction strategy in use. The more SSTables SAI must search, the slower local queries will be. This is similar to the general problem in C* around hitting too many SSTables for individual reads, but can be much worse for SAI, unless queries are restricted to partitions or small token ranges. A particularly painful case would be a query without token or partition restrictions on a table that uses LCS, which, by design, maintains hundreds or thousands of small SSTables.

Second, all C* secondary index implementations (and SAI would be no different) make distributed queries on top of the range read mechanism. This is problematic for two reasons. The first is that even at a read CL of ONE/LOCAL_ONE, N (# of nodes) / RF (replication factor) nodes/replicas must be contacted for any non-partition/token-restricted query. At higher CLs, we can effectively contact the entire cluster. Also, C* range queries are made in a single pass per replica, with all replicas returning responses containing full rows over the wire rather than row identifiers and sorting information (as systems like SolrCloud/ES do, for instance). Historically, attempts have been made to mitigate this by only incrementally querying the token ring, but making multiple passes when the entire ring must be queried introduces considerable latency. The only way to avoid sending large amounts of data over the wire while simultaneously avoiding multiple query rounds is for queries to turn up only a small number of results across the entire ring. This is explained in more detail in this blog post.

The only "easy" way around these two challenges is to focus our efforts on queries that are restricted to either partitions or small token ranges. These queries behave well locally even on LCS (given levels contain token-disjoint SSTables, and assuming a low number of unleveled SSTables), avoid fan-out and all of its secondary pitfalls, and allow us to make queries at varying CLs with reasonable performance. Attempting to fix the local problems around compaction strategy could mean either restricted strategy usage or partially abandoning SSTable-attachment. Attempting to fix distributed read path problems by pushing the design towards IR systems like ES could compromise our ability to use higher read CLs.

Addendum

Index Versioning

Index versioning is supported by including the index version in the on-disk file name.

Internally, the index version is mapped to a codec that writes and reads the on-disk index files.
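
As a rough illustration of that mapping, the sketch below resolves a codec from a version token embedded in a component file name; the naming pattern, version token, and Codec interface are assumptions for illustration, not the actual on-disk naming scheme.

    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustrative only: resolve the codec for an index component from a version token in its file name.
    public final class IndexVersions
    {
        interface Codec { /* readers/writers for one on-disk index version */ }

        static final class V1Codec implements Codec { }

        // Assumed naming convention: a short version token delimited by dashes, e.g. "-aa-".
        private static final Pattern VERSION = Pattern.compile("-([a-z]{2})-");
        private static final Map<String, Codec> CODECS = Map.of("aa", new V1Codec());

        public static Codec codecFor(String fileName)
        {
            Matcher matcher = VERSION.matcher(fileName);
            if (!matcher.find())
                throw new IllegalArgumentException("No index version found in " + fileName);
            Codec codec = CODECS.get(matcher.group(1));
            if (codec == null)
                throw new IllegalStateException("Unsupported index version " + matcher.group(1));
            return codec;
        }
    }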

Index Format Version 1

The following applies to the version 1 index format.

JMX metrics

Users will have access to extensive JMX metrics to help them isolate performance issues on their clusters.

There are enough that listing them here would be counterproductive; however, they fall into the following groups:

  • Index - Metrics specific to a single index.
  • Column query - Metrics specific to an indexed column.
    • Trie index - Low level metrics for trie indexes.
    • KD-Tree - Low level metrics for kd-tree indexes.
  • Index group - Metrics for shared resources when multiple columns are indexed on the same table.
  • Table query - Metrics for all queries on a single table.
    • Per query - Metrics for queries.

Metrics include items such as: disk usage, memory usage, query latencies, compaction statistics, chunk cache hits/misses/lookups, open files.

Virtual Tables

Sets of the above metrics are also available through virtual tables. These metrics have been grouped into:

  • Index - global index metrics about the index build and query status and on-disk sizing.
  • SSTable - per-SSTable metrics
  • Segment - per-index metrics 

The per-SSTable and per-index metrics relate to the on-disk structures described below.

Terminology

  • Row ID - A monotonically increasing integer associated with every row in an SSTable. It is stored in index structures instead of the key token or key offset because it compresses better.
  • Postings/posting list - Sorted row IDs that match a given indexed value.
  • Primary key - A partition key and clustering representing a row in an SSTable.
  • Primary key store - A bi-directional block store allowing Row ID → Primary Key and Primary Key → Row ID lookups.
  • Segment - The smallest unit of on-disk indexing structure, flushed during compaction to reduce memory pressure. Multiple segments of an index are written to the same physical file.

SAI is optimised for storage. The primary key store is written once per SSTable. Column indexes access the primary key store using a row ID. The primary key store uses an on-disk trie of primary keys for primary key → row ID lookups and a prefix-compressed block store for row ID → primary key lookups.

Index implementations need only store an integer row ID in their postings list. Row IDs are translated to a primary key via the primary key store.

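The following sketch illustrates how a postings list of row IDs might be materialized into primary keys through such a store; the PrimaryKeyStore and PrimaryKey interfaces and their method names are assumptions for illustration, not the actual SAI API.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical view of the per-SSTable primary key store described above.
    interface PrimaryKeyStore
    {
        PrimaryKey primaryKeyFromRowId(long rowId); // block store lookup: row ID -> primary key
        long rowIdFromPrimaryKey(PrimaryKey key);   // trie lookup: primary key -> row ID
    }

    // Placeholder for a partition key plus clustering.
    interface PrimaryKey extends Comparable<PrimaryKey> { }

    final class PostingsMaterializer
    {
        // Translate the sorted row IDs of a column index postings list into primary keys.
        static List<PrimaryKey> materialize(long[] sortedRowIds, PrimaryKeyStore store)
        {
            List<PrimaryKey> keys = new ArrayList<>(sortedRowIds.length);
            for (long rowId : sortedRowIds)
                keys.add(store.primaryKeyFromRowId(rowId));
            return keys; // already in row ID order
        }
    }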

As multiple indexes share the primary key store files, it becomes feasible to index many columns on the same table without significantly increasing the index size.


Indexes may store their data in segments, allowing constrained memory usage and increasing the efficiency of queries where segments can be excluded early.


Index implementation differs depending on the column type being indexed. Currently there are two implementations:

  • Numeric - Based on a kd-tree
  • Text - Based on tries.


Numeric indexes

Numeric types, like timestamp, smallint, and double, are indexed on disk as a balanced binary search tree with postings at both internal and leaf nodes. A modified version of the one-dimensional block kd-tree from Lucene is used.




Row ID postings are kept separate from the tree to enable efficient retrieval of results in token order. This ordering is required by C*. Leaf and (some) internal nodes have row ID postings lists.

When querying, if a node is covered by the entire range of the query then the entire row ID postings for that node can be added to the result. 

If a node is only partially covered by the range, the row IDs are filtered and sorted. To filter the row IDs, each leaf node maintains a mapping of value to index within the row ID postings list. At most two leaf nodes will need filtering for a bounded query.
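
The traversal described above can be sketched roughly as follows, using a simplified node representation in which leaves keep their indexed values aligned with their postings; the types and names are illustrative only.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Illustrative one-dimensional tree node: every node carries postings for the rows beneath it;
    // leaves additionally keep values[i] aligned with postings[i] so they can be filtered.
    final class Node
    {
        long min, max;      // value range covered by this node
        long[] postings;    // sorted row IDs for all rows under this node
        long[] values;      // leaf only: indexed value for each postings entry
        Node left, right;   // null for leaves

        boolean isLeaf() { return left == null && right == null; }
    }

    final class RangeCollector
    {
        // Collect one postings list per visited node; the caller merges them afterwards.
        static void collect(Node node, long lower, long upper, List<long[]> out)
        {
            if (node == null || node.max < lower || node.min > upper)
                return;                                    // disjoint from the query range
            if (lower <= node.min && node.max <= upper)
            {
                out.add(node.postings);                    // fully covered: take the postings wholesale
                return;
            }
            if (node.isLeaf())
            {
                long[] filtered = new long[node.postings.length];
                int count = 0;
                for (int i = 0; i < node.values.length; i++)
                    if (node.values[i] >= lower && node.values[i] <= upper)
                        filtered[count++] = node.postings[i];
                long[] result = Arrays.copyOf(filtered, count);
                Arrays.sort(result);                       // filtered row IDs are sorted before merging
                out.add(result);
                return;
            }
            collect(node.left, lower, upper, out);         // partially covered internal node: recurse
            collect(node.right, lower, upper, out);
        }
    }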

Once the set of postings lists is obtained, a sorted merge can be used to stream the results in token order.
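
Below is a minimal sketch of that sorted merge, assuming each postings list is already sorted by row ID; the priority-queue based k-way merge is a standard technique and not necessarily the exact mechanism SAI uses.

    import java.util.List;
    import java.util.PriorityQueue;
    import java.util.function.LongConsumer;

    final class PostingsMerger
    {
        // Cursor over one sorted postings list.
        private static final class Cursor
        {
            final long[] postings;
            int pos;
            Cursor(long[] postings) { this.postings = postings; }
            long current() { return postings[pos]; }
        }

        // Stream the union of the given sorted postings lists in ascending row ID order.
        static void mergeInOrder(List<long[]> lists, LongConsumer consumer)
        {
            PriorityQueue<Cursor> queue = new PriorityQueue<>((a, b) -> Long.compare(a.current(), b.current()));
            for (long[] list : lists)
                if (list.length > 0)
                    queue.add(new Cursor(list));

            long last = -1; // row IDs are non-negative
            while (!queue.isEmpty())
            {
                Cursor cursor = queue.poll();
                long rowId = cursor.current();
                if (rowId != last)
                    consumer.accept(rowId); // skip duplicates appearing in more than one list
                last = rowId;
                if (++cursor.pos < cursor.postings.length)
                    queue.add(cursor);      // re-queue the cursor positioned on its next row ID
            }
        }
    }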

Numeric Range Performance Query Results

The kd-tree postings design yields constant-time query performance across all match sizes for the same query limit.

The following table gives an indication of algorithm performance when running queries against a 4 billion row dataset at 200 queries per second, where the number of range matches is given in the left-hand column. Tests were performed on an unreleased codebase; the numbers will be updated as soon as an OSS branch is available and are expected to change.

[Image: numeric range query latency results table]

Text indexes

text, varchar, and ascii types use an inverted index consisting of a dictionary of terms and posting lists for the terms.

The terms dictionary is implemented using a trie data structure, which provides excellent term prefix compression, with the postings file offset stored at the leaf node of the trie.

Below is a diagram that illustrates the basic structure of the trie. (Note the dotted lines that denote single pages on disk, which allow transitions to be made without additional reads.)

[Image: basic structure of the trie terms dictionary]
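
The in-memory sketch below illustrates the idea of a terms dictionary trie whose terminal nodes hold postings file offsets, with shared prefixes stored only once; the real structure is an on-disk trie, and the types here are illustrative.

    import java.util.Map;
    import java.util.TreeMap;

    // Illustrative in-memory terms dictionary: a trie whose terminal nodes hold the offset of the
    // term's postings list in the postings file. Shared prefixes are stored only once.
    final class TermsTrie
    {
        private static final class TrieNode
        {
            final Map<Character, TrieNode> children = new TreeMap<>();
            long postingsOffset = -1; // set only where a complete term ends
        }

        private final TrieNode root = new TrieNode();

        void add(String term, long postingsOffset)
        {
            TrieNode node = root;
            for (int i = 0; i < term.length(); i++)
                node = node.children.computeIfAbsent(term.charAt(i), c -> new TrieNode());
            node.postingsOffset = postingsOffset;
        }

        // Returns the postings file offset for an exact term match, or -1 if the term is absent.
        long postingsOffset(String term)
        {
            TrieNode node = root;
            for (int i = 0; i < term.length() && node != null; i++)
                node = node.children.get(term.charAt(i));
            return node == null ? -1 : node.postingsOffset;
        }
    }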

Postings are delta encoded and then laid out into either frame of reference (FoR) encoded blocks (when a maximum block size is reached) or blocks of variable-width integers (when a complete block, usually the last one, cannot be created).
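
As a rough sketch of that encoding choice: the byte-aligned packing and variable-width integer format below are assumptions for illustration rather than the actual SAI file format.

    import java.io.DataOutputStream;
    import java.io.IOException;

    final class PostingsEncoder
    {
        // Delta-encode a sorted block of row IDs.
        static long[] deltas(long[] block)
        {
            long[] deltas = new long[block.length];
            long previous = 0;
            for (int i = 0; i < block.length; i++)
            {
                deltas[i] = block[i] - previous;
                previous = block[i];
            }
            return deltas;
        }

        // Full blocks: frame of reference style, every delta written with the same width.
        static void writeForBlock(long[] deltas, DataOutputStream out) throws IOException
        {
            int bits = 1;
            for (long delta : deltas)
                bits = Math.max(bits, 64 - Long.numberOfLeadingZeros(delta | 1));
            out.writeByte(bits);
            for (long delta : deltas)
                writePacked(delta, bits, out);
        }

        // Partial (usually final) blocks: simple variable-width (varint style) integers.
        static void writeVariableWidthBlock(long[] deltas, DataOutputStream out) throws IOException
        {
            for (long delta : deltas)
            {
                long value = delta;
                while ((value & ~0x7FL) != 0)
                {
                    out.writeByte((int) ((value & 0x7F) | 0x80));
                    value >>>= 7;
                }
                out.writeByte((int) value);
            }
        }

        private static void writePacked(long value, int bits, DataOutputStream out) throws IOException
        {
            // Simplified: rounds up to whole bytes; a real encoder packs deltas at exact bit boundaries.
            for (int shift = ((bits + 7) / 8 - 1) * 8; shift >= 0; shift -= 8)
                out.writeByte((int) (value >>> shift));
        }
    }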