Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Table of Contents


...

Status

Current state:  Under DiscussionAdopted

Discussion threaddev discussion

...

  • Replace a search engine like Elastic or Solr
  • Tokenization
  • Prefix/wildcard/regex
  • Geo
  • Sorting


Approach

Add a new index implementation StorageAttachedIndex (SAI).

...

  • AND queries
  • Numeric range queries
  • Non-variable length numeric types
  • Text type indexes and equality queries
  • SSTable attached
  • Optional case sensitivity
  • Optional unicode normalizationPartition based iteration
  • Tokenization
  • Index versioning
  • Row-aware query path

Version 2 Features

  • Row iteration
  • OR queries
  • Tokenization
  • Prefix LIKE

Indexing options

  • OR Queries
  • Range queries on all types
  • Global sorting
  • case_sensitive - If the index is case sensitive or not.
  • normalize - If the index uses unicode normalization or not.

Write path

Write path is mostly the same as SASI where multiple column indexes are attached to a single memtable.

During flush, SAI will make use of the index memtable to generate an on-disk index file to avoid re-indexing the flushed sstable twice.

On-disk versioning of indexes is supported by embedding version information in the on-disk file format.

Read path

The read path in SAI is similar to the SASI read path with a merge of postings from the in-memory and SSTable indexes using the RangeIterator framework.

...

The only "easy" way around these two challenges is to focus our efforts on queries that are restricted to either partitions or small token ranges. These queries behave well locally even on LCS (given levels contain token-disjoint SSTables, and assuming a low number of unleveled SSTables), avoid fan-out and all of its secondary pitfalls, and allow us to make queries at varying CLs with reasonable performance. Attempting to fix the local problems around compaction strategy could mean either restricted strategy usage or partially abandoning SSTable-attachment. Attempting to fix distributed read path problems by pushing the design towards IR systems like ES could compromise our ability to use higher read CLs.

Addendum

Index Versioning

Index versioning is supported by including the index version in the on-disk file name.

The index version is mapped to a codec internally to write and read the on-disk index files.

Index Format Version 1

The following applies to the version 1 index format.

...

Metrics include items such as: disk usage, memory usage, query latencies, compaction statistics, chunk cache hits/misses/lookups, open files.

Virtual Tables

Sets of the above metrics are also available through virtual tables. These metrics have been grouped into:

  • Index - global index metrics about the index build and query status and on-disk sizing.
  • SSTable - per-SSTable metrics
  • Segment - per-index metrics 

The per-SSTable and per-index metrics relate to the on-disk structures described below.

Terminology

  • Row ID - A monotonic increasing integer associated with every row in a sstable. It’s stored in an index structure instead of key token or key offset, because it compresses better.
  • Postings/posting-list - Sorted row ids that match a given indexed value. 
  • Token file - An index of Row ID -> partition key token for every row in the sstable.
  • Primary key - A partition key and clustering representing a row in a SSTable
  • Primary key store - A bi-directional block store allowing Row ID → Primary Key and Primary Key → Row ID lookupsOffset file - An index of Row ID -> partition key offset on the data/primary-index file for every row in the sstable.
  • Segment - A smallest unit of on-disk indexing structure that is flushed during compaction to reduce memory pressure. Multiple segments of an index are written to the same physical file.

SAI is optimised for storage. Tokens and offsets are stored The primary key store is written once per SSTable.  Column indexes access the token and offset files primary key store using a row ID. Offsets are compressed using Frame of Reference (FoR) encoding while tokens are not because tokens consume the full 8 bytes and therefore cannot be compressedThe primary key store uses an on-disk trie containing primary keys to do the primary key to row id lookups and a prefix-compressed block store for row id to primary key lookups.

Index implementations need only store an integer row ID in their postings list. Row IDs are translated to decorated a primary key via the token/offset files and SSTableReader#keyAtprimary key store.

Image RemovedImage Added

As multiple indexes share the token/offset primary key store files, it becomes feasible to index many columns on the same table without significantly increasing the index size.

...

Postings are delta encoded and then laid out into either frame of reference (FoR) encoded blocks (when a maximum block size is reached) or blocks of variable width integers (when a complete block, usually the last one, cannot be created).Tokenization is out of scope; however, this design does not hinder its implementation.