Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: CEP is adopted

Table of Contents


...

Status

Current state: Under DiscussionAdopted

Discussion thread:  <tbd> intro, cep proposal, demo

JIRACASSANDRA-1850418557

Released:  Proposed for 5.0Unreleased

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

...

  1. Integrate Lucene's HNSW: The implementation will leverage Lucene's Hierarchical Navigable Small World (HNSW) library, which is the best ANN algorithm for Java and currently GA. HNSW provides a fast and efficient solution for finding approximate nearest neighbors in high-dimensional space, making it an ideal choice for this purpose.
  2. Create a new data type: To support the storage of high-dimensional vectors, a new data type will be introduced. This new data type will be represented internally as <TBD> and will enable the handling and storage of Float32 embeddings.
  3. Develop a new SAI index: A new storage-attached index (SAI) called VectorMemtableIndex will be developed to accommodate the ANN search functionality. This index will work in conjunction with the new data type and the HNSW library to enable efficient vector search capabilities within Apache Cassandra.
  4. Introduce a new CQL operator: A new CQL operator, ANN, will be introduced to perform ANN search queries using the newly implemented vector search functionality. This operator will allow users to easily perform ANN searches on their data with a simple and familiar query syntax.
  5. Distributed query:
    1. Coordinator queries all required replicas at once with the requested token range instead of vnode range to reduce round trips and num of local ANN searches;
      1. Required replicas are determined by requested token range and consistency level, just like normal range requests.
      2. Each replica will be queried with requested token range instead of vnode range to reduce number of round trips and num of local ANN searches.
      3. Coordinator will block for all contacted replicas to respond. This can cause memory issue in a large cluster.
    2. Replica collects rows with top-k scores and all tombstones based on local index results;
      1. Every sstable index produces top-k matches and SAI will union them to create result in primary key order.
      2. The processor uses PriorityQueue to store top-k rows and discard rows with lower scores. Note that it needs to collect all tombstones as well, it may hit TombstoneOverwhelmingException if there are too many tombstones.
      3. The processor will return top-k rows in primary order to coordinator
    3. Coordinator collects rows with top-k scores and discards rows with lower scores based on reconciled rows from replica responses.
      1. Data resolver reconciles responses from different replicas in primary key order. Short read protection, replica filtering protection and read repair are all disabled because mismatches may be caused by top-k rather than actual data inconsistency.
      2. The processor uses PriorityQueue to store top-k rows and discard rows with lower scores
      3. The processor will return top-k rows in primary order to client

About HNSW + Lucene

The Lucene implementation of HNSW (Hierarchical Navigable Small World) provides an efficient approximate nearest neighbor search for high-dimensional vectors in the context of the Lucene search library. At a high level, the Lucene implementation uses a Navigable Small-World graph, which is usually hierarchical but the current implementation has only a single layer. The decision to exclude the hierarchical structure was partly influenced by a paper claiming that there was little benefit from the hierarchy for high-dimensional vectors. However, later discussions noted that the implementation might benefit from the hierarchy, and it is worth considering adding the hierarchical layers to improve search performance and determinism later.

...