Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: add more details to vector distributed search

...

  1. Integrate Lucene's HNSW: The implementation will leverage Lucene's Hierarchical Navigable Small World (HNSW) library, which is the best ANN algorithm for Java and currently GA. HNSW provides a fast and efficient solution for finding approximate nearest neighbors in high-dimensional space, making it an ideal choice for this purpose.
  2. Create a new data type: To support the storage of high-dimensional vectors, a new data type will be introduced. This new data type will be represented internally as <TBD> and will enable the handling and storage of Float32 embeddings.
  3. Develop a new SAI index: A new storage-attached index (SAI) called VectorMemtableIndex will be developed to accommodate the ANN search functionality. This index will work in conjunction with the new data type and the HNSW library to enable efficient vector search capabilities within Apache Cassandra.
  4. Introduce a new CQL operator: A new CQL operator, ANN, will be introduced to perform ANN search queries using the newly implemented vector search functionality. This operator will allow users to easily perform ANN searches on their data with a simple and familiar query syntax.
  5. Distributed query:
    1. Coordinator queries all required replicas at once with the requested token range instead of vnode range to reduce round trips and num of local ANN searches;
      1. Required replicas are determined by requested token range and consistency level, just like normal range requests.
      2. Each replica will be queried with requested token range instead of vnode range to reduce number of round trips and num of local ANN searches.
      3. Coordinator will block for all contacted replicas to respond. This can cause memory issue in a large cluster.
    2. Replica collects rows with top-k scores and all tombstones based on local index results;
      1. Every sstable index produces top-k matches and SAI will union them to create result in primary key order.
      2. The processor uses PriorityQueue to store top-k rows and discard rows with lower scores. Note that it needs to collect all tombstones as well, it may hit TombstoneOverwhelmingException if there are too many tombstones.
      3. The processor will return top-k rows in primary order to coordinator
    3. Coordinator collects rows with top-k scores and discards rows with lower scores based on reconciled rows from replica responses.
      1. Data resolver reconciles responses from different replicas in primary key order. Short read protection, replica filtering protection and read repair are all disabled because mismatches may be caused by top-k rather than actual data inconsistency.
      2. The processor uses PriorityQueue to store top-k rows and discard rows with lower scores
      3. The processor will return top-k rows in primary order to client

...