...

  1. Implement approximate nearest neighbor (ANN) vector search capability in Apache Cassandra using storage-attached indexes (SAI).
  2. Support a vector of float32 embeddings as a new CQL type.
  3. Make ANN search work with the normal Cassandra data flow (inserting, updating, and deleting rows). The implementation should support adding a new vector in O(log N) time and answering ANN queries in O(M log N) time, where N is the number of vectors and M is the number of SSTables.
  4. Compose with other SAI predicates.
  5. Enable Apache Cassandra to serve as the vector search component in ML platforms, and make it intuitive to use for data engineers new to Cassandra.
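
The M log(N) query cost in goal 3 can be pictured as one index search per SSTable followed by a merge of the partial results. The sketch below is only an illustration under stated assumptions: `search_segment` and `search_all` are hypothetical names, a brute-force scan stands in for the per-SSTable index search, Euclidean distance is an arbitrary choice of metric, and rows shadowed by newer writes or deletes are not handled.

```python
import heapq
import itertools
import math


def search_segment(segment, query, k):
    """Return the k nearest (distance, key) pairs of one segment, nearest first.

    Brute force is a stand-in here; a real index would do a graph search.
    """
    scored = ((math.dist(vec, query), key) for key, vec in segment.items())
    return heapq.nsmallest(k, scored)


def search_all(segments, query, k):
    """Search every segment independently, then merge the sorted partial results."""
    per_segment = [search_segment(seg, query, k) for seg in segments]
    merged = heapq.merge(*per_segment)  # inputs are already sorted by distance
    return [key for _, key in itertools.islice(merged, k)]


# Hypothetical data: two segments standing in for two SSTables.
seg_a = {"a": [0.0, 0.0], "b": [5.0, 5.0]}
seg_b = {"c": [1.0, 1.0], "d": [9.0, 9.0]}
print(search_all([seg_a, seg_b], [0.9, 0.9], k=2))
```

Each segment contributes at most k candidates, so the merge step stays small even when M grows; the per-segment search dominates the cost.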

...


CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE test;

CREATE TABLE test.foo (
    i INT PRIMARY KEY,
    j VECTOR<float, 3>
);

CREATE CUSTOM INDEX ann_index ON foo(j) USING 'StorageAttachedIndex';

INSERT INTO test.foo (i, j) VALUES (1, [8, 2.3, 58]);
INSERT INTO test.foo (i, j) VALUES (2, [1.2, 3.4, 5.6]);
INSERT INTO test.foo (i, j) VALUES (5, [23, 18, 3.9]); 

SELECT * FROM test.foo WHERE j ANN OF [3.4, 7.8, 9.1] LIMIT 1;

 i | j
---+-----------------
 2 | [1.2, 3.4, 5.6]
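
For a table this small, the expected result of the query can be checked by exhaustive comparison. The sketch below ranks the three inserted vectors against the query vector. The similarity function used by the index is not stated here, so cosine similarity and Euclidean distance are both assumptions, and the function names are hypothetical.

```python
import math

# Rows inserted in the example above, keyed by primary key i.
ROWS = {1: [8, 2.3, 58], 2: [1.2, 3.4, 5.6], 5: [23, 18, 3.9]}
QUERY = [3.4, 7.8, 9.1]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def rank_by_cosine(rows, query):
    """Keys ordered from most to least similar under cosine similarity."""
    return sorted(rows, key=lambda k: cosine_similarity(rows[k], query), reverse=True)


def rank_by_euclidean(rows, query):
    """Keys ordered from nearest to farthest under Euclidean distance."""
    return sorted(rows, key=lambda k: math.dist(rows[k], query))


print("cosine ranking:   ", rank_by_cosine(ROWS, QUERY))
print("euclidean ranking:", rank_by_euclidean(ROWS, QUERY))
```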

...

  1. Verify that ANN search works with normal Cassandra data flow (insertion, updating, and deleting rows).
  2. Test the integration of Lucene's HNSW with the SAI framework.
    1. Verify cross-partition search and validate ANN results.
    2. Simulate on-disk corruption of stored data versus index data.
  3. Test the new data type (VECTOR<type, dimension>) and CQL operator (ANN) with various use cases.
  4. Evaluate the performance of the new features and their impact on existing Cassandra setups.
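
One common way to validate ANN results, as called for in the test plan above, is to compare them with an exact brute-force search and report recall. The following is a minimal sketch rather than the project's test harness: `exact_top_k` and `recall_at_k` are hypothetical helpers, Euclidean distance is an assumed metric, and the ANN result is supplied by hand instead of coming from a real query.

```python
import math


def exact_top_k(vectors, query, k):
    """Ground truth: keys of the k vectors nearest to query, by full scan."""
    return sorted(vectors, key=lambda key: math.dist(vectors[key], query))[:k]


def recall_at_k(ann_keys, exact_keys):
    """Fraction of the exact top-k that the ANN result also returned."""
    if not exact_keys:
        raise ValueError("exact_keys must not be empty")
    return len(set(ann_keys) & set(exact_keys)) / len(exact_keys)


# Hypothetical data set and a hand-written stand-in for an ANN query result.
vectors = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 2.0], "d": [5.0, 5.0]}
truth = exact_top_k(vectors, [0.1, 0.1], k=2)
ann_result = ["a", "c"]
print(truth, recall_at_k(ann_result, truth))
```

A test could assert that recall stays above a chosen threshold after inserts, updates, and deletes; the threshold itself is a tuning decision that this sketch does not prescribe.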

...