
Status

Current state: Draft

Discussion thread: https://lists.apache.org/thread/lrww4d7cdxgtg8o3gt8b8foymzpvq7z3

JIRA:

Released: 03/22/2023

Motivation

...

  • There are many features currently built into the Sidecar that could likely be implemented in Cassandra directly with minimal impact. That wasn’t feasible when the tool was initially developed, and keeping these features in the Sidecar has other benefits: it decouples the tooling from the main Cassandra release cycle and isolates this functionality so it cannot interfere with operations in the Cassandra process itself.

    • Upload/stage SSTables & coordinate import in Cassandra itself
      • This would allow Spark tasks to upload SSTables to only one node and then have that node coordinate the import, validating the desired consistency level was reached
      • This would also significantly reduce the bandwidth requirements between Spark and Cassandra, as today these files are uploaded to every replica
    • Stream SSTable snapshots directly from Cassandra to the Bulk Reader
      • If Cassandra could support range read requests, the Bulk Reader could create a snapshot and then read the data directly from Cassandra
  • While there is nothing inherent in the solution that prevents support for vnodes, they are not currently tested because the testing infrastructure doesn't (yet) support them. Work is ongoing to remove this limitation in the testing infrastructure, at which point we should be able to officially test and support vnodes.
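The coordinated-import idea above hinges on one node validating that enough replicas accepted the uploaded SSTables to satisfy the desired consistency level. A minimal sketch of that check, using Cassandra's standard quorum arithmetic (floor(RF/2) + 1); the class and method names here are illustrative, not existing Cassandra APIs:

```java
// Hypothetical sketch of the consistency-level check a coordinating node
// could run after fanning an SSTable import out to the replica set.
public class ImportConsistencyCheck {

    /** Replicas required for a QUORUM at the given replication factor. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** True if enough replicas acknowledged the import to satisfy QUORUM. */
    static boolean quorumReached(int replicaAcks, int replicationFactor) {
        return replicaAcks >= quorum(replicationFactor);
    }

    public static void main(String[] args) {
        // With RF=3, 2 of 3 replicas must acknowledge the import.
        System.out.println(quorumReached(2, 3)); // true
        System.out.println(quorumReached(1, 3)); // false
    }
}
```

Under this scheme a Spark task would upload SSTables to a single node; that node would distribute them and only report success once a check like the above passes, rather than requiring uploads to every replica as today.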

Rejected Alternatives

  • The Spark Cassandra Connector, which is significantly slower than directly reading and writing SSTables in Spark: this library provides an order-of-magnitude speed-up over Cassandra reads and writes performed through the default CQL driver and the Spark Cassandra Connector.
  • Modifications to allow this kind of bulk loading directly in the Cassandra server itself. While Cassandra now has zero-copy streaming, and some of that code could perhaps be leveraged to reduce the impact on C*, no such mechanism existed when the library was initially created (the Cassandra 2.1 days), and there may still be good reasons to isolate the uploads from the main Cassandra process. Supporting this would also require a significant rework of the Cassandra native protocol. While theoretically feasible, in practice it would be a massive change that could make migration difficult for existing users. We might pursue this in the future.
  • Modifications to nodetool import to perform “coordinated imports,” rejected mostly due to time constraints. There could be some value in having Cassandra coordinate imports into multiple instances at once and manage the consistency level checks. Additionally, knowing that all instances accepted a particular SSTable import could make it possible to mark those SSTables as repaired, which would cut down on post-import repair.