Status
Current state
Accepted
Discussion thread
https://lists.apache.org/thread/lrww4d7cdxgtg8o3gt8b8foymzpvq7z3
JIRA
...
Source Code
- Apache Cassandra Spark Analytics source code: https://github.com/frankgh/cassandra-analytics
- Changes required for Sidecar: https://github.com/frankgh/cassandra-sidecar/tree/CEP-28-bulk-apis
Released
03/22/2023
Motivation
...
Dataset<Row> df = getInputDataset();
df.write()
  .format("org.apache.cassandra.spark.bulkwriter.CassandraBulkSource")
  .option("SIDECAR_INSTANCES", "localhost,localhost2") // Provide at least one Sidecar host to which we can connect
  .option("KEYSPACE", "spark_test")
  .option("TABLE", "student")
  .option("BULK_WRITER_CL", "LOCAL_QUORUM")
  .option("LOCAL_DC", "DC1")
  .option("KEYSTORE_PATH", "/path/to/keystore")
  .option("KEYSTORE_PASSWORD", getKeystorePassFromSafePlace())
  .option("KEYSTORE_TYPE", "PKCS12")
  .option("CASSANDRA_SSL_ENABLED", "true")
  .mode("append")
  .save();
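The writer options above are plain string key/value pairs, so they can be assembled and checked before the Spark job starts. The following is a minimal sketch under stated assumptions: `BulkWriterOptionsSketch`, `parseInstances`, and `writerOptions` are hypothetical names that are not part of the analytics library, and the `SIDECAR_INSTANCES` key reflects our reading of the snippet above. It only shows how the "at least one Sidecar host" requirement might be enforced early.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of the analytics library. */
public final class BulkWriterOptionsSketch {
    private BulkWriterOptionsSketch() {}

    /** Splits a comma-separated host list, trims each entry, and rejects an empty result. */
    public static List<String> parseInstances(String csv) {
        List<String> hosts = new ArrayList<>();
        for (String part : csv.split(",")) {
            String host = part.trim();
            if (!host.isEmpty()) {
                hosts.add(host);
            }
        }
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("At least one Sidecar instance is required");
        }
        return hosts;
    }

    /** Builds the option map; key names are assumed to match the snippet above. */
    public static Map<String, String> writerOptions(String instancesCsv,
                                                    String keyspace,
                                                    String table,
                                                    String consistencyLevel,
                                                    String localDc) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("SIDECAR_INSTANCES", String.join(",", parseInstances(instancesCsv)));
        options.put("KEYSPACE", keyspace);
        options.put("TABLE", table);
        options.put("BULK_WRITER_CL", consistencyLevel);
        options.put("LOCAL_DC", localDc);
        return options;
    }
}
```

A map built this way could be passed to Spark with `df.write().format(...).options(map)`, since `DataFrameWriter` accepts a `java.util.Map<String, String>`. Whether the data source performs equivalent validation itself is not stated in this document.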
...
final Dataset<Row> df = SQLContext.getOrCreate(sc).read()
  .format("org.apache.cassandra.spark.bulkreader.CassandraDataSource")
  .option("sidecar_instances", "localhost,localhost2") // Provide at least one Sidecar host to which we can connect
  .option("keyspace", "my_keyspace")
  .option("table", "my_table")
  .option("DC", "DC1")
  .option("snapshotName", "my_sbr_snapshot_123")
  .option("createSnapshot", true)
  .option("defaultParallelism", sc.defaultParallelism())
  .option("numCores", numCores)
  .load();
// sum entire dataset on column 'c' (requires: import static org.apache.spark.sql.functions.sum)
final long result = df.agg(sum("c")).first().getLong(0);
...
Additionally, the SBR benefits from the fact that the Cassandra codebase can always read SSTables written in the previous major version's format. If Cassandra could both read and write the previous major SSTable format, we could stop embedding multiple versions of the cassandra-all jar in the solution to support mixed-mode clusters, such as those that exist during a major version upgrade.
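The reasoning above can be illustrated with a toy model. This is a sketch under stated assumptions, not how the library selects its dependencies: `SSTableCompatSketch`, `canRead`, and `embeddedMajorsNeeded` are hypothetical names, and Cassandra releases are reduced to plain major version numbers. It models "can handle the same or the previous major format" and counts how many embedded versions a mixed-mode cluster would need with and without that capability.

```java
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy model for illustration only; names and the integer version scheme are assumptions. */
public final class SSTableCompatSketch {
    private SSTableCompatSketch() {}

    /** A codebase at readerMajor can read SSTables from the same or the previous major version. */
    public static boolean canRead(int readerMajor, int sstableMajor) {
        return sstableMajor == readerMajor || sstableMajor == readerMajor - 1;
    }

    /**
     * Returns the major versions that would have to be embedded for a cluster whose nodes
     * run the given major versions. If the newest version can handle the previous major
     * format and the cluster spans at most two adjacent majors, one version suffices;
     * otherwise one embedded version per distinct major is assumed.
     */
    public static SortedSet<Integer> embeddedMajorsNeeded(Set<Integer> nodeMajors,
                                                          boolean supportsPreviousMajor) {
        if (nodeMajors.isEmpty()) {
            throw new IllegalArgumentException("Cluster must contain at least one node version");
        }
        TreeSet<Integer> sorted = new TreeSet<>(nodeMajors);
        int newest = sorted.last();
        if (supportsPreviousMajor && sorted.first() >= newest - 1) {
            TreeSet<Integer> single = new TreeSet<>();
            single.add(newest);
            return single;
        }
        return sorted;
    }
}
```

In this model, a cluster that is mid-upgrade between two adjacent majors needs two embedded versions when only the native format is supported, and one when the previous major format is supported as well.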
Architecture Diagrams/Overview of Data Flow
New or Changed Public Interfaces
...