Status

Current state

Accepted

Discussion thread

https://lists.apache.org/thread/lrww4d7cdxgtg8o3gt8b8foymzpvq7z3

JIRA

CASSANDRA-16222

Source Code

Released

03/22/2023

Motivation

...


Dataset<Row> df = getInputDataset();

df.write()
  .format("org.apache.cassandra.spark.bulkwriter.CassandraBulkSource")
  .option("SIDECAR_INSTANCES", "localhost,localhost2") // Provide at least one Sidecar host to which we can connect
  .option("KEYSPACE", "spark_test")
  .option("TABLE", "student")
  .option("BULK_WRITER_CL", "LOCAL_QUORUM")
  .option("LOCAL_DC", "DC1")
  .option("KEYSTORE_PATH", "/path/to/keystore")
  .option("KEYSTORE_PASSWORD", getKeystorePassFromSafePlace())
  .option("KEYSTORE_TYPE", "PKCS12")
  .option("CASSANDRA_SSL_ENABLED", "true")
  .mode("append")
  .save();
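
The write job above passes its Sidecar hosts as a single comma-separated string and requires at least one host. A minimal sketch of checking such a value on the client side before submitting the job might look like the following. The class `SidecarHosts` and its `parse` method are hypothetical names used only for illustration; they are not part of the proposed library, which may validate this option differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the bulk writer API): validates a
// comma-separated host list before it is handed to the Spark job.
final class SidecarHosts {
    private SidecarHosts() {}

    /** Splits on commas, trims whitespace, drops empty entries, and requires at least one host. */
    static List<String> parse(String commaSeparated) {
        List<String> hosts = new ArrayList<>();
        if (commaSeparated != null) {
            for (String part : commaSeparated.split(",")) {
                String host = part.trim();
                if (!host.isEmpty()) {
                    hosts.add(host);
                }
            }
        }
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("Provide at least one Sidecar host");
        }
        return hosts;
    }
}
```

Calling `SidecarHosts.parse(...)` on the configured value before building the writer would surface an empty or blank host list as an `IllegalArgumentException` on the driver, before any Spark tasks are started.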

...


final Dataset<Row> df = SQLContext.getOrCreate(sc).read()
    .format("org.apache.cassandra.spark.bulkreader.CassandraDataSource")
    .option("sidecar_instances", "localhost,localhost2") // Provide at least one Sidecar host to which we can connect
    .option("keyspace", "my_keyspace")
    .option("table", "my_table")
    .option("DC", "DC1")
    .option("snapshotName", "my_sbr_snapshot_123")
    .option("createSnapshot", true)
    .option("defaultParallelism", sc.defaultParallelism())
    .option("numCores", numCores)
    .load();
    
// sum entire dataset on column 'c'
final long result = df.agg(sum("c")).first().getLong(0);

...

Additionally, the SBR benefits from the fact that the Cassandra codebase can always read SSTables in the previous major version format. If Cassandra supported both reading from and writing to the previous major SSTable format, we could stop embedding multiple versions of the cassandra-all jar in the solution to support mixed-mode clusters, such as those that exist during major version upgrades.

Architecture Diagrams/Overview of Data Flow

[Two architecture and data flow diagrams; images not included]

New or Changed Public Interfaces

...