In this tutorial, you'll learn how to set up a very simple Spark application that reads data from and writes data to Cassandra. Before you start, you need basic knowledge of Apache Cassandra and Apache Spark. Refer to the DataStax and Cassandra documentation and the Spark documentation.
Install and launch a Cassandra cluster and a Spark cluster.
Configure a new Scala project with the Apache Spark and Spark Cassandra Connector dependencies.
The dependencies are easily retrieved via the spark-packages.org website. For example, if you're using sbt, your build.sbt should include something like this:
resolvers += "Spark Packages Repo" at "https://dl.bintray.com/spark-packages/maven"
libraryDependencies += "datastax" % "spark-cassandra-connector" % "2.0.1-s_2.11"
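For a standalone project, the build also needs Spark itself. The following build.sbt is a minimal sketch rather than a definitive configuration: the project name, the Scala version (2.11.8), and the Spark version (2.0.2) are assumptions chosen to match the 2.0.1-s_2.11 connector artifact, so adjust them to your cluster. Spark is marked as provided on the assumption that the application is launched with spark-submit, which supplies Spark at runtime.

```scala
// build.sbt (sketch; the name and versions below are assumptions)
name := "cassandra-quick-start"

scalaVersion := "2.11.8"

// Resolver and connector coordinates as given above
resolvers += "Spark Packages Repo" at "https://dl.bintray.com/spark-packages/maven"

libraryDependencies ++= Seq(
  // Spark is supplied by the cluster at runtime, hence "provided"
  "org.apache.spark" %% "spark-core" % "2.0.2" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.0.2" % "provided",
  "datastax" % "spark-cassandra-connector" % "2.0.1-s_2.11"
)
```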
The spark-packages libraries can also be used with spark-submit and spark-shell. The --packages option places the connector and all of its dependencies on the classpath of the Spark Driver and all Spark Executors.
$SPARK_HOME/bin/spark-shell --packages datastax:spark-cassandra-connector:2.0.1-s_2.11
$SPARK_HOME/bin/spark-submit --packages datastax:spark-cassandra-connector:2.0.1-s_2.11
For the list of available versions, see:
This driver does not depend on the Cassandra server code.
- For a detailed dependency list, see project/CassandraSparkBuild.scala
- For dependency versions, see project/Versions.scala
Create a simple keyspace and table in Cassandra. Run the following statements in cqlsh:
CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
CREATE TABLE test.kv(key text PRIMARY KEY, value int);
Then insert some example data:
INSERT INTO test.kv(key, value) VALUES ('key1', 1);
INSERT INTO test.kv(key, value) VALUES ('key2', 2);
Now you're ready to write your first Spark program using Cassandra.
Run the spark-shell with the --packages line for your version. To configure the default Spark configuration, pass key-value pairs with --conf:
$SPARK_HOME/bin/spark-shell --conf spark.cassandra.connection.host=127.0.0.1 \
--packages datastax:spark-cassandra-connector:2.0.1-s_2.11
This command sets the Spark Cassandra Connector parameter spark.cassandra.connection.host to 127.0.0.1. Change this to the address of one of the nodes in your Cassandra cluster.
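In a standalone application (as opposed to the shell), the same parameter can be set programmatically on the SparkConf before the SparkContext is created. The following is a minimal sketch under stated assumptions: the object name ConnectorConfSketch, the helper buildConf, and the application name are hypothetical, and the master URL is assumed to be supplied by spark-submit.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical object and helper names, used only for illustration.
object ConnectorConfSketch {

  // Build a SparkConf that carries the connector's contact point.
  // `true` also loads any spark.* settings from system properties.
  def buildConf(cassandraHost: String): SparkConf =
    new SparkConf(true)
      .setAppName("cassandra-quick-start") // assumed application name
      .set("spark.cassandra.connection.host", cassandraHost)

  def main(args: Array[String]): Unit = {
    // Replace 127.0.0.1 with the address of one of your Cassandra nodes.
    val conf = buildConf("127.0.0.1")

    // Assumes the master URL is passed in by spark-submit (--master ...).
    val sc = new SparkContext(conf)
    try {
      println(sc.getConf.get("spark.cassandra.connection.host"))
    } finally {
      sc.stop()
    }
  }
}
```

Values passed with --conf on the command line and values set in code end up in the same SparkConf, so either approach configures the connector.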
Enable Cassandra-specific functions on the SparkContext, SparkSession, RDD, and DataFrame:
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
Use the sc.cassandraTable method to view this table as a Spark RDD:
val rdd = sc.cassandraTable("test", "kv")
println(rdd.count)
println(rdd.first)
println(rdd.map(_.getInt("value")).sum)
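The rows above are generic CassandraRow objects. As a hedged sketch of a typed alternative, the connector can also map rows directly to Scala tuples, and select and where can narrow what is fetched. The snippet assumes it runs in the same spark-shell session (so sc already exists) and that test.kv contains the rows inserted earlier; the value names pairs and key1Rows are illustrative.

```scala
import com.datastax.spark.connector._

// `sc` is the SparkContext provided by spark-shell (assumption).
// Map each row to a (key, value) tuple instead of a CassandraRow.
val pairs = sc.cassandraTable[(String, Int)]("test", "kv")
  .select("key", "value")
pairs.collect().foreach { case (k, v) => println(s"$k -> $v") }

// Restrict the read to a single primary key; the predicate is
// passed to Cassandra as part of the CQL query.
val key1Rows = sc.cassandraTable[(String, Int)]("test", "kv")
  .select("key", "value")
  .where("key = ?", "key1")
println(key1Rows.collect().toList)
```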
Add two more rows to the table:
val collection = sc.parallelize(Seq(("key3", 3), ("key4", 4)))
collection.saveToCassandra("test", "kv", SomeColumns("key", "value"))
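To check the write, the table can be read back. The sketch below does so through the DataFrame API enabled by the org.apache.spark.sql.cassandra import. It assumes a Spark 2.x spark-shell session, where the SparkSession is available as spark, and the value name df is illustrative. Note that cassandraFormat takes the table name first and the keyspace second, the reverse of sc.cassandraTable.

```scala
import org.apache.spark.sql.cassandra._

// `spark` is the SparkSession provided by spark-shell (assumption).
// Argument order here is (table, keyspace).
val df = spark.read.cassandraFormat("kv", "test").load()

df.printSchema()          // columns: key, value
df.orderBy("key").show()  // should now include key3 and key4
println(df.count())
```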