The Apache Cassandra to Cloud Bigtable template copies a table from Apache Cassandra to Cloud Bigtable. This template requires minimal configuration and replicates the table structure in Cassandra as closely as possible in Cloud Bigtable.
The Apache Cassandra to Cloud Bigtable template is useful for the following:
- Migrating an Apache Cassandra database when short downtime is acceptable.
- Periodically replicating Cassandra tables to Cloud Bigtable for global serving.
📝 This is a Google-provided template! Please check the Provided templates documentation for how to use it without building from source, via Create job from template.
💡 This documentation is generated from Metadata Annotations. Do not change this file directly.
- cassandraHosts : The hosts of the Apache Cassandra nodes in a comma-separated list.
- cassandraKeyspace : The Apache Cassandra keyspace where the table is located.
- cassandraTable : The Apache Cassandra table to copy.
- bigtableProjectId : The Google Cloud project ID associated with the Bigtable instance.
- bigtableInstanceId : The ID of the Bigtable instance that the Apache Cassandra table is copied to.
- bigtableTableId : The name of the Bigtable table that the Apache Cassandra table is copied to.
- cassandraPort : The TCP port to use to reach Apache Cassandra on the nodes. The default value is 9042.
- defaultColumnFamily : The name of the column family of the Bigtable table. The default value is default.
- rowKeySeparator : The separator used to build row-keys. The default value is '#'.
- splitLargeRows : The flag for enabling splitting of large rows into multiple MutateRows requests. Note that when a large row is split between multiple API calls, the updates to the row are not atomic.
- writetimeCassandraColumnSchema : The GCS path to a schema file used to copy Cassandra writetimes to Bigtable. Generate the schema with:
cqlsh -e "select json * from system_schema.columns where keyspace_name='$CASSANDRA_KEYSPACE' and table_name='$CASSANDRA_TABLE'" > column_schema.json
Set $WRITETIME_CASSANDRA_COLUMN_SCHEMA to a GCS path, e.g. gs://$BUCKET_NAME/column_schema.json, then upload the schema to GCS: gcloud storage cp column_schema.json $WRITETIME_CASSANDRA_COLUMN_SCHEMA. Requires Cassandra version 2.2 onwards for JSON support.
- setZeroTimestamp : The flag for setting the Bigtable cell timestamp to 0 if the Cassandra writetime is not present. When this flag is not set, the default behavior is to set the Bigtable cell timestamp to the template replication time, i.e. now.
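As an illustration of rowKeySeparator: the row key is built from the Cassandra key column values joined by the separator, so a compound primary key such as (sensor_id, event_time) would produce keys like sensor_1#2024-01-01T00:00:00 (example names and values, not output from this template). A quick way to check which columns make up the key of the source table before migrating is to describe it with cqlsh:
# Illustrative command; <cassandra-host> is a placeholder for one of the nodes listed in cassandraHosts.
cqlsh <cassandra-host> 9042 -e "DESCRIBE TABLE $CASSANDRA_KEYSPACE.$CASSANDRA_TABLE"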
- Java 11
- Maven
- gcloud CLI, and execution of the following commands:
gcloud auth login
gcloud auth application-default login
🌟 Those dependencies are pre-installed if you use Google Cloud Shell!
This README provides instructions using the Templates Plugin.
This template is a Classic Template, meaning that the pipeline code will be executed only once and the pipeline will be saved to Google Cloud Storage for further reuse. Please check Creating classic Dataflow templates and Running classic templates for more information.
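The Maven commands below use -f v1 and assume you are at the root of a local checkout of the DataflowTemplates repository. If you do not have one yet, clone it first:
git clone https://github.com/GoogleCloudPlatform/DataflowTemplates.git
cd DataflowTemplates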
If the plan is to just stage the template (i.e., make it available to use) via the gcloud command or the Dataflow "Create job from template" UI, the -PtemplatesStage profile should be used:
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
mvn clean package -PtemplatesStage \
-DskipTests \
-DprojectId="$PROJECT" \
-DbucketName="$BUCKET_NAME" \
-DstagePrefix="templates" \
-DtemplateName="Cassandra_To_Cloud_Bigtable" \
-f v1
The -DgcpTempLocation=<temp-bucket-name> parameter can be specified to set the GCS bucket used by the DataflowRunner to write temp files to during serialization. The path used will be gs://<temp-bucket-name>/temp/.
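For example, to stage the template while writing temp files to a dedicated bucket (the bucket name below is a placeholder):
export TEMP_BUCKET_NAME=<temp-bucket-name>
mvn clean package -PtemplatesStage \
-DskipTests \
-DprojectId="$PROJECT" \
-DbucketName="$BUCKET_NAME" \
-DgcpTempLocation="$TEMP_BUCKET_NAME" \
-DstagePrefix="templates" \
-DtemplateName="Cassandra_To_Cloud_Bigtable" \
-f v1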
The command should build and save the template to Google Cloud, and then print the complete location on Cloud Storage:
Classic Template was staged! gs://<bucket-name>/templates/Cassandra_To_Cloud_Bigtable
The specific path should be copied as it will be used in the following steps.
Using the staged template:
You can use the path above to run the template (or share it with others for execution).
To start a job with the template at any time using gcloud, you need valid resources for the required parameters. With those in place, the following command line can be used:
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export REGION=us-central1
export TEMPLATE_SPEC_GCSPATH="gs://$BUCKET_NAME/templates/Cassandra_To_Cloud_Bigtable"
### Required
export CASSANDRA_HOSTS=<cassandraHosts>
export CASSANDRA_KEYSPACE=<cassandraKeyspace>
export CASSANDRA_TABLE=<cassandraTable>
export BIGTABLE_PROJECT_ID=<bigtableProjectId>
export BIGTABLE_INSTANCE_ID=<bigtableInstanceId>
export BIGTABLE_TABLE_ID=<bigtableTableId>
### Optional
export CASSANDRA_PORT=9042
export DEFAULT_COLUMN_FAMILY=default
export ROW_KEY_SEPARATOR="#"
export SPLIT_LARGE_ROWS=<splitLargeRows>
export WRITETIME_CASSANDRA_COLUMN_SCHEMA=<writetimeCassandraColumnSchema>
export SET_ZERO_TIMESTAMP=false
gcloud dataflow jobs run "cassandra-to-cloud-bigtable-job" \
--project "$PROJECT" \
--region "$REGION" \
--gcs-location "$TEMPLATE_SPEC_GCSPATH" \
--parameters "cassandraHosts=$CASSANDRA_HOSTS" \
--parameters "cassandraPort=$CASSANDRA_PORT" \
--parameters "cassandraKeyspace=$CASSANDRA_KEYSPACE" \
--parameters "cassandraTable=$CASSANDRA_TABLE" \
--parameters "bigtableProjectId=$BIGTABLE_PROJECT_ID" \
--parameters "bigtableInstanceId=$BIGTABLE_INSTANCE_ID" \
--parameters "bigtableTableId=$BIGTABLE_TABLE_ID" \
--parameters "defaultColumnFamily=$DEFAULT_COLUMN_FAMILY" \
--parameters "rowKeySeparator=$ROW_KEY_SEPARATOR" \
--parameters "splitLargeRows=$SPLIT_LARGE_ROWS" \
--parameters "writetimeCassandraColumnSchema=$WRITETIME_CASSANDRA_COLUMN_SCHEMA" \
--parameters "setZeroTimestamp=$SET_ZERO_TIMESTAMP"
For more information about the command, please check: https://cloud.google.com/sdk/gcloud/reference/dataflow/jobs/run
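After launching, you can monitor the job with the standard gcloud Dataflow commands, for example:
# List active jobs in the region, then inspect a specific one by its ID.
gcloud dataflow jobs list --project "$PROJECT" --region "$REGION" --status active
gcloud dataflow jobs describe <job-id> --project "$PROJECT" --region "$REGION"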
Using the plugin:
Instead of just generating the template in the folder, it is possible to stage and run the template in a single command. This may be useful for testing when changing the templates.
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export REGION=us-central1
### Required
export CASSANDRA_HOSTS=<cassandraHosts>
export CASSANDRA_KEYSPACE=<cassandraKeyspace>
export CASSANDRA_TABLE=<cassandraTable>
export BIGTABLE_PROJECT_ID=<bigtableProjectId>
export BIGTABLE_INSTANCE_ID=<bigtableInstanceId>
export BIGTABLE_TABLE_ID=<bigtableTableId>
### Optional
export CASSANDRA_PORT=9042
export DEFAULT_COLUMN_FAMILY=default
export ROW_KEY_SEPARATOR="#"
export SPLIT_LARGE_ROWS=<splitLargeRows>
export WRITETIME_CASSANDRA_COLUMN_SCHEMA=<writetimeCassandraColumnSchema>
export SET_ZERO_TIMESTAMP=false
mvn clean package -PtemplatesRun \
-DskipTests \
-DprojectId="$PROJECT" \
-DbucketName="$BUCKET_NAME" \
-Dregion="$REGION" \
-DjobName="cassandra-to-cloud-bigtable-job" \
-DtemplateName="Cassandra_To_Cloud_Bigtable" \
-Dparameters="cassandraHosts=$CASSANDRA_HOSTS,cassandraPort=$CASSANDRA_PORT,cassandraKeyspace=$CASSANDRA_KEYSPACE,cassandraTable=$CASSANDRA_TABLE,bigtableProjectId=$BIGTABLE_PROJECT_ID,bigtableInstanceId=$BIGTABLE_INSTANCE_ID,bigtableTableId=$BIGTABLE_TABLE_ID,defaultColumnFamily=$DEFAULT_COLUMN_FAMILY,rowKeySeparator=$ROW_KEY_SEPARATOR,splitLargeRows=$SPLIT_LARGE_ROWS,writetimeCassandraColumnSchema=$WRITETIME_CASSANDRA_COLUMN_SCHEMA,setZeroTimestamp=$SET_ZERO_TIMESTAMP" \
-f v1
Dataflow supports using Terraform to manage template jobs; see dataflow_job.
Terraform modules have been generated for most templates in this repository, including the parameters specific to each template. If available, they may be used instead of dataflow_job directly.
To use the autogenerated module, execute the standard terraform workflow:
cd v1/terraform/Cassandra_To_Cloud_Bigtable
terraform init
terraform apply
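The input variable names are defined by the generated module, so check its variables file for the exact set; as an illustration, assuming the module exposes the project, region, and required template parameters as variables, the apply can be driven from the same environment variables used above:
# Illustrative only; variable names depend on the generated module.
terraform apply \
-var="project=$PROJECT" \
-var="region=$REGION" \
-var="cassandraHosts=$CASSANDRA_HOSTS" \
-var="cassandraKeyspace=$CASSANDRA_KEYSPACE" \
-var="cassandraTable=$CASSANDRA_TABLE" \
-var="bigtableProjectId=$BIGTABLE_PROJECT_ID" \
-var="bigtableInstanceId=$BIGTABLE_INSTANCE_ID" \
-var="bigtableTableId=$BIGTABLE_TABLE_ID"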
To use dataflow_job directly:
provider "google-beta" {
project = var.project
}
variable "project" {
default = "<my-project>"
}
variable "region" {
default = "us-central1"
}
resource "google_dataflow_job" "cassandra_to_cloud_bigtable" {
provider = google-beta
template_gcs_path = "gs://dataflow-templates-${var.region}/latest/Cassandra_To_Cloud_Bigtable"
name = "cassandra-to-cloud-bigtable"
region = var.region
temp_gcs_location = "gs://bucket-name-here/temp"
parameters = {
cassandraHosts = "<cassandraHosts>"
cassandraKeyspace = "<cassandraKeyspace>"
cassandraTable = "<cassandraTable>"
bigtableProjectId = "<bigtableProjectId>"
bigtableInstanceId = "<bigtableInstanceId>"
bigtableTableId = "<bigtableTableId>"
# cassandraPort = "9042"
# defaultColumnFamily = "default"
# rowKeySeparator = ""#""
# splitLargeRows = "<splitLargeRows>"
# writetimeCassandraColumnSchema = "<writetimeCassandraColumnSchema>"
# setZeroTimestamp = "false"
}
}