This repository has been archived by the owner on Dec 20, 2024. It is now read-only.

Cassandra Configuration Files

JoeWinter edited this page Mar 26, 2015 · 19 revisions



The following are the primary Cassandra configuration files in the {cassandra_home}/conf folder:

  • cassandra.yaml: This file provides global configuration options. There are a lot of options in this file, but only a few typically require modification from their defaults.

  • log4j-server.properties: This file controls logging features of the Cassandra instance.

The following sections describe the options most commonly modified.

Setting Cassandra Logging Options

Cassandra logs messages about its operation using the log4j logging facility. The file that controls logging options is {cassandra_home}/conf/log4j-server.properties. The one option you should change is shown below:

log4j.appender.R.File=/var/log/cassandra/system.log

You should change this line to use a valid folder path where logs will be kept. The initial log file is called system.log. When a log file becomes full (the default limit is 20MB), it is renamed with a numeric suffix (e.g., ".1", ".2"). Once 50 files exist, the oldest files are removed.

The other option you may want to modify under specific situations is:

log4j.rootLogger=INFO,stdout,R

This line sets the global logging level to INFO. In certain diagnostic situations, you might want to change the INFO keyword to DEBUG to generate additional logging messages. Note that the DEBUG log level causes log files to grow quickly and slows down Cassandra.

Setting Cassandra Data File Locations

Cassandra creates three kinds of data files: commit logs, SSTables, and saved caches. (Somewhat confusingly, the SSTable files are sometimes called "data files" even though all three kinds of files hold data.) The folder path of each file kind is defined in cassandra.yaml. You should change all of the following folder options:

commitlog_directory: /var/lib/cassandra/commitlog

Set this option to the folder where commit logs should be stored. It should be a different disk than any of the SSTable disks (see below). Multiple commit logs are created in this folder, but they are deleted when they become obsolete, so typically commit logs do not require a lot of space.

data_file_directories:

- /var/lib/cassandra/data

Set this option to at least one root folder where SSTable files are to be stored. SSTables are the primary files containing application data. Each folder is listed on a secondary line, indented and beginning with a dash. Multiple data folders are recommended for better performance (see below).

saved_caches_directory: /var/lib/cassandra/saved_caches

Set this option to a valid folder name where Cassandra will save the key and row caches that it builds. It can be on the same disk as the commit log or the disk where software is installed, but it shouldn’t be on one of the SSTable disks. The amount of disk space needed for caches depends on the cache option settings.

When updates are sent to Cassandra, they are first written to a commit log file. The commit log files are "replayed" when a restart occurs, thereby providing recovery for updates that may not have been written to an SSTable file. Because commit logs are removed when they are no longer needed, they typically do not use much disk space.

After updates are written to the commit log, they are stored in memory and eventually sorted and flushed to disk as SSTables. Each SSTable is represented by multiple files including data, hash, and index files. When Cassandra is configured with multiple data file directories, it flushes each SSTable to the directory that has the most available space. Therefore, best practices for the commit log and SSTable files are:

  1. Each SSTable folder should reside on a separate disk. This allows concurrent I/Os: a separate I/O can be initiated for each disk.

  2. Each SSTable disk should be of the same size and used solely for SSTables. This prevents disk contention with other files, and it allows all disks to grow at the same rate.

  3. The commit log folder should reside on its own disk. Because data is flushed quickly as it is received, the commit log folder can receive a high volume of I/O, hence it should use its own disk to prevent contention with SSTable files. The disk does not have to be large since commit logs are discarded fairly quickly.
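As a rough illustration of these practices, the cassandra.yaml fragment below sketches a node with one dedicated commit log disk and two equally sized SSTable disks. The mount points (/mnt/commitlog, /mnt/data1, /mnt/data2) are hypothetical and only stand in for whatever disks your system provides.

```yaml
# Hypothetical layout: every path below is an assumption, not a default.

# Commit log on its own disk to avoid contention with SSTable I/O.
commitlog_directory: /mnt/commitlog/cassandra/commitlog

# One SSTable folder per dedicated, equally sized data disk.
data_file_directories:
    - /mnt/data1/cassandra/data
    - /mnt/data2/cassandra/data

# Saved caches may share the commit log disk, but not an SSTable disk.
saved_caches_directory: /mnt/commitlog/cassandra/saved_caches
```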

Cassandra Security Options

As described in the Security Considerations section, Cassandra does not encrypt its commit log or data files. Disk files must be protected through operating system-level file security or encryption (e.g., Windows EFS). This section describes how to protect Cassandra’s communication protocols: Thrift, CQL, JMX, and Gossip.

Securing the Cassandra API

Either the Cassandra Thrift or CQL API can be used by Doradus. By default, both APIs use an unencrypted connection and allow any process to connect and authenticate. To prevent unauthorized applications from directly accessing Cassandra, you can enable TLS.

The general steps for enabling TLS are described below:

  1. In the cassandra.yaml file on each Cassandra node: enable TLS by setting enabled to true under client_encryption_options. Require client authentication by setting require_client_auth to true. When client authentication is enabled, the truststore and truststore_password options must also be set. Finally, cipher_suites should be set to one or more cipher suites that are accessible to the JRE. An example of the required settings is shown below:

     client_encryption_options:
        enabled: true
        keystore: ../conf/.keystore
        keystore_password: cassandra
        require_client_auth: true
        truststore: conf/.truststore
        truststore_password: cassandra
        cipher_suites: [TLS_RSA_WITH_AES_128_CBC_SHA]
    
  2. Create a certificate that will be used by Doradus and add it to the Cassandra truststore on each node.

  3. In the doradus.yaml file for each Doradus instance: enable TLS by setting dbtls to true. Finally, set dbtls_cipher_suites to the same cipher(s) defined for Cassandra. An example of these settings is shown below:

    dbtls: true
    dbtls_cipher_suites: [TLS_RSA_WITH_AES_128_CBC_SHA]
    
  4. Add the certificate created in step 2 to the keystore for each Doradus instance. This requires that the keystore and keystorepassword options in each doradus.yaml file are also set.

More information about enabling TLS and creating certificates can be found in documentation such as the following: http://www.datastax.com/documentation/cassandra/2.0/cassandra/security/secureSSLClientToNode_t.html

Securing the Cassandra JMX Protocol

Cassandra supports the Java Management Extensions (JMX) protocol for monitoring and controlling Cassandra processes. JMX can be secured with authentication and/or encryption. However, JMX security is disabled in the {cassandra_home}/bin/cassandra.bat file that is included with Doradus. The common JMX options are defined in the JAVA_OPTS environment variable as shown below:

REM ***** JAVA options *****
set JAVA_OPTS=-ea^
 ...
 -Dcom.sun.management.jmxremote.port=7199^
 -Dcom.sun.management.jmxremote.ssl=false^
 -Dcom.sun.management.jmxremote.authenticate=false^
 ...

These options are summarized below:

  • port: This sets the port number that the remote JMX clients must use. As shown, port 7199 is the default used for Cassandra.

  • ssl: When set to true, this option requires remote clients to use SSL (TLS) to connect to the JMX port. When SSL is enabled, additional options are available to require remote clients to have a client-side certificate.

  • authenticate: When set to true, this option requires remote clients to provide a user ID and password in order to connect. Additional parameters are required to define allowed user IDs and passwords and the locations of the corresponding files.

Because JMX is a standard and external documentation is available for securing JMX, details for using SSL and/or authentication are not covered here. See for example the following:

http://docs.oracle.com/javase/6/docs/technotes/guides/management/agent.html

Though not shown above, another useful option is the java.rmi.server.hostname system property. On multi-homed systems, it can be set to cause JMX to bind to a specific IP address instead of the default one. For example:

set JAVA_OPTS=-Djava.rmi.server.hostname=10.1.82.121^
...

This causes the JMX port to bind to address 10.1.82.121.

Securing the Cassandra Gossip Protocol

In a multi-node cluster, each Cassandra node communicates with peer nodes using the Gossip protocol. For non-encrypted connections, the Gossip protocol uses a TCP port defined by the following cassandra.yaml option:

storage_port: 7000

When SSL is enabled for the Gossip protocol, the following cassandra.yaml file option defines the port number used:

ssl_storage_port: 7001

All nodes in a cluster should be configured to use the same storage_port and ssl_storage_port. To prevent eavesdropping or unauthorized disruptions, the Gossip protocol should be secured in production environments. However, because the protocol is used for high-performance operations such as replicating data between nodes, encryption is not recommended except for communication between remote locations.

For co-located nodes, the easiest way to secure the Gossip API is to deploy all Cassandra nodes on the same subnet and disallow access to the Gossip port from outside the subnet.

In large Cassandra deployments where multiple "racks" or "data centers" are deployed, each having some number of Cassandra nodes, the Gossip protocol can be secured for cross-rack or cross-data center communication. This is done with the following options in the cassandra.yaml file:

encryption_options:
    internode_encryption: none
    keystore: conf/.keystore
    keystore_password: cassandra
    truststore: conf/.truststore
    truststore_password: cassandra

Internode encryption (over the Gossip API) is enabled or disabled by the setting of the internode_encryption option. The following options are recognized:

  • none: This disables all inter-node encryption, meaning Cassandra nodes use unencrypted communication using the defined storage_port.

  • all: This enables encryption for all inter-node communication using the defined ssl_storage_port.

  • rack: This uses non-encrypted communication for nodes defined to be in the same rack (cabinet) and encrypted communication between nodes defined to be in different racks.

  • dc: This uses non-encrypted communication for nodes defined to be in the same data center and encrypted communication between nodes defined to be in different data centers.

When any encryption is enabled for the Gossip protocol, all authentication, key exchange, and data transfer occur with TLS v1 using 1024-bit RSA keys. This encryption suite is referred to as TLS_RSA_WITH_AES_128_CBC_SHA. It requires that keystore and truststore files are defined and initialized. These files are password-protected using the keystore_password and truststore_password options. Instructions for creating these files are publicly available, for example at this link:

http://docs.oracle.com/javase/6/docs/technotes/guides/security/jsse/JSSERefGuide.html#CreateKeystore
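For example, a deployment spanning two data centers might encrypt only the traffic that crosses between them. The fragment below is a minimal sketch of that case, assuming the keystore and truststore files have already been created. The top-level key is copied from the block above; some Cassandra releases name it server_encryption_options, so check the cassandra.yaml that ships with your installation. The passwords are the placeholder values shown earlier and should be replaced.

```yaml
# Sketch only: encrypt traffic between data centers, leave
# traffic inside a data center unencrypted.
encryption_options:
    internode_encryption: dc
    keystore: conf/.keystore
    keystore_password: cassandra      # placeholder, change in production
    truststore: conf/.truststore
    truststore_password: cassandra    # placeholder, change in production
```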

Configuring Cassandra for Clusters

By default, Cassandra assumes that it is operating as a stand-alone node. It must be configured to operate in a cluster. The following cassandra.yaml options affect a node’s participation in a cluster:

  • cluster_name: All nodes in the cluster must have the same name, which differentiates the cluster from other nodes that might be working in the same network or even on the same machine. The default name is "Test Cluster", so you should change this to something else like "Doradus Cluster".

  • initial_token: This value defines the beginning range of key values for which the node will be the primary owner. It is not set by default, and it may be valid to leave it unset when configuring a new node. However, for a "balanced" cluster, you will need to set this value for each node.

  • seeds: Seeds are IP addresses of neighboring nodes that this node can contact using the Gossip protocol. The addresses provide only an initial set: after a node is running, it remembers the addresses of other nodes in the network and contacts them when necessary. The seeds are therefore necessary only for the initial execution of a new node. Cassandra provides a generalized "seed provider" interface, but the built-in "simple seed provider" is sufficient for most situations.

  • listen_address: This is the IP address that other nodes use to communicate with this node. To participate in a cluster, you must change this from its default of "localhost". A host name can be used but is not recommended. The "any address" 0.0.0.0 will not work. You should use a static IP address visible to all other nodes. If the machine is multi-homed, a non-externally visible address (such as 192.168.x.x or 10.x.x.x) is a good choice.

  • partitioner: Beginning with the Cassandra 1.2 release, the default for this parameter is now Murmur3Partitioner. This random partitioning algorithm is more efficient than the older RandomPartitioner scheme, although the two are incompatible. All nodes in the cluster should use the same partitioning scheme. If you upgrade from an older Cassandra release, you’ll need to ensure this parameter matches your existing value.
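Taken together, the cluster options above might look like the following cassandra.yaml fragment. This is only a sketch for a hypothetical four-node cluster: the IP addresses are made up, and the initial_token value assumes Murmur3Partitioner, whose token range runs from -2^63 to 2^63-1, with tokens spaced evenly across the nodes.

```yaml
# Sketch for the third node (index 2) of a hypothetical 4-node cluster.
cluster_name: 'Doradus Cluster'

# Evenly spaced Murmur3 tokens: token(i) = -2^63 + i * (2^64 / 4)
#   node 0: -9223372036854775808
#   node 1: -4611686018427387904
#   node 2: 0
#   node 3: 4611686018427387904
initial_token: 0

partitioner: org.apache.cassandra.dht.Murmur3Partitioner

# Built-in simple seed provider; the addresses are assumptions.
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.1.82.121,10.1.82.122"

# Static address of this node, visible to all other nodes.
listen_address: 10.1.82.123
```

Each node gets its own initial_token and listen_address, while cluster_name, partitioner, and typically the seed list are identical on every node.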

For more details on Cassandra configuration options, see http://wiki.apache.org/cassandra/Operations. The Wiki site also has information on topics such as:

  • Adding new nodes to an existing cluster

  • Migrating from the older to newer random partitioning scheme

  • Recovering a node that has died

  • Removing a node from the cluster

  • Changing a cluster’s replication factor

  • Deploying larger clusters within multiple racks (cabinets) and even data centers

In addition to the Wiki site, there are several online sources and books on Cassandra configuration.

Other Cassandra Configuration Options

In addition to the options described in this section, there are other options in the cassandra.yaml file that you might want to change in certain circumstances. Here is a list of the most common options:

  • concurrent_reads: This value controls how many outstanding read operations are allowed at once. A recommended value is 16 times the number of data disks used.

  • concurrent_writes: This value controls how many outstanding write operations are allowed at once. A recommended value is 8 times the number of cores present on the machine.
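As a worked example of these two rules of thumb, consider a hypothetical node with 2 data disks and 8 cores. The hardware figures are assumptions chosen only to make the arithmetic concrete.

```yaml
# Hypothetical node: 2 data disks, 8 cores.
concurrent_reads: 32     # 16 x 2 data disks
concurrent_writes: 64    # 8 x 8 cores
```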

The next option is common to both the Thrift and CQL APIs:

  • rpc_address: This value controls the address(es) to which Cassandra binds when listening for client connections. The same value is used to control both the Thrift and CQL APIs. The value localhost will allow only local connections. A specific IP address can be used, or the address 0.0.0.0 can be used to cause Cassandra to accept connections on all network interfaces.

The next options are specific to the Thrift API:

  • start_rpc: When true, this option causes Cassandra to initialize the Thrift API.

  • rpc_port: This is the Thrift API port that applications such as Doradus connect to. You can change it from its default of 9160, but it should be the same on all nodes. (And you must configure Doradus to know what port to use.)

  • thrift_framed_transport_size_in_mb: This value controls the maximum size of a Thrift message that Cassandra will accept. The default (16MB) is often too small for Doradus applications that use "batch updates". A good idea is to increase this to 160MB.

The next options are specific to the CQL API:

  • start_native_transport: When this option is true, Cassandra initializes the CQL API.

  • native_transport_port: This is the CQL API port that applications such as Doradus connect to. You can change it from its default of 9042, but it should be the same on all nodes. If you configure Doradus to use the CQL API, its configuration must match this value.
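The client API options described above could be combined as in the following sketch. The address is hypothetical, the ports are the defaults mentioned in the text, and the frame size reflects the 160MB suggestion for batch updates. Enabling both APIs at once is shown only for illustration; you would typically enable the one Doradus is configured to use.

```yaml
# Address for client connections (hypothetical); 0.0.0.0 would
# accept connections on all network interfaces.
rpc_address: 10.1.82.121

# Thrift API
start_rpc: true
rpc_port: 9160
thrift_framed_transport_size_in_mb: 160   # raised for large batch updates

# CQL API
start_native_transport: true
native_transport_port: 9042
```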
