Skip to content

Latest commit

 

History

History
37 lines (24 loc) · 1.81 KB

1.0-partitioning.md

File metadata and controls

37 lines (24 loc) · 1.81 KB

Partitioning

http://docs.datastax.com/en/archived/cassandra/1.0/docs/cluster_architecture/partitioning.html

The doc is for 1.0 but I think it covers a lot of detail that is omitted by 3.x documentations

Take away

  • Column family data is partitioned across the nodes based on the row key
  • random partitioner (most cases) and byte order partitioner
  • Sequential writes can cause hot spots, although they mentioned it in byte order partitioner, but write to a wide row will also only hit a single node?

Detail

'Data partitioning determines how data is distributed across the nodes in the cluster. Three factors are involved with data distribution:'

  • A partitioner that determines which node to store the data on
  • The number of copies of data,

RandomPartitioner

  • The RandomPartitioner (org.apache.cassandra.dht.RandomPartitioner) is the default partitioning strategy for a Cassandra cluster, and in almost all cases is the right choice
  • The RandomPartitioner uses consistent hashing to determine which node stores which row. Unlike naive modulus-by-node-count, consistent hashing ensures that when nodes are added to the cluster, the minimum possible set of data is effected

ByteOrderPartitioner

  • The ByteOrderPartitioner orders rows lexically by key bytes
  • Using the ordered partitioner allows range scans over rows

It is not recommended due to

  • Sequential writes can cause hot spots: If your application tends to write or update a sequential block of rows at a time, then these writes are not distributed across the cluster; they all go to one node. This is frequently a problem for applications dealing with timestamped data
    • TODO: this is similar to Bigtable's tall and narrow suggestion?
  • More administrative overhead to load balance the cluster
  • Uneven load balancing for multiple column families