docs: rework section on node storage
This adds a little bit on repair and rebalancing as well.

Part of cockroachdb#9634.
BramGruneir committed Sep 30, 2016
1 parent 3b38e26 commit 2c0f3f1
65 changes: 44 additions & 21 deletions docs/design.md
@@ -629,27 +629,50 @@ There are some additional system entries sprinkled amongst the
non-system keys. See the Key-Prefix Accounting section in this document
for further details.

# Node Storage

Nodes maintain a separate instance of RocksDB for each disk. Each
RocksDB instance hosts any number of ranges. RPCs arriving at a
RoachNode are multiplexed based on the disk name to the appropriate
RocksDB instance. A single instance per disk is used to avoid
contention. If every range maintained its own RocksDB, global management
of available cache memory would be impossible and writers for each range
would compete for non-contiguous writes to multiple RocksDB logs.
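
To make the multiplexing concrete, here is a minimal Go sketch of routing a
request to the RocksDB instance that owns a given disk. The `engine` and
`node` types are illustrative placeholders, not CockroachDB's actual types.

```go
package main

import "fmt"

// engine stands in for a per-disk RocksDB instance; the real storage
// engine interface is considerably richer.
type engine struct {
	diskName string
}

// node multiplexes incoming RPCs onto the RocksDB instance that owns
// the named disk.
type node struct {
	engines map[string]*engine // one entry per disk
}

func (n *node) engineFor(diskName string) (*engine, error) {
	e, ok := n.engines[diskName]
	if !ok {
		return nil, fmt.Errorf("no RocksDB instance for disk %q", diskName)
	}
	return e, nil
}

func main() {
	n := &node{engines: map[string]*engine{
		"disk1": {diskName: "disk1"},
		"disk2": {diskName: "disk2"},
	}}
	if e, err := n.engineFor("disk2"); err == nil {
		fmt.Println("request routed to RocksDB instance on", e.diskName)
	}
}
```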

In addition to the key/value pairs of the range itself, various range
metadata is maintained; a simplified sketch follows the list below.

- participating replicas

- consensus metadata

- split/merge activity
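
A hypothetical, simplified shape of this per-range metadata is sketched
below; the type and field names are illustrative and do not mirror the
actual implementation.

```go
package storage

// ReplicaID identifies a single replica of a range (illustrative only).
type ReplicaID int

// RangeMetadata is a simplified sketch of the bookkeeping kept alongside
// a range's key/value data, covering the items listed above.
type RangeMetadata struct {
	Replicas  []ReplicaID // participating replicas
	Term      uint64      // consensus metadata, e.g. the current term
	VotedFor  ReplicaID   // consensus metadata, e.g. the last vote cast
	Splitting bool        // outstanding split activity
	Merging   bool        // outstanding merge activity
}
```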

A really good reference on tuning Linux installations with RocksDB is
[here](http://docs.basho.com/riak/latest/ops/advanced/backends/leveldb/).
# Stores and Storage

Nodes contain one or more stores. Each store should be placed on a unique
disk and contains an instance of RocksDB. These stores in turn hold a
collection of range replicas. More than one replica of a range is never
placed on the same store or even the same node.
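
This placement invariant can be illustrated with a small Go sketch; the
`store` and `node` types below are simplified stand-ins for the real ones.

```go
package main

import "fmt"

// store is a simplified stand-in for a per-disk store: it tracks which
// ranges currently have a replica on it.
type store struct {
	id       int
	rangeIDs map[int64]bool
}

// node groups one or more stores, one per disk.
type node struct {
	id     int
	stores []*store
}

// canPlaceReplica enforces the invariant that a node never holds more
// than one replica of the same range, across all of its stores.
func (n *node) canPlaceReplica(rangeID int64) bool {
	for _, s := range n.stores {
		if s.rangeIDs[rangeID] {
			return false
		}
	}
	return true
}

func main() {
	n := &node{id: 1, stores: []*store{
		{id: 1, rangeIDs: map[int64]bool{7: true}},
		{id: 2, rangeIDs: map[int64]bool{}},
	}}
	fmt.Println(n.canPlaceReplica(7)) // false: a replica of range 7 already lives on this node
	fmt.Println(n.canPlaceReplica(9)) // true
}
```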

When a cluster is first initialized, the few default starting ranges have
only a single replica, but as soon as other nodes are available they
replicate to them until they have reached their desired replication factor,
which defaults to 3. Ranges can have different replication factors, and when
the factor changes they up- or down-replicate to the appropriate number of
replicas. Since ranges are only created via splits, all replicas of a range
split at the same time, so replication factors rarely need to be adjusted.
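
As a small illustration of up- and down-replication, the hypothetical
`replicationDelta` helper below computes how many replicas must be added or
removed to reach a range's configured factor.

```go
package main

import "fmt"

// replicationDelta reports how many replicas must be added (positive) or
// removed (negative) to reach the range's configured replication factor.
func replicationDelta(current, target int) int {
	return target - current
}

func main() {
	// A freshly initialized range starts with a single replica and
	// up-replicates as nodes become available (default factor: 3).
	fmt.Println(replicationDelta(1, 3)) // 2: add two replicas
	// Lowering a range's replication factor triggers down-replication.
	fmt.Println(replicationDelta(5, 3)) // -2: remove two replicas
}
```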

# Self Repair

If a store has not been heard from (i.e. it has not gossiped its store
descriptor) for some time, 5 minutes by default, the cluster considers the
store to be dead. When this happens, all replicas located on that store are
considered unavailable and are removed from their ranges. The affected ranges
then up-replicate to other available stores until their desired replication
factor is met again. If 50% or more of a range's replicas are unavailable at
the same time, the range has no quorum and is considered unavailable until a
majority of its replicas become available again.
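
A minimal sketch of the two checks described above, assuming hypothetical
`storeIsDead` and `hasQuorum` helpers and the default 5 minute liveness
window:

```go
package main

import (
	"fmt"
	"time"
)

// timeUntilStoreDead is the default window mentioned above: a store that
// has not gossiped its descriptor within it is considered dead.
const timeUntilStoreDead = 5 * time.Minute

// storeIsDead reports whether a store's last gossiped descriptor is too old.
func storeIsDead(lastGossiped, now time.Time) bool {
	return now.Sub(lastGossiped) >= timeUntilStoreDead
}

// hasQuorum reports whether a range can still make progress: strictly more
// than half of its replicas must be available.
func hasQuorum(available, total int) bool {
	return 2*available > total
}

func main() {
	now := time.Now()
	fmt.Println(storeIsDead(now.Add(-10*time.Minute), now)) // true: declared dead
	fmt.Println(hasQuorum(2, 3))                            // true: range stays available
	fmt.Println(hasQuorum(1, 3))                            // false: no quorum, unavailable
}
```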

# Rebalancing

As more data is added to the system, some stores may grow faster than
others. To counteract this and to spread the overall load across the full
cluster, replicas are moved between stores while maintaining the desired
replication factor. The heuristics used to perform this rebalancing are
still in flux, but they take into account a number of different factors
(a rough scoring sketch follows the lists below):

- the number of replicas per store
- the total size of the data used per store
- free space available per store

In the future, some other factors that might be considered include:

- cpu/network load per store
- ranges that are used together often in queries
- number of active ranges per store
- ranges that already have some overlap (to decrease the chance of data loss)
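
Since the heuristics are in flux, the Go sketch below only illustrates the
general idea: score each store on the factors currently considered and
prefer the least-loaded one as a rebalancing target. The types, weights, and
`bestTarget` helper are made up for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// storeStats holds the per-store factors the rebalancer currently weighs.
type storeStats struct {
	name         string
	replicaCount int
	usedBytes    int64
	freeBytes    int64
}

// score returns a rough "fullness" measure; lower scores make better
// rebalancing targets. The weights are arbitrary for this example.
func score(s storeStats) float64 {
	capacity := float64(s.usedBytes + s.freeBytes)
	return float64(s.replicaCount) + 100*float64(s.usedBytes)/capacity
}

// bestTarget picks the least-loaded store as the destination for a moved replica.
func bestTarget(stores []storeStats) storeStats {
	sort.Slice(stores, func(i, j int) bool { return score(stores[i]) < score(stores[j]) })
	return stores[0]
}

func main() {
	stores := []storeStats{
		{name: "s1", replicaCount: 120, usedBytes: 900, freeBytes: 100},
		{name: "s2", replicaCount: 40, usedBytes: 200, freeBytes: 800},
	}
	fmt.Println("move replicas toward", bestTarget(stores).name) // s2
}
```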

# Range Metadata
