-
Notifications
You must be signed in to change notification settings - Fork 3.8k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Browse files
Browse the repository at this point in the history
75251: kvserver: segregate replica metrics in the replicateQueue r=amygao9 a=amygao9 This commit adds separate metrics for tracking adding/removing/rebalancing of voting/non-voting replicas. For instance, the add replica count is an aggregation of voter and non-voter addition counts, which are now also separately tracked. Also added a metric for tracking adding and removing decommissioning replicas. Fixes [#60694](#60694) Release justification: additive change to add metrics for existing functionality 76408: docs: up-replicating learners with snapshots r=amygao9 a=amygao9 This tech-note describes the process of up-replicating new learners, starting off with an introduction to learner snapshots. It then gives a walkthrough of the process from the replicateQueue, to the change replicas api, to sending the learner snapshot. It's meant to give a more brief overview of the process instead of diving deep into one topic. Release note: None Release justification: Non-production change with adding a tech note 77345: bazel: document usage of custom go runtime r=irfansharif a=irfansharif Release justification: non-production code changes Release note: None 77362: sql/sessioninit: use singleflight for populating cache r=ajwerner a=rafiss This prevents a thundering herd of multiple requests for the same key when the cache is empty. Release justification: high value bugfix Release note: None 77436: dev: improve ~/.bazelrc content check r=postamar a=postamar Previously, the `dev` build tool could report spurious configuration instructions regarding the remote build cache. This commit fixes this. Release justification: Internal build tool improvement Release note: None 77449: vendor: bump Pebble to e2b7bb844759 r=nicktrav a=jbowens ``` e2b7bb84 db: clear range key count during Batch Reset 9d0c3914 db: use L0 target file size, grandparent limit for intra L0 compactions e6b9afd6 sstable: fix bug with overall bound setting dc74f53e internal/manifest: refactor TestCheckOrdering to use ParseVersionDebug 02363a55 tool/logs: add ingested events, reorganize output e0cf288f tool/logs: fix single memtable flushes cd3d4bd4 ci: mark and close stale issues b4f1f653 base: fix InternalIteratorStats.Merge e7ea208e db: remove obsolete pickIntraL0 function 3515304a db: fix flaky TestOpenWALReplayReadOnlySeqNums ``` Release note: None Release justification: non-production code changes (3515304a, e7ea208e, cd3d4bd4, e0cf288f, 02363a55, dc74f53e), bug fixes (e2b7bb84, 9d0c3914, e6b9afd6, b4f1f653) Co-authored-by: Amy Gao <[email protected]> Co-authored-by: irfan sharif <[email protected]> Co-authored-by: Rafi Shamim <[email protected]> Co-authored-by: Marius Posta <[email protected]> Co-authored-by: Jackson Owens <[email protected]>
- Loading branch information
Showing
17 changed files
with
811 additions
and
94 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1297,10 +1297,10 @@ def go_deps(): | |
patches = [ | ||
"@cockroach//build/patches:com_github_cockroachdb_pebble.patch", | ||
], | ||
sha256 = "5b306806fca1c0defafe031c4cd842934aec44394bc44dc24bc805bc3fd609b8", | ||
strip_prefix = "github.com/cockroachdb/[email protected]20220301234049-69a82fe41c31", | ||
sha256 = "cc3201e4197273c3ddc0adf72ab1f800d7b09e2b9d50422cab619a854d8e4e80", | ||
strip_prefix = "github.com/cockroachdb/[email protected]20220307192532-e2b7bb844759", | ||
urls = [ | ||
"https://storage.googleapis.com/cockroach-godeps/gomod/github.com/cockroachdb/pebble/com_github_cockroachdb_pebble-v0.0.0-20220301234049-69a82fe41c31.zip", | ||
"https://storage.googleapis.com/cockroach-godeps/gomod/github.com/cockroachdb/pebble/com_github_cockroachdb_pebble-v0.0.0-20220307192532-e2b7bb844759.zip", | ||
], | ||
) | ||
go_repository( | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,93 @@ | ||
# Up-replicating replicas with snapshots | ||
|
||
This tech note briefly explains the end-to-end process of up-replicating a range, starting from the `replicateQueue` to | ||
sending Learner snapshots. This is meant to serve as a high level overview of the code paths taken, and more detailed | ||
descriptions can be found in tech notes specific to each area. | ||
|
||
## Introduction | ||
|
||
Raft snapshots are necessary when a follower replica is unable to catch up using the existing Raft logs of the leader. | ||
Such a scenario occurs when an existing replica falls too far behind the rest of the Raft group such that we've | ||
truncated the Raft log above it. On the other hand, Learner snapshots are necessary when we add a new replica (due to a | ||
rebalancing operation) and the new replica needs a snapshot to up-replicate. We should make a small distinction | ||
that `Learner snapshots` are sent during rebalancing operations by the `replicateQueue` and `Raft snapshots` | ||
are sent by the `raftSnapshotQueue` when a replica falls behind. However, there is no difference in the way the | ||
snapshots are sent, generated or applied. For example, in the case where the replication factor is increased, the Raft | ||
group would need to create a new replica from a clean slate. To up-replicate this replica, we would send it a Learner | ||
snapshot of the full range data. | ||
|
||
In this note, we will focus on the scenario where a Learner snapshot is needed for the up-replication and rebalancing of | ||
replicas. | ||
|
||
## ReplicateQueue | ||
|
||
As a brief overview, the `replicateQueue` manages a queue of replicas and is responsible for processing replicas for | ||
rebalancing. For each replica that holds the lease, it invokes the allocator to compute decisions on whether any placement changes are needed | ||
and the `replicateQueue` executes these changes. Then the `replicateQueue` calls `changeReplicas` to act on these change | ||
requests and repeats until all changes are complete. | ||
|
||
## ChangeReplicas API | ||
|
||
The `AdminChangeReplicas` API exposed by KV is mainly responsible for atomically changing replicas of a range. These | ||
changes include non-voter promotions, voter/non-voter swaps, additions and removals. The change is performed in a | ||
distributed transaction and takes effect when the transaction is committed. There are many details involved in these | ||
changes, but for the purposes of this note, we will focus on up-replication and the process of sending Learner | ||
snapshots. | ||
|
||
During up-replication, `ChangeReplicas` runs a transaction to add a new learner replica to the range. Learner | ||
replicas are new replicas that receive Raft traffic but do not participate in quorum; allowing the range to remain | ||
highly available during the replication process. | ||
|
||
Once the learners have been added to the range, they are synchronously sent a Learner snapshot from the leaseholder | ||
(which is typically, but not necessarily, the Raft leader) to up-replicate. There is some nuance here as Raft snapshots | ||
are typically automatically sent by the existing Raft snapshot queue. However, we are synchronously sending a snapshot | ||
here in the replicateQueue to quickly catch up learners. To prevent a race with the raftSnapshotQueue, we lock snapshots | ||
to learners and non-voters on the current leaseholder store while processing the change. In addition, we also place a | ||
lock on log truncation to ensure we don't truncate the Raft | ||
log while a snapshot is inflight, preventing wasted snapshots. However, both of these locks are the best effort as | ||
opposed to guaranteed as the lease can be transferred and the new leaseholder could still truncate the log. | ||
|
||
## Sending Snapshots | ||
|
||
The snapshot process itself is broken into three parts: generating, transmitting and applying the snapshot. This process | ||
is the same whether it is invoked by the `replicateQueue` or `raftSnapshotQueue`. | ||
|
||
### Generating the snapshot | ||
|
||
A snapshot is a bulk transfer of all replicated data in a range and everything the replica needs to be a member of a | ||
Raft group. It consists of a consistent view of the state of some replica of a range as of an applied index. The | ||
`GetSnapshot` method gets a storage engine snapshot which reflects the replicated state as of the applied index the | ||
snapshot is generated at, and creates an iterator from the storage snapshot. A storage engine snapshot is important to | ||
ensure that multiple iterators created at different points in time see a consistent replicated state that does not | ||
change while the snapshot is streamed. In our current code, the engine snapshot is not necessary for correctness as only | ||
one iterator is constructed. | ||
|
||
### Transmitting the snapshot | ||
|
||
The snapshot transfer is sent through a bi-directional stream of snapshot requests and snapshot responses. However, | ||
before the actual range data is sent, the streaming rpc first sends a header message to the recipient store and blocks | ||
until the store responds to accept or reject the snapshot data. | ||
|
||
The recipient checks the following conditions before accepting the snapshot. We currently allow one concurrent snapshot | ||
application on a receiver store at a time. Consequently, if there are snapshot applications already in-flight on the | ||
receiver, incoming snapshots are throttled or rejected if they wait too long. The receiver then checks whether its store | ||
has a compatible replica present to have the snapshot applied. In the case of up-replication, we expect the learner | ||
replica to be present on the receiver store. However, if the snapshot overlaps an existing replica or replica | ||
placeholder on the receiver, the snapshot will be rejected as well. Once these conditions have been verified, a response | ||
message will be sent back to the sender. | ||
|
||
Once the recipient has accepted, the sender proceeds to use the iterator to read chunks of | ||
size `kv.snapshot_sender.batch_size` from storage into memory. The batch size is enforced as to not hold on to a | ||
significant chunk of data in memory. Then, in-memory batches are created to send the snapshot data in streaming grpc | ||
message chunks to the recipient. Note that snapshots are rate limited depending on cluster settings as to not overload | ||
the network. Finally, once all the data is sent, the sender sends a final message to the receiver and waits for a | ||
response from the recipient indicating whether the snapshot was a success. On the receiver side, the key-value pairs are | ||
used to construct multiple SSTs for direct ingestion, which prevents the receiver from holding the entire snapshot in | ||
memory. | ||
|
||
### Applying the snapshot | ||
|
||
Once the snapshot is received on the receiver, defensive checks are applied to ensure the correct snapshot is received. | ||
It ensures there is an initialized learner replica or creates a placeholder to accept the snapshot. Finally, the Raft | ||
snapshot message is handed to the replica’s Raft node, which will apply the snapshot. The placeholder or learner is then | ||
converted to a fully initialized replica. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.