75251: kvserver: segregate replica metrics in the replicateQueue r=amygao9 a=amygao9

This commit adds separate metrics for tracking the addition, removal, and rebalancing of voting and non-voting replicas. For instance, the add-replica count is now an aggregation of the voter and non-voter addition counts, which are also tracked separately. It also adds a metric for tracking the addition and removal of decommissioning replicas.
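As a loose illustration of the aggregation described above (the struct and counter names below are hypothetical, not the actual CockroachDB metric definitions), the aggregate count can simply be derived from the separately tracked voter and non-voter counters:

```go
package sketch

import "sync/atomic"

// replicateQueueMetrics is a hypothetical stand-in for the real metrics
// struct; CockroachDB uses its own metric types rather than bare atomics.
type replicateQueueMetrics struct {
	addVoterCount    atomic.Int64 // voter additions, tracked separately
	addNonVoterCount atomic.Int64 // non-voter additions, tracked separately
}

// trackAdd bumps the counter for the specific replica type being added.
func (m *replicateQueueMetrics) trackAdd(isVoter bool) {
	if isVoter {
		m.addVoterCount.Add(1)
	} else {
		m.addNonVoterCount.Add(1)
	}
}

// addReplicaCount is the aggregate "add replica" metric: the sum of the
// separately tracked voter and non-voter addition counts.
func (m *replicateQueueMetrics) addReplicaCount() int64 {
	return m.addVoterCount.Load() + m.addNonVoterCount.Load()
}
```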

Fixes [#60694](#60694)
Release justification: additive change to add metrics for existing functionality

76408: docs: up-replicating learners with snapshots r=amygao9 a=amygao9

This tech note describes the process of up-replicating new learners, starting off with an introduction to learner snapshots.
It then gives a walkthrough of the process from the `replicateQueue`, to the ChangeReplicas API, to sending the Learner snapshot. It is meant to give a brief overview of the process rather than a deep dive into any one topic.

Release note: None
Release justification: Non-production change adding a tech note

77345: bazel: document usage of custom go runtime r=irfansharif a=irfansharif

Release justification: non-production code changes
Release note: None

77362: sql/sessioninit: use singleflight for populating cache r=ajwerner a=rafiss

This prevents a thundering herd of concurrent requests for the same key
when the cache is empty.
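A minimal sketch of the pattern with `golang.org/x/sync/singleflight` (the cache shape here is hypothetical):

```go
package sketch

import (
	"context"

	"golang.org/x/sync/singleflight"
)

type sessionInitCache struct {
	group singleflight.Group
	// underlying storage and invalidation omitted
}

// getOrPopulate collapses concurrent lookups for the same key into a single
// fetch: only one caller runs fetch, and the others block and share its
// result, avoiding a thundering herd when the cache is cold.
func (c *sessionInitCache) getOrPopulate(
	ctx context.Context, key string, fetch func(context.Context) (interface{}, error),
) (interface{}, error) {
	v, err, _ := c.group.Do(key, func() (interface{}, error) {
		return fetch(ctx)
	})
	return v, err
}
```

Callers that arrive while a fetch for the same key is in flight simply block and receive the shared result (and error), so the backing store sees at most one query per key at a time.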

Release justification: high value bugfix
Release note: None

77436: dev: improve ~/.bazelrc content check r=postamar a=postamar

Previously, the `dev` build tool could report spurious configuration
instructions regarding the remote build cache. This commit fixes that.

Release justification: Internal build tool improvement
Release note: None

77449: vendor: bump Pebble to e2b7bb844759 r=nicktrav a=jbowens

```
e2b7bb84 db: clear range key count during Batch Reset
9d0c3914 db: use L0 target file size, grandparent limit for intra L0 compactions
e6b9afd6 sstable: fix bug with overall bound setting
dc74f53e internal/manifest: refactor TestCheckOrdering to use ParseVersionDebug
02363a55 tool/logs: add ingested events, reorganize output
e0cf288f tool/logs: fix single memtable flushes
cd3d4bd4 ci: mark and close stale issues
b4f1f653 base: fix InternalIteratorStats.Merge
e7ea208e db: remove obsolete pickIntraL0 function
3515304a db: fix flaky TestOpenWALReplayReadOnlySeqNums
```

Release note: None

Release justification: non-production code changes (3515304a, e7ea208e,
cd3d4bd4, e0cf288f, 02363a55, dc74f53e), bug fixes (e2b7bb84, 9d0c3914,
e6b9afd6, b4f1f653)

Co-authored-by: Amy Gao <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Marius Posta <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
6 people committed Mar 7, 2022
7 parents cfa9a7f + 9f37eee + e39ec82 + 9170df0 + d92bedd + 542e635 + 6bc31c1 commit 2867db0
Showing 17 changed files with 811 additions and 94 deletions.
6 changes: 3 additions & 3 deletions DEPS.bzl
@@ -1297,10 +1297,10 @@ def go_deps():
patches = [
"@cockroach//build/patches:com_github_cockroachdb_pebble.patch",
],
sha256 = "5b306806fca1c0defafe031c4cd842934aec44394bc44dc24bc805bc3fd609b8",
strip_prefix = "github.com/cockroachdb/[email protected]20220301234049-69a82fe41c31",
sha256 = "cc3201e4197273c3ddc0adf72ab1f800d7b09e2b9d50422cab619a854d8e4e80",
strip_prefix = "github.com/cockroachdb/[email protected]20220307192532-e2b7bb844759",
urls = [
"https://storage.googleapis.com/cockroach-godeps/gomod/github.com/cockroachdb/pebble/com_github_cockroachdb_pebble-v0.0.0-20220301234049-69a82fe41c31.zip",
"https://storage.googleapis.com/cockroach-godeps/gomod/github.com/cockroachdb/pebble/com_github_cockroachdb_pebble-v0.0.0-20220307192532-e2b7bb844759.zip",
],
)
go_repository(
19 changes: 19 additions & 0 deletions WORKSPACE
@@ -137,16 +137,35 @@ http_archive(
load(
"@io_bazel_rules_go//go:deps.bzl",
"go_download_sdk",
"go_host_sdk",
"go_local_sdk",
"go_register_toolchains",
"go_rules_dependencies",
)

# To point to a mirrored artifact, use:
#
go_download_sdk(
name = "go_sdk",
urls = ["https://storage.googleapis.com/public-bazel-artifacts/go/{}"],
version = "1.17.6",
)

# To point to a local SDK path, use the following instead. We'll call the
# directory into which you cloned the Go repository $GODIR[1]. You'll have to
# first run ./make.bash from $GODIR/src to pick up any custom changes.
#
# [1]: https://go.dev/doc/contribute#testing
#
# go_local_sdk(
# name = "go_sdk",
# path = "<path to $GODIR>",
# )

# To use whatever your local SDK is, use the following instead:
#
# go_host_sdk(name = "go_sdk")

go_rules_dependencies()

go_register_toolchains(nogo = "@cockroach//:crdb_nogo")
93 changes: 93 additions & 0 deletions docs/tech-notes/change-replicas.md
@@ -0,0 +1,93 @@
# Up-replicating replicas with snapshots

This tech note briefly explains the end-to-end process of up-replicating a range, starting from the `replicateQueue` and
ending with sending Learner snapshots. It is meant to serve as a high-level overview of the code paths taken; more
detailed descriptions can be found in tech notes specific to each area.

## Introduction

Raft snapshots are necessary when a follower replica is unable to catch up using the existing Raft logs of the leader.
Such a scenario occurs when an existing replica falls so far behind the rest of the Raft group that we've truncated the
Raft log above it. Learner snapshots, on the other hand, are necessary when we add a new replica (due to a rebalancing
operation) and the new replica needs a snapshot to up-replicate. For example, when the replication factor is increased,
the Raft group needs to create a new replica from a clean slate, and to up-replicate this replica, we send it a Learner
snapshot of the full range data. A small distinction worth making is that `Learner snapshots` are sent by the
`replicateQueue` during rebalancing operations, while `Raft snapshots` are sent by the `raftSnapshotQueue` when a
replica falls behind. However, there is no difference in the way the snapshots are generated, sent or applied.

In this note, we will focus on the scenario where a Learner snapshot is needed for the up-replication and rebalancing of
replicas.

## ReplicateQueue

As a brief overview, the `replicateQueue` manages a queue of replicas and is responsible for processing replicas for
rebalancing. For each replica that holds the lease, it invokes the allocator to decide whether any placement changes
are needed, executes those decisions by calling `changeReplicas`, and repeats until all changes are complete.
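The control flow can be sketched as below; every type and helper here is a hypothetical stand-in for the real
`kvserver` code, not its actual API:

```go
package sketch

import "context"

// Hypothetical stand-ins for the real kvserver types.
type placementChange struct{ desc string }

type allocator struct{ pending []placementChange }

// computePlacementChange returns the next placement change for the replica,
// if any. The real allocator computes this from range and store state.
func (a *allocator) computePlacementChange(_ *replica) (placementChange, bool) {
	if len(a.pending) == 0 {
		return placementChange{}, false
	}
	next := a.pending[0]
	a.pending = a.pending[1:]
	return next, true
}

type replica struct{}

// changeReplicas stands in for the call that executes one placement change.
func (r *replica) changeReplicas(ctx context.Context, c placementChange) error {
	return nil
}

// processOneReplica mirrors the replicateQueue control flow described above:
// compute the next change, execute it, and repeat until none are needed.
func processOneReplica(ctx context.Context, repl *replica, alloc *allocator) error {
	for {
		change, ok := alloc.computePlacementChange(repl)
		if !ok {
			return nil
		}
		if err := repl.changeReplicas(ctx, change); err != nil {
			return err
		}
	}
}
```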

## ChangeReplicas API

The `AdminChangeReplicas` API exposed by KV is mainly responsible for atomically changing replicas of a range. These
changes include non-voter promotions, voter/non-voter swaps, additions and removals. The change is performed in a
distributed transaction and takes effect when the transaction is committed. There are many details involved in these
changes, but for the purposes of this note, we will focus on up-replication and the process of sending Learner
snapshots.

During up-replication, `ChangeReplicas` runs a transaction to add a new learner replica to the range. Learner
replicas are new replicas that receive Raft traffic but do not participate in quorum, allowing the range to remain
highly available during the replication process.
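The sequence can be summarized in a short sketch (the helper names are hypothetical; the real implementation
interleaves these steps with extensive validation and rollback logic):

```go
package sketch

import "context"

// Hypothetical helpers for each step of up-replication.
func addLearnerTxn(ctx context.Context) error       { return nil }
func sendLearnerSnapshot(ctx context.Context) error { return nil }
func promoteLearnerTxn(ctx context.Context) error   { return nil }

// upReplicateOne sketches the sequence described above.
func upReplicateOne(ctx context.Context) error {
	// 1. A transaction atomically adds the new replica to the range
	//    descriptor as a LEARNER: it receives Raft traffic but is not
	//    counted for quorum, so availability is unaffected.
	if err := addLearnerTxn(ctx); err != nil {
		return err
	}
	// 2. The leaseholder synchronously sends the learner a snapshot of the
	//    full range data so it can catch up.
	if err := sendLearnerSnapshot(ctx); err != nil {
		return err
	}
	// 3. A final transaction promotes the caught-up learner to a voter.
	return promoteLearnerTxn(ctx)
}
```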

Once the learners have been added to the range, they are synchronously sent a Learner snapshot from the leaseholder
(which is typically, but not necessarily, the Raft leader) to up-replicate. There is some nuance here, as Raft
snapshots are normally sent automatically by the `raftSnapshotQueue`; the `replicateQueue` sends this snapshot
synchronously so that learners are caught up quickly. To prevent a race with the `raftSnapshotQueue`, we lock
snapshots to learners and non-voters on the current leaseholder store while processing the change. In addition, we
place a lock on log truncation to ensure we don't truncate the Raft log while a snapshot is inflight, preventing
wasted snapshots. However, both of these locks are best-effort rather than guaranteed, as the lease can be
transferred and the new leaseholder could still truncate the log.
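To make the best-effort nature concrete, here is a toy version of such a guard; the real bookkeeping lives in the
store/replica state and is not structured this way:

```go
package sketch

import "sync"

// snapshotLockMap is a toy version of the guard described above: while the
// replicateQueue is sending a learner snapshot for a range, other senders
// check the map and skip that range.
type snapshotLockMap struct {
	mu     sync.Mutex
	locked map[int64]struct{} // keyed by range ID
}

// tryLock reports whether the caller acquired the guard, returning an unlock
// function on success.
func (m *snapshotLockMap) tryLock(rangeID int64) (unlock func(), ok bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, busy := m.locked[rangeID]; busy {
		return nil, false
	}
	if m.locked == nil {
		m.locked = make(map[int64]struct{})
	}
	m.locked[rangeID] = struct{}{}
	return func() {
		m.mu.Lock()
		defer m.mu.Unlock()
		delete(m.locked, rangeID)
	}, true
}
```

Because a guard like this lives only on the current leaseholder store, a lease transfer leaves the new leaseholder
unaware of it, which is exactly why the locks are best-effort.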

## Sending Snapshots

The snapshot process itself is broken into three parts: generating, transmitting and applying the snapshot. This process
is the same whether it is invoked by the `replicateQueue` or `raftSnapshotQueue`.

### Generating the snapshot

A snapshot is a bulk transfer of all replicated data in a range and everything the replica needs to be a member of a
Raft group. It consists of a consistent view of the state of some replica of a range as of an applied index. The
`GetSnapshot` method takes a storage engine snapshot, which reflects the replicated state as of the applied index at
which the snapshot is generated, and creates an iterator from the storage snapshot. A storage engine snapshot is
important to ensure that multiple iterators created at different points in time see a consistent replicated state
that does not change while the snapshot is streamed. In our current code, the engine snapshot is not strictly
necessary for correctness, as only one iterator is constructed.
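A condensed sketch of this step, with hypothetical interfaces standing in for the storage engine:

```go
package sketch

// Hypothetical interfaces standing in for the storage engine.
type engine interface {
	// NewSnapshot pins a consistent, immutable view of the store.
	NewSnapshot() reader
}

type reader interface {
	// NewIterator returns an iterator over [start, end), positioned on the
	// first key.
	NewIterator(start, end []byte) iterator
	Close()
}

type iterator interface {
	Valid() bool
	Next()
	Key() []byte
	Value() []byte
	Close()
}

// streamSnapshot sketches snapshot generation: pin an engine snapshot (which
// reflects the replicated state as of the applied index) and stream the
// range's data from an iterator over that snapshot. Because the engine
// snapshot is immutable, every iterator created from it observes the same
// state, even while the live engine keeps accepting writes.
func streamSnapshot(eng engine, start, end []byte, emit func(k, v []byte) error) error {
	snap := eng.NewSnapshot()
	defer snap.Close()
	it := snap.NewIterator(start, end)
	defer it.Close()
	for ; it.Valid(); it.Next() {
		if err := emit(it.Key(), it.Value()); err != nil {
			return err
		}
	}
	return nil
}
```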

### Transmitting the snapshot

The snapshot transfer is sent through a bi-directional stream of snapshot requests and snapshot responses. However,
before the actual range data is sent, the streaming RPC first sends a header message to the recipient store and
blocks until the store responds, accepting or rejecting the snapshot data.

The recipient checks the following conditions before accepting the snapshot. We currently allow only one snapshot
application on a receiver store at a time. Consequently, if there are snapshot applications already in-flight on the
receiver, incoming snapshots are throttled, or rejected if they wait too long. The receiver then checks whether its
store has a compatible replica present to which the snapshot can be applied; in the case of up-replication, we
expect the learner replica to be present on the receiver store. If the snapshot overlaps an existing replica or a
replica placeholder on the receiver, it is rejected as well. Once these conditions have been verified, a response
message is sent back to the sender.
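A toy version of the accept/reject handshake; the real protocol is a bi-directional gRPC stream, and the types below
are hypothetical:

```go
package sketch

import "errors"

// snapshotHeader is a hypothetical stand-in for the header message the
// sender transmits before any range data.
type snapshotHeader struct {
	rangeID int64
}

type receiverStore struct {
	inFlight      int                      // snapshot applications in progress
	hasCompatible func(rangeID int64) bool // e.g. an existing learner replica
	overlaps      func(rangeID int64) bool // existing replica or placeholder
}

// handleHeader mirrors the checks described above that the recipient runs
// before telling the sender to start streaming data. The real code throttles
// (waits) rather than failing immediately when another application is in
// flight, and only rejects after waiting too long.
func (s *receiverStore) handleHeader(h snapshotHeader) error {
	if s.inFlight >= 1 { // only one snapshot application at a time
		return errors.New("throttled: snapshot application already in flight")
	}
	if !s.hasCompatible(h.rangeID) {
		return errors.New("rejected: no compatible replica for this range")
	}
	if s.overlaps(h.rangeID) {
		return errors.New("rejected: overlaps an existing replica or placeholder")
	}
	return nil // accepted: respond to the sender and start receiving data
}
```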

Once the recipient has accepted, the sender uses the iterator to read chunks of size
`kv.snapshot_sender.batch_size` from storage into memory. The batch size is enforced so that the sender does not
hold a significant chunk of data in memory at any one time. In-memory batches are then created to send the snapshot
data to the recipient in streaming gRPC message chunks. Note that snapshots are rate-limited, depending on cluster
settings, so as not to overload the network. Finally, once all the data is sent, the sender sends a final message to
the receiver and waits for a response from the recipient indicating whether the snapshot was a success. On the
receiver side, the key-value pairs are used to construct multiple SSTs for direct ingestion, which prevents the
receiver from holding the entire snapshot in memory.
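A sketch of the sender loop under those constraints. The callbacks and parameters are hypothetical stand-ins for the
real kvserver machinery, and `golang.org/x/time/rate` is used purely for illustration, not necessarily what the real
code uses:

```go
package sketch

import (
	"context"

	"golang.org/x/time/rate"
)

// sendBatches sketches the sender loop described above: accumulate encoded
// key-value pairs into an in-memory batch of roughly batchSize bytes, wait
// on the rate limiter, and stream each batch as one message. The limiter's
// burst must be at least the largest batch for WaitN to succeed.
func sendBatches(
	ctx context.Context,
	next func() (kv []byte, ok bool), // yields encoded key-value pairs
	send func(batch []byte) error,    // one streaming message per batch
	batchSize int,                    // cf. kv.snapshot_sender.batch_size
	limiter *rate.Limiter,            // enforces the configured snapshot rate
) error {
	var batch []byte
	flush := func() error {
		if len(batch) == 0 {
			return nil
		}
		// Wait for rate-limiter tokens so we don't overload the network.
		if err := limiter.WaitN(ctx, len(batch)); err != nil {
			return err
		}
		err := send(batch)
		batch = batch[:0]
		return err
	}
	for kv, ok := next(); ok; kv, ok = next() {
		batch = append(batch, kv...)
		if len(batch) >= batchSize {
			if err := flush(); err != nil {
				return err
			}
		}
	}
	return flush() // send any remainder
}
```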

### Applying the snapshot

Once the snapshot is received, the receiver applies defensive checks to ensure that the correct snapshot was
received. It then ensures there is an initialized learner replica, or creates a placeholder to accept the snapshot.
Finally, the Raft snapshot message is handed to the replica's Raft node, which applies the snapshot. The placeholder
or learner is then converted to a fully initialized replica.
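A condensed sketch of the receive path (hypothetical types; the real code is considerably more careful):

```go
package sketch

type snapshotMsg struct{ rangeID int64 }

// recvStore is a hypothetical stand-in for the receiving kvserver store.
type recvStore struct {
	learners     map[int64]bool
	placeholders map[int64]bool
}

func (s *recvStore) defensiveChecks(snapshotMsg) error { return nil } // elided
func (s *recvStore) handToRaft(snapshotMsg) error      { return nil } // elided

// applySnapshot sketches the apply path described above: run defensive
// checks, make sure there is somewhere to put the data (an initialized
// learner or a placeholder), hand the message to Raft, and finish with a
// fully initialized replica.
func (s *recvStore) applySnapshot(snap snapshotMsg) error {
	if s.learners == nil {
		s.learners = map[int64]bool{}
	}
	if s.placeholders == nil {
		s.placeholders = map[int64]bool{}
	}
	if err := s.defensiveChecks(snap); err != nil {
		return err
	}
	if !s.learners[snap.rangeID] {
		s.placeholders[snap.rangeID] = true // placeholder accepts the data
	}
	if err := s.handToRaft(snap); err != nil {
		return err
	}
	// The placeholder or learner becomes a fully initialized replica.
	delete(s.placeholders, snap.rangeID)
	s.learners[snap.rangeID] = true
	return nil
}
```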
2 changes: 1 addition & 1 deletion go.mod
@@ -46,7 +46,7 @@ require (
github.com/cockroachdb/go-test-teamcity v0.0.0-20191211140407-cff980ad0a55
github.com/cockroachdb/gostdlib v1.13.0
github.com/cockroachdb/logtags v0.0.0-20211118104740-dabe8e521a4f
github.com/cockroachdb/pebble v0.0.0-20220301234049-69a82fe41c31
github.com/cockroachdb/pebble v0.0.0-20220307192532-e2b7bb844759
github.com/cockroachdb/redact v1.1.3
github.com/cockroachdb/returncheck v0.0.0-20200612231554-92cdbca611dd
github.com/cockroachdb/stress v0.0.0-20220217190341-94cf65c2a29f
4 changes: 2 additions & 2 deletions go.sum
@@ -438,8 +438,8 @@ github.com/cockroachdb/logtags v0.0.0-20211118104740-dabe8e521a4f h1:6jduT9Hfc0n
github.com/cockroachdb/logtags v0.0.0-20211118104740-dabe8e521a4f/go.mod h1:Vz9DsVWQQhf3vs21MhPMZpMGSht7O/2vFW2xusFUVOs=
github.com/cockroachdb/panicparse/v2 v2.0.0-20211103220158-604c82a44f1e h1:FrERdkPlRj+v7fc+PGpey3GUiDGuTR5CsmLCA54YJ8I=
github.com/cockroachdb/panicparse/v2 v2.0.0-20211103220158-604c82a44f1e/go.mod h1:pMxsKyCewnV3xPaFvvT9NfwvDTcIx2Xqg0qL5Gq0SjM=
github.com/cockroachdb/pebble v0.0.0-20220301234049-69a82fe41c31 h1:hkBQIZdAeGQzF3pVbL1lrPJNx/Hvwiy5HHpRf0Nz6y8=
github.com/cockroachdb/pebble v0.0.0-20220301234049-69a82fe41c31/go.mod h1:buxOO9GBtOcq1DiXDpIPYrmxY020K2A8lOrwno5FetU=
github.com/cockroachdb/pebble v0.0.0-20220307192532-e2b7bb844759 h1:PKDkU1nARLt2TL99CCEukKJIEUo2cgt9AEVlrdtzfuM=
github.com/cockroachdb/pebble v0.0.0-20220307192532-e2b7bb844759/go.mod h1:buxOO9GBtOcq1DiXDpIPYrmxY020K2A8lOrwno5FetU=
github.com/cockroachdb/redact v1.0.8/go.mod h1:BVNblN9mBWFyMyqK1k3AAiSxhvhfK2oOZZ2lK+dpvRg=
github.com/cockroachdb/redact v1.1.3 h1:AKZds10rFSIj7qADf0g46UixK8NNLwWTNdCIGS5wfSQ=
github.com/cockroachdb/redact v1.1.3/go.mod h1:BVNblN9mBWFyMyqK1k3AAiSxhvhfK2oOZZ2lK+dpvRg=
16 changes: 6 additions & 10 deletions pkg/cmd/dev/cache.go
@@ -61,25 +61,21 @@ func (d *dev) cache(cmd *cobra.Command, _ []string) error {
if clean {
return d.cleanCache(ctx)
}
if down {
return d.tearDownCache(ctx)
}
if reset {
// Errors here don't really mean much, we can just ignore them.
err := d.tearDownCache(ctx)
if err != nil {
log.Printf("%v\n", err)
}
bazelRcLine, err := d.setUpCache(ctx)
if bazelRcLine != "" {
fmt.Printf("Please add `%s` to your ~/.bazelrc\n", bazelRcLine)
}
return err
}
if down {
return d.tearDownCache(ctx)
}
bazelRcLine, err := d.setUpCache(ctx)
if bazelRcLine != "" {
fmt.Printf("Please add `%s` to your ~/.bazelrc\n", bazelRcLine)
if err != nil {
return err
}
_, err = d.checkPresenceInBazelRc(bazelRcLine)
return err
}

36 changes: 29 additions & 7 deletions pkg/cmd/dev/doctor.go
@@ -222,16 +222,10 @@ Please add one of the following to your %s/.bazelrc.user:`, workspace)
if err != nil {
return err
}
homeDir, err := os.UserHomeDir()
success, err = d.checkPresenceInBazelRc(bazelRcLine)
if err != nil {
return err
}
bazelRcContents, err := d.os.ReadFile(filepath.Join(homeDir, ".bazelrc"))
if err != nil || !strings.Contains(bazelRcContents, bazelRcLine) {
log.Printf("Please add the string `%s` to your ~/.bazelrc:\n", bazelRcLine)
log.Printf(" echo \"%s\" >> ~/.bazelrc", bazelRcLine)
success = false
}
}

if !success {
@@ -244,3 +238,31 @@
log.Println("You are ready to build :)")
return nil
}

func (d *dev) checkPresenceInBazelRc(expectedBazelRcLine string) (found bool, _ error) {
homeDir, err := os.UserHomeDir()
if err != nil {
return false, err
}
defer func() {
if !found {
log.Printf("Please add the string `%s` to your ~/.bazelrc:\n", expectedBazelRcLine)
log.Printf(" echo \"%s\" >> ~/.bazelrc", expectedBazelRcLine)
}
}()

bazelRcContents, err := d.os.ReadFile(filepath.Join(homeDir, ".bazelrc"))
if err != nil {
return false, err
}
for _, line := range strings.Split(bazelRcContents, "\n") {
if !strings.Contains(line, expectedBazelRcLine) {
continue
}
if strings.HasPrefix(strings.TrimSpace(line), "#") {
continue
}
return true, nil
}
return false, nil
}
11 changes: 11 additions & 0 deletions pkg/kv/kvserver/allocator.go
@@ -185,6 +185,17 @@ const (
nonVoterTarget
)

// replicaStatus represents whether a replica is currently alive,
// dead or decommissioning.
type replicaStatus int

const (
_ replicaStatus = iota
alive
dead
decommissioning
)

// AddChangeType returns the roachpb.ReplicaChangeType corresponding to the
// given targetReplicaType.
//
