Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recover replicas to a "good enough" store instead of the "best" store #86265

Closed
lidorcarmel opened this issue Aug 16, 2022 · 1 comment · Fixed by #86267
Closed

Recover replicas to a "good enough" store instead of the "best" store #86265

lidorcarmel opened this issue Aug 16, 2022 · 1 comment · Fixed by #86267
Assignees
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@lidorcarmel
Copy link
Contributor

lidorcarmel commented Aug 16, 2022

In order to make decommissioning faster, we want the allocator to recover replicas to a store that is valid, and has the same risk diversity as the best candidate store, but may have more ranges than other stores.

Today we try to pick one of the best stores with a low range count, and if, for example, one of the nodes is new, we see thundering herd during decommissioning. By choosing a store that is good enough we can use more stores as targets and complete the decommission process faster.

The same applies for recovering from a dead node, but not for upreplicating - meaning, if for example we change the default from 3 replicas to 5, then the allocator will still try to allocate the 2 additional replicas on one of the best stores.

The downside is an increase in cost and bandwidth for recovery because we may recover to a store which is overloaded and later rebalance the same replica to a store that is underfull.

This effort is part of #85445.

Jira issue: CRDB-18657

@lidorcarmel lidorcarmel added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Aug 16, 2022
@lidorcarmel lidorcarmel self-assigned this Aug 16, 2022
@lidorcarmel
Copy link
Contributor Author

When running with 8 nodes, decommissioning one node, and adding another node, we get:

Screen Shot 2022-08-16 at 3 37 02 PM

and with PR #86267 we get:

Screen Shot 2022-08-16 at 3 37 17 PM

We do see that the decommission is faster with the fix, but it still slows down towards the end. This should be fix by another change to the receiver side priority queue. In this test the decommissioning is more than 2x faster.

More tests here https://docs.google.com/spreadsheets/d/1QDSHwQsW68E99wQDNvNFiTj_meH5BBGX0jrTAXAOEjY/edit#gid=0

These tests are slow and noisy but overall:

  • when running without a new node, and without draining the decommissioned node first, we see a small regression in the time to decommission (we recommend to drain first, so this test is probably less important).
  • when running without a new node, but with first draining the decommissioned node, the decommission time is roughly the same.
  • when adding a new node, without a drain, the decommission time is almost 2x faster.
  • when adding a new node and draining the decommissioned node, the decommissioning time is about 37% of the original time (more than 2.5x faster).

craig bot pushed a commit that referenced this issue Aug 26, 2022
83531: rfc: add rfc for invisible index feature r=wenyihu6 a=wenyihu6

This commit adds an RFC for the invisible index feature.

Related issue: #72576, #82363

Release justification: low risk to the existing functionality; this commit just
adds rfc.

Release Note: none

86267: allocator: select a good enough store for decom/recovery r=lidorcarmel a=lidorcarmel

Until now, when decommissioning a node, or when recovering from a dead
node, the allocator tries to pick one of the best possible stores as
the target for the recovery.

Because of that, we sometimes see multiple stores recover replicas
to the same store, for example, when decommissioning a node and
at the same time adding a new node.

This PR changes the way we select a destination store by choosing
a random store out of all the stores that are "good enough" for
the replica. The risk diversity is still enforced, but we may
recover a replica to a store that is considered "over full", for
example.

Note that during upreplication the allocator will still try to use
one of the "best" stores as targets.

Fixes: #86265

Release note: None

Release justification: a relatively small change, and it can be
reverted by setting kv.allocator.recovery_store_selector=best.

86345: clusterversion: prevent upgrades from development versions r=ajwerner a=dt

This change defines a new "unstableVersionsAbove" point on the cluster
version  line, above which any cluster versions are considered unstable
development-only versions which are still subject to change.

Performing an upgrade to a version while it is still unstable leaves a
cluster in a state where it persists a version that claims it has done
that upgrade and all prior, however those upgrades are still subject to
change by nature of being unstable. If it subsequently upgraded to a
stable version, this could result in subtle and nearly impossible to
detect issues, as being at or above a particular version is used to
assume that all subsequent version upgrades _as released_ were run; on a
cluster that ran an earlier iteration of an upgrade this does not hold.

Thus to prevent clusters which upgrade to development versions from
subsequently upgrading to a stable version, we offset all development
versions -- those above the unstableVersionsAbove point -- into the far
future by adding one million to their major version e.g. v22.x-y becomes
1000022.x-y. This means an attempt to subsequently "upgrade" to a stable
version -- such as v22.2 -- will look like a downgrade and be forbidden.

On the release branch, prior to starting to publish upgradable releases,
the unstableVersionsAbove value should be set to invalidVersionKey to
reflect that all version upgrades in that release branch are now
considered to be stable, meaning they must be treated as immutable and
append-only.

Release note (ops change): clusters that are upgraded to an alpha or
other manual build from the development branch will not be able to be
subsequently upgraded to a release build.

Release justification: high-priority change to existing functionality,
to allow releasing alphas with known version upgrade bugs while ensuring
they do not subsequently upgrade into stable version but silently
corrupted clusters.

86630: kvserver: add additional testing to multiqueue r=AlexTalks a=AlexTalks

Add testing for cancelation of multi-queue requests
and fix a bug where the channel wasn't closed on task
cancelation.

Release justification: Test-only change.
Release note: None

86801: ttl: add queries to job details r=otan a=rafiss

fixes #81905

This helps with observability so users can understand what the TTL job
is doing behind the scenes.

The job details in the DB console will show:
```
ttl for defaultdb.public.t
-- for each range, iterate to find rows:
SELECT id FROM [108 AS tbl_name]
AS OF SYSTEM TIME '30s'
WHERE <crdb_internal_expiration OR ttl_expiration_expression> <= $1
AND (id) > (<range start>) AND (id) < (<range end>)
ORDER BY id
LIMIT <ttl_select_batch_size>
-- then delete with:
DELETE FROM [108 AS tbl_name]
WHERE <crdb_internal_expiration OR ttl_expiration_expression> <= $1
AND (id) IN (<rows selected above>)
```

Release note: None

Release justification: low risk change

86876: kv: Include error information in `crdb_internal.active_range_feeds` r=miretskiy a=miretskiy

Include error count, and the last error information in
`crdb_internal.active_range_feeds` table whenever rangefeed
disconnects due to an error.

Release justification: observability improvement.
Release note: None

86901: sql: fix cluster_execution_insights priority level r=j82w a=j82w

This fixes the priority level and converts it to be a string.

closes: #86900, closes #86867

Release justification: Category 2: Bug fixes and
low-risk updates to new functionality

Release note (sql change): Fix the insight
execution priority to have correct value
instead of always being default. Changed
the column to be a string to avoid
converting it in the ui.

86921: kvserver: incorporate sending queue priority into snapshot requests r=AlexTalks a=AlexTalks

This change modifies the `(Delegated)SnapshotRequest` Raft RPCs in order
to incorporate the name of the sending queue, as well as the sending
queue's priority, in order to be used to prioritize queued snapshots on
a receiving store.

Release justification: Low-risk change to existing functionality.
Release note: None

86923: dev: bump version r=yuzefovich a=yuzefovich

This commit bumps version since there appears to have been a "merge
skew" between #85095 and #86167, and somehow I had a `dev` binary that
didn't include the benchmark fix from the latter.

Release justification: test-only change.

Release note: None

Co-authored-by: wenyihu3 <[email protected]>
Co-authored-by: Lidor Carmel <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Alex Sarkesian <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: j82w <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
@craig craig bot closed this as completed in 3ce0fd5 Aug 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant