Recover replicas to a "good enough" store instead of the "best" store #86265
Comments
When running with 8 nodes, decommissioning one node, and adding another node, we get: (screenshot not preserved) and with PR #86267 we get: (screenshot not preserved)

We do see that the decommission is faster with the fix, but it still slows down towards the end. This should be fixed by another change to the receiver-side priority queue. In this test the decommissioning is more than 2x faster.

More tests here: https://docs.google.com/spreadsheets/d/1QDSHwQsW68E99wQDNvNFiTj_meH5BBGX0jrTAXAOEjY/edit#gid=0

These tests are slow and noisy but overall:
83531: rfc: add rfc for invisible index feature r=wenyihu6 a=wenyihu6

This commit adds an RFC for the invisible index feature.

Related issue: #72576, #82363

Release justification: low risk to the existing functionality; this commit just adds rfc.

Release Note: none

86267: allocator: select a good enough store for decom/recovery r=lidorcarmel a=lidorcarmel

Until now, when decommissioning a node, or when recovering from a dead node, the allocator tries to pick one of the best possible stores as the target for the recovery.

Because of that, we sometimes see multiple stores recover replicas to the same store, for example, when decommissioning a node and at the same time adding a new node.

This PR changes the way we select a destination store by choosing a random store out of all the stores that are "good enough" for the replica. The risk diversity is still enforced, but we may recover a replica to a store that is considered "over full", for example.

Note that during upreplication the allocator will still try to use one of the "best" stores as targets.

Fixes: #86265

Release note: None

Release justification: a relatively small change, and it can be reverted by setting kv.allocator.recovery_store_selector=best.

86345: clusterversion: prevent upgrades from development versions r=ajwerner a=dt

This change defines a new "unstableVersionsAbove" point on the cluster version line, above which any cluster versions are considered unstable development-only versions which are still subject to change.

Performing an upgrade to a version while it is still unstable leaves a cluster in a state where it persists a version that claims it has done that upgrade and all prior, however those upgrades are still subject to change by nature of being unstable. If it subsequently upgraded to a stable version, this could result in subtle and nearly impossible to detect issues, as being at or above a particular version is used to assume that all subsequent version upgrades _as released_ were run; on a cluster that ran an earlier iteration of an upgrade this does not hold.

Thus to prevent clusters which upgrade to development versions from subsequently upgrading to a stable version, we offset all development versions -- those above the unstableVersionsAbove point -- into the far future by adding one million to their major version e.g. v22.x-y becomes 1000022.x-y. This means an attempt to subsequently "upgrade" to a stable version -- such as v22.2 -- will look like a downgrade and be forbidden.

On the release branch, prior to starting to publish upgradable releases, the unstableVersionsAbove value should be set to invalidVersionKey to reflect that all version upgrades in that release branch are now considered to be stable, meaning they must be treated as immutable and append-only.

Release note (ops change): clusters that are upgraded to an alpha or other manual build from the development branch will not be able to be subsequently upgraded to a release build.

Release justification: high-priority change to existing functionality, to allow releasing alphas with known version upgrade bugs while ensuring they do not subsequently upgrade into stable version but silently corrupted clusters.

86630: kvserver: add additional testing to multiqueue r=AlexTalks a=AlexTalks

Add testing for cancelation of multi-queue requests and fix a bug where the channel wasn't closed on task cancelation.

Release justification: Test-only change.

Release note: None

86801: ttl: add queries to job details r=otan a=rafiss

fixes #81905

This helps with observability so users can understand what the TTL job is doing behind the scenes.

The job details in the DB console will show:

```
ttl for defaultdb.public.t
-- for each range, iterate to find rows:
SELECT id FROM [108 AS tbl_name]
AS OF SYSTEM TIME '30s'
WHERE <crdb_internal_expiration OR ttl_expiration_expression> <= $1
AND (id) > (<range start>) AND (id) < (<range end>)
ORDER BY id
LIMIT <ttl_select_batch_size>
-- then delete with:
DELETE FROM [108 AS tbl_name]
WHERE <crdb_internal_expiration OR ttl_expiration_expression> <= $1
AND (id) IN (<rows selected above>)
```

Release note: None

Release justification: low risk change

86876: kv: Include error information in `crdb_internal.active_range_feeds` r=miretskiy a=miretskiy

Include error count, and the last error information in `crdb_internal.active_range_feeds` table whenever rangefeed disconnects due to an error.

Release justification: observability improvement.

Release note: None

86901: sql: fix cluster_execution_insights priority level r=j82w a=j82w

This fixes the priority level and converts it to be a string.

closes: #86900, closes #86867

Release justification: Category 2: Bug fixes and low-risk updates to new functionality

Release note (sql change): Fix the insight execution priority to have correct value instead of always being default. Changed the column to be a string to avoid converting it in the ui.

86921: kvserver: incorporate sending queue priority into snapshot requests r=AlexTalks a=AlexTalks

This change modifies the `(Delegated)SnapshotRequest` Raft RPCs in order to incorporate the name of the sending queue, as well as the sending queue's priority, in order to be used to prioritize queued snapshots on a receiving store.

Release justification: Low-risk change to existing functionality.

Release note: None

86923: dev: bump version r=yuzefovich a=yuzefovich

This commit bumps version since there appears to have been a "merge skew" between #85095 and #86167, and somehow I had a `dev` binary that didn't include the benchmark fix from the latter.

Release justification: test-only change.

Release note: None

Co-authored-by: wenyihu3 <[email protected]>
Co-authored-by: Lidor Carmel <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Alex Sarkesian <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: j82w <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
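
For readers unfamiliar with the development-version offset described under 86345 above, here is a rough sketch of the idea in Go. It is not the clusterversion package's code: the `version` type, `devOffset`, `offsetIfUnstable`, and `canUpgrade` are hypothetical names made up for illustration, and the only facts taken from the commit message are that one million is added to the major version of anything above the unstable point, and that a move to a lower version is refused.

```go
package main

import "fmt"

// version is a hypothetical, simplified cluster version: Major.Minor-Internal.
type version struct {
	Major, Minor, Internal int
}

// devOffset is the amount added to the major version of unstable versions
// (one million, per the commit message).
const devOffset = 1000000

// less reports whether v sorts strictly before o.
func (v version) less(o version) bool {
	if v.Major != o.Major {
		return v.Major < o.Major
	}
	if v.Minor != o.Minor {
		return v.Minor < o.Minor
	}
	return v.Internal < o.Internal
}

// offsetIfUnstable pushes v into the far future when it lies above the
// unstable point; versions at or below that point are returned unchanged.
func offsetIfUnstable(v, unstableAbove version) version {
	if unstableAbove.less(v) {
		v.Major += devOffset
	}
	return v
}

// canUpgrade models the rule that a cluster may never move to a lower version.
func canUpgrade(from, to version) bool {
	return !to.less(from)
}

func main() {
	unstableAbove := version{Major: 22, Minor: 1} // assumed last stable point
	dev := version{Major: 22, Minor: 1, Internal: 30}
	stable := version{Major: 22, Minor: 2}

	persisted := offsetIfUnstable(dev, unstableAbove)
	fmt.Printf("persisted dev version: %d.%d-%d\n", persisted.Major, persisted.Minor, persisted.Internal)
	fmt.Println("upgrade to stable allowed without offset:", canUpgrade(dev, stable))
	fmt.Println("upgrade to stable allowed with offset:", canUpgrade(persisted, stable))
}
```

Under these assumptions the offset version compares greater than any stable release, so the later "upgrade" looks like a downgrade and is rejected.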
In order to make decommissioning faster, we want the allocator to recover replicas to a store that is valid, and has the same risk diversity as the best candidate store, but may have more ranges than other stores.
Today we try to pick one of the best stores with a low range count, and if, for example, one of the nodes is new, we see a thundering herd during decommissioning: many stores recover replicas to that same new store at once. By choosing a store that is good enough we can use more stores as targets and complete the decommission process faster.
The same applies for recovering from a dead node, but not for upreplicating - meaning, if for example we change the default from 3 replicas to 5, then the allocator will still try to allocate the 2 additional replicas on one of the best stores.
The downside is an increase in cost and bandwidth for recovery because we may recover to a store which is overloaded and later rebalance the same replica to a store that is underfull.
This effort is part of #85445.
Jira issue: CRDB-18657