roachtest: create test where node rejoins cluster after being down for multiple hours #115648
Some notes about recreating this manually (a rough driver sketch follows the list):
- Install docker
- Init tpc-e (takes ~1 hour)
- Run the test against the first 11 nodes
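For reference, a minimal sketch of what a driver for those steps could look like, assuming a pre-created `roachprod` cluster. The cluster name, the TPC-E image name, and all flags below are placeholders for illustration and are not taken from the actual roachtest:

```go
// repro.go: hypothetical driver for the manual reproduction steps above.
// Every cluster name, image name, and flag here is a placeholder meant to
// show the shape of the steps, not the exact commands used.
package main

import (
	"log"
	"os/exec"
)

// run shells out and aborts on the first failure so a partial setup is obvious.
func run(name string, args ...string) {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		log.Fatalf("%s %v failed: %v\n%s", name, args, err, out)
	}
	log.Printf("%s", out)
}

func main() {
	const cluster = "my-tpce-cluster" // placeholder roachprod cluster name

	// Step 1: install docker on the workload node (the TPC-E generator runs in a container).
	run("roachprod", "run", cluster+":12", "--", "sudo apt-get install -y docker.io")

	// Step 2: initialize the TPC-E dataset; expect this to take roughly an hour.
	run("roachprod", "run", cluster+":12", "--",
		"sudo docker run tpc-e-image --init --customers=100000 {pgurl:1-11}")

	// Step 3: run the workload against the first 11 nodes only, leaving the last
	// node free to be stopped and restarted during the experiment.
	run("roachprod", "run", cluster+":12", "--",
		"sudo docker run tpc-e-image --duration=4h {pgurl:1-11}")
}
```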
I have some good manual reproduction of this and will turn it into a roachtest tomorrow.
23.1.13 with changes to three settings:
As part of this investigation, I've found a number of reasons for slowdowns after the node is shut down and restarted.
The changes to the test in #116291 expose many of these issues. We are still far away from having no impact, but this test can be used as a benchmark to improve the situation over time.
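One way to use the test as a benchmark is to compare throughput windows before and after the restart. A minimal sketch, assuming per-second QPS samples are already collected; the sample source and the threshold here are illustrative, not part of the test as written:

```go
// throughput.go: hypothetical helper for comparing cluster throughput before
// and after a node restart. The samples and the 50% threshold are assumptions
// for illustration; the real test derives these from the workload's stats.
package main

import "fmt"

// meanQPS averages a window of per-second throughput samples.
func meanQPS(samples []float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	var sum float64
	for _, s := range samples {
		sum += s
	}
	return sum / float64(len(samples))
}

// passes reports whether post-restart throughput stayed above the given
// fraction of the pre-restart baseline (e.g. 0.5 for "above 50%").
func passes(before, after []float64, fraction float64) bool {
	return meanQPS(after) >= fraction*meanQPS(before)
}

func main() {
	before := []float64{1000, 980, 1020, 990} // QPS while all nodes are healthy
	after := []float64{620, 640, 600, 660}    // QPS after the node rejoins
	fmt.Printf("ok=%v (after=%.0f, before=%.0f)\n",
		passes(before, after, 0.5), meanQPS(after), meanQPS(before))
}
```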
In escalations we have seen different behaviors with more fill and longer outages using the local disk. This test fills with data for 2 hours before starting the outages. It attempts to mitigate the outage by disabling foreground admission control (AC) and setting the lease preference to shed.

Epic: none
Fixes: cockroachdb#115648
Release note: None
119783: roachtest: make the outage for kv/restart longer r=andrewbaptist a=kvoli

In escalations we have seen different behaviors with more fill and longer outages using the local disk. `kv/restart/nodes=12` fills with data for 2 hours before stopping one node for 10 minutes and asserting that cluster throughput remains above 50% after it rejoins.

Resolves: #115648
Release note: None

120894: server: sql activity handler refactor r=xinhaoz a=dhartunian

This PR is a collection of 3 commits that each apply the same refactor to different functions in the `combined_statement_stats.go` file.

Stats retrieval works by automatically deciding which of 3 different tables to query. These are generally referred to as "activity" (top-K rollups), "persisted" (periodically aggregated stats across all nodes from in-memory data), and "combined" (a union of in-memory and "persisted" data, aggregated on demand).

Previously, each of these functions made its own decision about which table to query based on whether data would be returned. The methods would each try "activity" -> "persisted" -> "combined" in that order until **something** returned, and then they would return that data.

Now, we make this determination before dispatching the request and use the table selection to choose the specific method for the data request. This ensures consistency between the query that computes the denominator for "% of runtime" metrics and the ones that return the individual statement stats.

This PR is marked as a backport because it is a prerequisite to enforcing pre-flight result size restrictions on this handler (#120443).

Part of #120443
Epic: None
Release note: None

Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: David Hartunian <[email protected]>
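The shape of that refactor, abstracted away from the actual handler code, might look roughly like the sketch below; the type and function names are invented for illustration and are not the real identifiers from `combined_statement_stats.go`:

```go
// statsource.go: hypothetical sketch of the "decide the table first, then
// dispatch" pattern described above. All names are illustrative.
package main

import "fmt"

// StatsTable identifies which backing table a stats request should read from.
type StatsTable int

const (
	ActivityTable  StatsTable = iota // top-K rollups
	PersistedTable                   // periodically aggregated, all nodes
	CombinedTable                    // in-memory + persisted, aggregated on demand
)

// hasData stands in for the cheap existence checks the handler runs up front.
type hasData func(table StatsTable) bool

// pickTable makes the table decision once, before any stats queries run, so
// every downstream query (totals and per-statement stats) reads the same table.
func pickTable(exists hasData) StatsTable {
	if exists(ActivityTable) {
		return ActivityTable
	}
	if exists(PersistedTable) {
		return PersistedTable
	}
	return CombinedTable
}

func main() {
	// Pretend only the persisted table has rows for the requested time range.
	exists := func(t StatsTable) bool { return t == PersistedTable }
	table := pickTable(exists)

	// Both the "% of runtime" denominator and the per-statement stats now read
	// from the same table, which is the consistency the refactor is after.
	fmt.Println("querying table:", table)
}
```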
Is your feature request related to a problem? Please describe.
When a cockroach node rejoins the cluster, its (remaining) replicas need to be caught up. The catchup can cause IO saturation leading to increased tail latency.
Describe the solution you'd like
Write a roachtest, similar to `kv/restart/nodes=12`, which brings nodes down for 10 minutes. In this test, a node should remain down for at least an hour and contain a large enough number of replicas (bytes) that it will not have all its replicas replaced within the hour. The test should assert or observe the impact of the rejoining node on cluster throughput, with a moderately write-heavy workload running throughout.
See earlier work done in #96521 and the catch-up pacing proposed in #98710.
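A minimal outline of how such a test could be structured, under stated assumptions: the `Cluster` interface and the workload/measurement helpers below are placeholders standing in for the real roachtest APIs, and only the timeline follows the description above.

```go
// extended_restart_sketch.go: hypothetical outline of the proposed test.
// The Cluster interface and helper parameters are placeholders, not the real
// roachtest interfaces; node numbers and durations are illustrative.
package sketch

import (
	"context"
	"fmt"
	"time"
)

// Cluster is a stand-in for the roachtest cluster handle.
type Cluster interface {
	StopNode(ctx context.Context, node int) error
	StartNode(ctx context.Context, node int) error
}

// runExtendedRestart sketches the proposed timeline: take a node down for over
// an hour while a write-heavy workload keeps running, then restart it and
// observe the impact on throughput as its replicas catch up.
func runExtendedRestart(ctx context.Context, c Cluster,
	runWorkload func(ctx context.Context, d time.Duration) error,
	measureQPS func(ctx context.Context, window time.Duration) (float64, error),
) error {
	const downNode = 12
	const outage = 65 * time.Minute // long enough that replicas are not all replaced

	// Baseline throughput with all nodes healthy.
	baseline, err := measureQPS(ctx, 10*time.Minute)
	if err != nil {
		return err
	}

	// Take the node down and keep the write-heavy workload running.
	if err := c.StopNode(ctx, downNode); err != nil {
		return err
	}
	if err := runWorkload(ctx, outage); err != nil {
		return err
	}

	// Restart the node and observe throughput while it catches up.
	if err := c.StartNode(ctx, downNode); err != nil {
		return err
	}
	recovered, err := measureQPS(ctx, 30*time.Minute)
	if err != nil {
		return err
	}

	fmt.Printf("baseline=%.0f qps, during catch-up=%.0f qps (%.0f%%)\n",
		baseline, recovered, 100*recovered/baseline)
	return nil
}
```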
Jira issue: CRDB-34142