roachtest: create test where node rejoins cluster after being down for multiple hours #115648

Closed

kvoli opened this issue Dec 5, 2023 · 3 comments · Fixed by #119783
Assignees
Labels
A-kv Anything in KV that doesn't belong in a more specific category. A-kv-distribution Relating to rebalancing and leasing. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@kvoli
Collaborator

kvoli commented Dec 5, 2023

Is your feature request related to a problem? Please describe.

When a cockroach node rejoins the cluster, its (remaining) replicas need to be caught up. The catchup can cause IO saturation leading to increased tail latency.

Describe the solution you'd like

Write a roachtest similar to kv/restart/nodes=12, which brings nodes down for 10 minutes. In this test, a node should remain down for at least an hour and hold a large enough volume of replica data (bytes) that not all of its replicas are replaced within that hour.

The test should assert on, or at least observe, the rejoining node's impact on cluster throughput, with a moderately write-heavy workload running throughout.

See the earlier work in #96521 and the catch-up pacing proposed in #98710.
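
Sketched concretely (reusing the roachprod/kv-workload commands from the reproductions in the comments below; the one-hour outage is the key change and the exact flags are assumptions):

# Hypothetical core of the requested test, run after a multi-hour fill with a
# write-heavy kv workload still running in the background:
roachprod stop $CLUSTER:12 --sig 15   # graceful shutdown of one node
sleep 3600                            # keep it down for at least an hour
roachprod start $CLUSTER:12
# While node 12 catches up, record workload throughput and tail latency.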

Jira issue: CRDB-34142

@kvoli kvoli added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-distribution Relating to rebalancing and leasing. A-kv Anything in KV that doesn't belong in a more specific category. labels Dec 5, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Dec 9, 2023
@kvoli kvoli added O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs T-kv KV Team and removed T-kv KV Team O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs labels Dec 9, 2023
@andrewbaptist
Collaborator

andrewbaptist commented Dec 11, 2023

Some notes about recreating this manually.
Create cluster:

roachprod create -n 13 $CLUSTER
roachprod stage $CLUSTER release v23.1.12 
roachprod start $CLUSTER:1-12

Install Docker:

roachprod ssh $CLUSTER:13
sudo apt install docker.io

Init TPC-E (takes ~1 hour):

roachprod ssh $CLUSTER:13 "sudo docker run cockroachdb/tpc-e:latest --init -c25000 --hosts $(roachprod ip $CLUSTER:1)"

Run the test against the first 11 nodes:

roachprod ssh $CLUSTER:13 "sudo docker run cockroachdb/tpc-e:latest -c25000 -d30m $(roachprod ip $CLUSTER:1-11 | sed s/^/--hosts=/ | tr '\n' ' ')"
Then:
  • Wait 5 minutes for the workload
  • Gracefully stop node 12
  • Wait 5 more minutes
  • Start node 12

sleep 300
roachprod stop $CLUSTER:12 --sig 15
sleep 300
roachprod start $CLUSTER:12
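
One way to watch the catch-up after node 12 restarts (a sketch; `roachprod adminurl`, its output format, and the `ranges_underreplicated` metric name are assumptions from memory, so verify locally):

# Hypothetical check: poll the under-replicated range count from node 1's
# Prometheus endpoint while node 12 is down and after it rejoins.
# Assumes `roachprod adminurl` prints a URL ending in "/".
while true; do
  curl -ks "$(roachprod adminurl $CLUSTER:1 | head -1)_status/vars" | grep '^ranges_underreplicated'
  sleep 30
done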

@andrewbaptist
Collaborator

I have a good manual reproduction of this and will turn it into a roachtest tomorrow.
The set of commands is:

roachprod create -n 13 $CLUSTER --local-ssd
roachprod stage $CLUSTER release v23.1.12 
roachprod start $CLUSTER:1-12

roachprod ssh $CLUSTER:1 "./cockroach workload init kv --splits 5000"
roachprod ssh $CLUSTER:13 "./cockroach workload run kv {pgurl:1-11} --read-percent 10 --max-block-bytes 4096 --concurrency 256 --max-rate 12000"

# Wait ~2 hours to let sufficient data fill the system. The workload keeps running during the following steps.

roachprod stop $CLUSTER:12 --sig 15
sleep 600
roachprod start $CLUSTER:12
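
To quantify the dip, the workload's per-second output can be captured (e.g. with tee into a hypothetical kv-run.log) and scanned afterwards; the column positions below assume the standard `cockroach workload` histogram output:

# Hypothetical post-processing: print elapsed, ops/sec(inst), and p99(ms) for each
# per-second, per-operation line so throughput and tail latency around the restart
# are easy to compare.
awk '/^ *[0-9]+\.[0-9]s/ {print $1, $3, $7}' kv-run.log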

22.2.13 default settings:
[screenshot]

23.1.13 default settings:
[screenshot]

23.1.13 with changes to three settings:

  • Admission control (AC) disabled for foreground traffic (the default is on for both foreground and background)
  • Lease transfer enforcement set to shed (the default is block_transfer_to)
  • Lease IO overload threshold set to 0.3 (the default is 0.5 in 23.1 and 0.3 in 23.2)

[screenshot]
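
For reference, a sketch of how the three changes above could be applied as cluster settings; the setting names are my best guess for 23.1 and should be verified against the version under test:

# Hypothetical: apply the three changes as cluster settings (names assumed; verify
# with SHOW CLUSTER SETTINGS before relying on them).
roachprod sql $CLUSTER:1 -- \
  -e "SET CLUSTER SETTING admission.kv.enabled = false" \
  -e "SET CLUSTER SETTING kv.allocator.lease_io_overload_threshold_enforcement = 'shed'" \
  -e "SET CLUSTER SETTING kv.allocator.lease_io_overload_threshold = 0.3"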

@andrewbaptist
Collaborator

As part of this investigation, I've found a number of reasons for slowdowns after the node is shut down and restarted:

  1. Raft catch-up traffic floods the disk
  2. High scan rate against meta2 during lease and replica movement (both after the node is declared dead and after it comes back)
  3. High snapshot transfer rate can slow down foreground traffic
  4. Deletions of replicas that were moved off the down node cause compactions to focus on the lower levels
  5. Premature movement of leases to the restarted node while it is still overloaded
  6. AC slowing foreground traffic due to high goroutine counts and IO overload

The changes to the test in #116291 expose many of these issues. We are still far from having no impact, but the test can be used as a benchmark to improve the situation over time.
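
For item 3 specifically, one knob that already exists is the snapshot rate cap (a sketch of a mitigation, not the pacing work proposed in #98710; the value is illustrative):

# Hypothetical mitigation: cap snapshot transfer bandwidth (the scope of this
# setting varies by version) so catch-up snapshots compete less with foreground writes.
roachprod sql $CLUSTER:1 -- -e "SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '32MiB'"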

andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue Dec 15, 2023
In escalations we have seen different behaviors with more fill and
longer outages using the local disk. This test fills with data for 2
hours before starting the outages. It attempts to mitigate the outage by
disabling foreground AC and setting lease preference to shed.

Epic: none
Fixes: cockroachdb#115648

Release note: None
@kvoli kvoli self-assigned this Feb 29, 2024
craig bot pushed a commit that referenced this issue Mar 28, 2024
119783: roachtest: make the outage for kv/restart longer r=andrewbaptist a=kvoli

In escalations we have seen different behaviors with more fill and longer outages using the local disk. `kv/restart/nodes=12` fills with data for 2 hours before stopping one node for 10 minutes and asserting that cluster throughput remains above 50% after it rejoins.

Resolves: #115648
Release note: None

120894: server: sql activity handler refactor r=xinhaoz a=dhartunian

This PR is a collection of 3 commits that each execute the same refactor on different functions in the `combined_statement_stats.go` file.

The way stats retrieval works is that we have logic that determines automatically which of 3 different tables to query. These are generally referred to as "activity" (topK rollups), "persisted" (periodically aggregated stats across all nodes from in-memory data), and "combined" (union of data from in-memory and "persisted", aggregated on demand).

Previously, each of these functions would make its own decision about which table to query based on whether data would be returned. Each method would try "activity" -> "persisted" -> "combined" in that order until **something** returned, and then return that data.

Now, we make this determination prior to dispatching the request and use the table selection to inform the specific method necessary for the data request. This ensures consistency between the query that computes the denominator for "% of runtime" metrics, and the ones that return the individual statement stats.

This PR is marked as a backport because it is a prerequisite to enforcing pre-flight result size restrictions on this handler (#120443).

Part of #120443
Epic: None
Release note: None

Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: David Hartunian <[email protected]>
@craig craig bot closed this as completed in 5c85ecb Mar 28, 2024
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024