roachtest: create test where node rejoins cluster after being down for multiple hours #115648

Closed

kvoli opened this issue Dec 5, 2023 · 3 comments · Fixed by #119783
Assignees
Labels
A-kv Anything in KV that doesn't belong in a more specific category. A-kv-distribution Relating to rebalancing and leasing. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@kvoli
Collaborator

kvoli commented Dec 5, 2023

Is your feature request related to a problem? Please describe.

When a cockroach node rejoins the cluster, its (remaining) replicas need to be caught up. The catchup can cause IO saturation leading to increased tail latency.

Describe the solution you'd like

Write a roachtest similar to kv/restart/nodes=12, which brings nodes down for 10 minutes. In this test, a node should remain down for at least an hour and hold a large enough volume of replica data (bytes) that not all of its replicas are replaced within that hour.

The test should assert on, or at least observe, the rejoining node's impact on cluster throughput, with a moderately write-heavy workload running throughout.

See the earlier work in #96521 and the catch-up pacing proposed in #98710.
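
Sketched concretely (reusing the roachprod/kv-workload commands from the reproductions in the comments below; the one-hour outage is the key change and the exact flags are assumptions):

# Hypothetical core of the requested test, run after a multi-hour fill with a
# write-heavy kv workload still running in the background:
roachprod stop $CLUSTER:12 --sig 15   # graceful shutdown of one node
sleep 3600                            # keep it down for at least an hour
roachprod start $CLUSTER:12
# While node 12 catches up, record workload throughput and tail latency.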

Jira issue: CRDB-34142

@kvoli kvoli added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-distribution Relating to rebalancing and leasing. A-kv Anything in KV that doesn't belong in a more specific category. labels Dec 5, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Dec 9, 2023
@kvoli kvoli added O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs T-kv KV Team and removed T-kv KV Team O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs labels Dec 9, 2023
@andrewbaptist
Collaborator

andrewbaptist commented Dec 11, 2023

Some notes about recreating this manually.
Create cluster:

roachprod create -n 13 $CLUSTER
roachprod stage $CLUSTER release v23.1.12 
roachprod start $CLUSTER:1-12

Install Docker:

roachprod ssh $CLUSTER:13
sudo apt install docker.io

Init TPC-E (takes ~1 hour):

roachprod ssh $CLUSTER:13 "sudo docker run cockroachdb/tpc-e:latest --init -c25000 --hosts $(roachprod ip $CLUSTER:1)"

Run the test against the first 11 nodes:

roachprod ssh $CLUSTER:13 "sudo docker run cockroachdb/tpc-e:latest -c25000 -d30m $(roachprod ip $CLUSTER:1-11 | sed s/^/--hosts=/ | tr '\n' ' ')"
Then:
  • Wait 5 minutes for the workload
  • Gracefully stop node 12
  • Wait 5 more minutes
  • Start node 12

sleep 300
roachprod stop $CLUSTER:12 --sig 15
sleep 300
roachprod start $CLUSTER:12
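
One way to watch the catch-up after node 12 restarts (a sketch; `roachprod adminurl`, its output format, and the `ranges_underreplicated` metric name are assumptions from memory, so verify locally):

# Hypothetical check: poll the under-replicated range count from node 1's
# Prometheus endpoint while node 12 is down and after it rejoins.
# Assumes `roachprod adminurl` prints a URL ending in "/".
while true; do
  curl -ks "$(roachprod adminurl $CLUSTER:1 | head -1)_status/vars" | grep '^ranges_underreplicated'
  sleep 30
done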

@andrewbaptist
Collaborator

I have a good manual reproduction of this and will turn it into a roachtest tomorrow.
The set of commands is:

roachprod create -n 13 $CLUSTER --local-ssd
roachprod stage $CLUSTER release v23.1.12 
roachprod start $CLUSTER:1-12

roachprod ssh $CLUSTER:1 "./cockroach workload init kv --splits 5000"
roachprod ssh $CLUSTER:13 "./cockroach workload run kv {pgurl:1-11} --read-percent 10 --max-block-bytes 4096 --concurrency 256 --max-rate 12000"

# Wait ~2 hours to let sufficient data fill the system. The workload keeps running during the following steps.

roachprod stop $CLUSTER:12 --sig 15
sleep 600
roachprod start $CLUSTER:12
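
To quantify the dip, the workload's per-second output can be captured (e.g. with tee into a hypothetical kv-run.log) and scanned afterwards; the column positions below assume the standard `cockroach workload` histogram output:

# Hypothetical post-processing: print elapsed, ops/sec(inst), and p99(ms) for each
# per-second, per-operation line so throughput and tail latency around the restart
# are easy to compare.
awk '/^ *[0-9]+\.[0-9]s/ {print $1, $3, $7}' kv-run.log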

22.2.13 default settings:
[screenshot]

23.1.13 default settings:
[screenshot]

23.1.13 with changes to three settings:

  • Admission control (AC) disabled for foreground traffic (the default is on for both foreground and background)
  • Lease transfer enforcement set to shed (the default is block_transfer_to)
  • Lease IO overload threshold set to 0.3 (the default is 0.5 in 23.1 and 0.3 in 23.2)

[screenshot]
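
For reference, a sketch of how the three changes above could be applied as cluster settings; the setting names are my best guess for 23.1 and should be verified against the version under test:

# Hypothetical: apply the three changes as cluster settings (names assumed; verify
# with SHOW CLUSTER SETTINGS before relying on them).
roachprod sql $CLUSTER:1 -- \
  -e "SET CLUSTER SETTING admission.kv.enabled = false" \
  -e "SET CLUSTER SETTING kv.allocator.lease_io_overload_threshold_enforcement = 'shed'" \
  -e "SET CLUSTER SETTING kv.allocator.lease_io_overload_threshold = 0.3"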

@andrewbaptist
Collaborator

As part of this investigation, I've found a number of reasons for slowdowns after the node is shut down and restarted:

  1. Raft catch-up traffic floods the disk
  2. High scan rate against meta2 during lease and replica movement (both after the node is declared dead and after it comes back)
  3. High snapshot transfer rate can slow down foreground traffic
  4. Deletions of replicas that were moved off the down node cause compactions to focus on the lower levels
  5. Premature movement of leases to the restarted node while it is still overloaded
  6. AC slowing foreground traffic due to high goroutine counts and IO overload

The changes to the test in #116291 expose many of these issues. We are still far from having no impact, but the test can be used as a benchmark to improve the situation over time.
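
For item 3 specifically, one knob that already exists is the snapshot rate cap (a sketch of a mitigation, not the pacing work proposed in #98710; the value is illustrative):

# Hypothetical mitigation: cap snapshot transfer bandwidth (the scope of this
# setting varies by version) so catch-up snapshots compete less with foreground writes.
roachprod sql $CLUSTER:1 -- -e "SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '32MiB'"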

andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue Dec 15, 2023
In escalations we have seen different behaviors with more fill and
longer outages using the local disk. This test fills with data for 2
hours before starting the outages. It attempts to mitigate the outage by
disabling foreground AC and setting lease preference to shed.

Epic: none
Fixes: cockroachdb#115648

Release note: None
@kvoli kvoli self-assigned this Feb 29, 2024
craig bot pushed a commit that referenced this issue Mar 28, 2024
119783: roachtest: make the outage for kv/restart longer r=andrewbaptist a=kvoli

In escalations we have seen different behaviors with more fill and longer outages using the local disk. `kv/restart/nodes=12` fills with data for 2 hours before stopping one node for 10 minutes and asserting that cluster throughput remains above 50% after it rejoins.

Resolves: #115648
Release note: None

120894: server: sql activity handler refactor r=xinhaoz a=dhartunian

This PR is a collection of 3 commits that each execute the same refactor on different functions in the `combined_statement_stats.go` file.

The way stats retrieval works is that we have logic that determines automatically which of 3 different tables to query. These are generally referred to as "activity" (topK rollups), "persisted" (periodically aggregated stats across all nodes from in-memory data), and "combined" (union of data from in-memory and "persisted", aggregated on demand).

Previously, each of these functions would make its own decision about which table to query based on whether data would be returned. Each method would try "activity" -> "persisted" -> "combined" in that order until **something** returned, and then return that data.

Now, we make this determination prior to dispatching the request and use the table selection to inform the specific method necessary for the data request. This ensures consistency between the query that computes the denominator for "% of runtime" metrics, and the ones that return the individual statement stats.

This PR is marked as a backport because it is a prerequisite to enforcing pre-flight result size restrictions on this handler (#120443).

Part of #120443
Epic: None
Release note: None

Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: David Hartunian <[email protected]>
@craig craig bot closed this as completed in 5c85ecb Mar 28, 2024
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024