release-22.1: liveness: improve disk probes during node liveness updates #81476
Conversation
Force-pushed from 0b54478 to 6da8e80
Thanks for opening a backport. Please check the backport criteria before merging:
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
Add a brief release justification to the body of your PR to justify this backport. Some other things to consider:
LGTM but maybe a week or two of bake on master?
Yep, for sure.
When `NodeLiveness` updates the liveness record (e.g. during heartbeats), it first does a noop sync write to all disks. This ensures that a node with a stalled disk will fail to maintain liveness and lose its leases.

However, this sync write could block indefinitely, and would not respect the caller's context, which could cause the caller to stall rather than time out. This in turn could lead to stalls higher up in the stack, in particular with lease acquisitions that do a synchronous heartbeat.

This patch does the sync write in a separate goroutine in order to respect the caller's context. The write operation itself will not (can not) respect the context, and may thus leak a goroutine. However, concurrent sync writes will coalesce onto an in-flight write.

Additionally, this runs the sync writes in parallel across all disks, since we can now trivially do so. This may be advantageous on nodes with many stores, to avoid spurious heartbeat failures under load.

Release note (bug fix): Disk write probes during node liveness heartbeats will no longer get stuck on stalled disks, instead returning an error once the operation times out. Additionally, disk probes now run in parallel on nodes with multiple stores.
Release note: None
This patch runs the sync disk write during node heartbeats in a stopper task. The write is done in a goroutine so that we can respect the caller's context cancellation (even though the write itself won't). However, this could race with engine shutdown when stopping the node, violating the Pebble contract and triggering the race detector. Running it as a stopper task makes the node wait for the disk write to complete before closing the engine.

Of course, if the disk stalls, node shutdown will now never complete. This is very unfortunate, since stopping the node is often the only mitigation to recover stuck ranges with stalled disks. It is mitigated by Pebble panicking the node on stalled disks, and by Kubernetes and other orchestration tools killing the process after some time.

Release note: None
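The shutdown-ordering guarantee described above can be modeled with a minimal sketch. `miniStopper` is a hypothetical stand-in, not CockroachDB's actual `stop.Stopper` API: `Stop` waits for every registered task, so the engine can only be closed after any in-flight disk write has returned.

```go
package main

import (
	"fmt"
	"sync"
)

// miniStopper is a simplified model of a stopper (not CockroachDB's
// actual stop.Stopper API): Stop waits for every registered task, so
// the engine is only closed after any in-flight disk write returns.
type miniStopper struct{ wg sync.WaitGroup }

// RunTask runs f asynchronously and registers it with the stopper.
func (s *miniStopper) RunTask(f func()) {
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		f()
	}()
}

// Stop blocks until all registered tasks have finished.
func (s *miniStopper) Stop() { s.wg.Wait() }

func main() {
	s := &miniStopper{}
	s.RunTask(func() {
		// Stand-in for the sync disk write. If this stalled,
		// Stop below would block and node shutdown would hang,
		// which is exactly the trade-off the commit describes.
	})
	s.Stop() // waits for the write, so closing the engine cannot race it
	fmt.Println("all tasks drained; safe to close engine")
}
```

The cost of this ordering is visible in the sketch: a stalled write keeps `Stop` from ever returning, which is why the commit leans on Pebble's stall panic and the orchestrator's process kill as backstops.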
Force-pushed from 3e03cc0 to 97054ae
/cc @cockroachdb/release
When `NodeLiveness` updates the liveness record (e.g. during heartbeats), it first does a noop sync write to all disks. This ensures that a node with a stalled disk will fail to maintain liveness and lose its leases.
However, this sync write could block indefinitely, and would not respect
the caller's context, which could cause the caller to stall rather than
time out. This in turn could lead to stalls higher up in the stack,
in particular with lease acquisitions that do a synchronous heartbeat.
This patch does the sync write in a separate goroutine in order to
respect the caller's context. The write operation itself will not
(can not) respect the context, and may thus leak a goroutine. However,
concurrent sync writes will coalesce onto an in-flight write.
Additionally, this runs the sync writes in parallel across all disks,
since we can now trivially do so. This may be advantageous on nodes with
many stores, to avoid spurious heartbeat failures under load.
Touches #81100.
Release note (bug fix): Disk write probes during node liveness
heartbeats will no longer get stuck on stalled disks, instead returning
an error once the operation times out. Additionally, disk probes now run
in parallel on nodes with multiple stores.
Release justification: cluster availability improvement.