-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvcoord: reap DistSender circuit breaker probes during GC #120946
kvcoord: reap DistSender circuit breaker probes during GC #120946
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
I'm not thrilled about the amount of synchronization needed to reap probes, but it gets the job done and isn't too bad I suppose. Open to suggestions on how to simplify. |
3ae54c8
to
86d86ae
Compare
3b98315
to
c25dd6e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r5, 1 of 1 files at r6, 1 of 1 files at r7, 1 of 1 files at r8, 1 of 1 files at r9, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @erikgrinaker)
pkg/kv/kvclient/kvcoord/dist_sender_circuit_breaker.go
line 671 at r5 (raw file):
err := r.sendProbe(ctx, transport) // If the context failed, we're shutting down. Just exit the probe without
"If the context with no timeout failed,"
pkg/kv/kvclient/kvcoord/dist_sender_circuit_breaker.go
line 781 at r5 (raw file):
// either the remote node or the local node has been decommissioned. We // rely on RPC circuit breakers to fail fast for these, so there's no point // in us probing individual replicas. Stop probing.
"Stop probing and untrip the breaker if it was tripped."
pkg/kv/kvclient/kvcoord/dist_sender_circuit_breaker.go
line 282 at r8 (raw file):
// replicas. But it can happen if we race with a probe launch, or if // probes run longer than the GC threshold. The map removal above will // prevent new probes being launched (checked by launchProbe).
"being launched for this same ReplicaCircuitBreaker"
pkg/kv/kvclient/kvcoord/dist_sender_circuit_breaker.go
line 353 at r8 (raw file):
cancelFns map[*kvpb.BatchRequest]context.CancelCauseFunc // reapProbe, if non-nil, can be called to synchronously cancel a running
Is it important that this cancellation is synchronous? I'd be concerned that this allows a circuit breaker that's either slow to respond to context cancellation, or that's broken entirely, to stall the gc loop. This change would be more obviously correct if the reapProbe
function was non-blocking. We could then also call it under lock in ReplicaCircuitBreaker.reapProbe
and set mu.reapProbe
to nil right after, to ensure it is only called once.
EDIT: I guess you want this to be synchronous so that you can perform the cb.isTripped
check after without risk of racing with a probe or double accounting in the ReplicasTripped
metric.
pkg/kv/kvclient/kvcoord/dist_sender_circuit_breaker.go
line 868 at r8 (raw file):
} // reapProbe synchronously reaps the currently running probe, if any.
Consider renaming to maybeReapProbe
so that it's contract is clearer and it can't be confused wit the reapProbe
field.
c25dd6e
to
9d1f831
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments addressed.
Since there may still be some uncertainty about whether this is the best approach to solve this problem, I wrote up a GA blocker issue for it (#121030), so we don't have to rush this in before the branch cut. I also split out the semi-unrelated changes to separate PRs that we can merge before the branch cut: #121028 and #121029.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/dist_sender_circuit_breaker.go
line 353 at r8 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is it important that this cancellation is synchronous? I'd be concerned that this allows a circuit breaker that's either slow to respond to context cancellation, or that's broken entirely, to stall the gc loop. This change would be more obviously correct if the
reapProbe
function was non-blocking. We could then also call it under lock inReplicaCircuitBreaker.reapProbe
and setmu.reapProbe
to nil right after, to ensure it is only called once.EDIT: I guess you want this to be synchronous so that you can perform the
cb.isTripped
check after without risk of racing with a probe or double accounting in theReplicasTripped
metric.
Yes, exactly -- the whole reason for this synchronization is to avoid GC and the probe racing to update the metric, so we need to know that the probe has shut down first (assuming one is even running, which it may or may not be).
Thought of a way to possibly simplify this. Let me submit a new take tomorrow. |
This separates the goroutine management from the probe loop logic. Epic: none Release note: None
This prevents probe leaks in case they are long-running, and will allow removing GCed circuit breakers from the tripped metrics. Epic: none Release note: None
Otherwise, these will register as tripped forever. Epic: none Release note: None
9d1f831
to
50b9987
Compare
See #121142 for a simpler alternative. I think I prefer it. |
Closing in favor of #121142. |
kvcoord: split out DistSender circuit breaker
probeLoop
This separates the goroutine management from the probe loop logic.
kvcoord: reap DistSender circuit breaker probes during GC
This prevents probe leaks in case they are long-running, and will allow removing GCed circuit breakers from the tripped metrics.
kvcoord: remove GCed DistSender circuit breakers from tripped metrics
Otherwise, these will register as tripped forever.
Resolves #121030.
Epic: none
Release note: None