Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvcoord: reap DistSender circuit breaker probes during GC #120946

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented Mar 24, 2024

kvcoord: split out DistSender circuit breaker probeLoop

This separates the goroutine management from the probe loop logic.

kvcoord: reap DistSender circuit breaker probes during GC

This prevents probe leaks in case they are long-running, and will allow removing GCed circuit breakers from the tripped metrics.

kvcoord: remove GCed DistSender circuit breakers from tripped metrics

Otherwise, these will register as tripped forever.

Resolves #121030.
Epic: none
Release note: None

@erikgrinaker erikgrinaker self-assigned this Mar 24, 2024
@erikgrinaker erikgrinaker requested a review from a team as a code owner March 24, 2024 01:03
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link

blathers-crl bot commented Mar 24, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@erikgrinaker
Copy link
Contributor Author

I'm not thrilled about the amount of synchronization needed to reap probes, but it gets the job done and isn't too bad I suppose. Open to suggestions on how to simplify.

@erikgrinaker erikgrinaker force-pushed the distsender-circuit-breaker-reap-probes branch from 3ae54c8 to 86d86ae Compare March 24, 2024 19:13
@erikgrinaker erikgrinaker force-pushed the distsender-circuit-breaker-reap-probes branch 2 times, most recently from 3b98315 to c25dd6e Compare March 25, 2024 13:32
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r5, 1 of 1 files at r6, 1 of 1 files at r7, 1 of 1 files at r8, 1 of 1 files at r9, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @erikgrinaker)


pkg/kv/kvclient/kvcoord/dist_sender_circuit_breaker.go line 671 at r5 (raw file):

			err := r.sendProbe(ctx, transport)

			// If the context failed, we're shutting down. Just exit the probe without

"If the context with no timeout failed,"


pkg/kv/kvclient/kvcoord/dist_sender_circuit_breaker.go line 781 at r5 (raw file):

		// either the remote node or the local node has been decommissioned. We
		// rely on RPC circuit breakers to fail fast for these, so there's no point
		// in us probing individual replicas. Stop probing.

"Stop probing and untrip the breaker if it was tripped."


pkg/kv/kvclient/kvcoord/dist_sender_circuit_breaker.go line 282 at r8 (raw file):

					// replicas. But it can happen if we race with a probe launch, or if
					// probes run longer than the GC threshold. The map removal above will
					// prevent new probes being launched (checked by launchProbe).

"being launched for this same ReplicaCircuitBreaker"


pkg/kv/kvclient/kvcoord/dist_sender_circuit_breaker.go line 353 at r8 (raw file):

		cancelFns map[*kvpb.BatchRequest]context.CancelCauseFunc

		// reapProbe, if non-nil, can be called to synchronously cancel a running

Is it important that this cancellation is synchronous? I'd be concerned that this allows a circuit breaker that's either slow to respond to context cancellation, or that's broken entirely, to stall the gc loop. This change would be more obviously correct if the reapProbe function was non-blocking. We could then also call it under lock in ReplicaCircuitBreaker.reapProbe and set mu.reapProbe to nil right after, to ensure it is only called once.

EDIT: I guess you want this to be synchronous so that you can perform the cb.isTripped check after without risk of racing with a probe or double accounting in the ReplicasTripped metric.


pkg/kv/kvclient/kvcoord/dist_sender_circuit_breaker.go line 868 at r8 (raw file):

}

// reapProbe synchronously reaps the currently running probe, if any.

Consider renaming to maybeReapProbe so that it's contract is clearer and it can't be confused wit the reapProbe field.

@erikgrinaker erikgrinaker force-pushed the distsender-circuit-breaker-reap-probes branch from c25dd6e to 9d1f831 Compare March 25, 2024 17:35
Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments addressed.

Since there may still be some uncertainty about whether this is the best approach to solve this problem, I wrote up a GA blocker issue for it (#121030), so we don't have to rush this in before the branch cut. I also split out the semi-unrelated changes to separate PRs that we can merge before the branch cut: #121028 and #121029.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @nvanbenschoten)


pkg/kv/kvclient/kvcoord/dist_sender_circuit_breaker.go line 353 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is it important that this cancellation is synchronous? I'd be concerned that this allows a circuit breaker that's either slow to respond to context cancellation, or that's broken entirely, to stall the gc loop. This change would be more obviously correct if the reapProbe function was non-blocking. We could then also call it under lock in ReplicaCircuitBreaker.reapProbe and set mu.reapProbe to nil right after, to ensure it is only called once.

EDIT: I guess you want this to be synchronous so that you can perform the cb.isTripped check after without risk of racing with a probe or double accounting in the ReplicasTripped metric.

Yes, exactly -- the whole reason for this synchronization is to avoid GC and the probe racing to update the metric, so we need to know that the probe has shut down first (assuming one is even running, which it may or may not be).

@erikgrinaker
Copy link
Contributor Author

Thought of a way to possibly simplify this. Let me submit a new take tomorrow.

This separates the goroutine management from the probe loop logic.

Epic: none
Release note: None
This prevents probe leaks in case they are long-running, and will allow
removing GCed circuit breakers from the tripped metrics.

Epic: none
Release note: None
Otherwise, these will register as tripped forever.

Epic: none
Release note: None
@erikgrinaker
Copy link
Contributor Author

See #121142 for a simpler alternative. I think I prefer it.

@erikgrinaker
Copy link
Contributor Author

Closing in favor of #121142.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kvcoord: fix DistSender circuit breaker tripped metrics leak
3 participants