kvcoord: fix DistSender circuit breaker tripped metrics leak #121142

Merged

erikgrinaker

@erikgrinaker erikgrinaker commented Mar 26, 2024

If a tripped circuit breaker is GCed, the tripped metric will consider it tripped forever. This patch untrips the breaker during GC, taking care to properly shut down and synchronize with any concurrent probes to avoid metrics leaks.

Resolves #121030.
Epic: none
Release note: None
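
To illustrate the leak described above, here is a minimal sketch, not the actual kvcoord code. All names (`trippedGauge`, `breaker`, `gcLeaky`, `gcFixed`) are hypothetical, and the real patch also has to synchronize with concurrent probes, which this sketch leaves out. It only shows why dropping a tripped breaker without resetting it leaves the gauge stuck above zero.

```go
package main

import (
	"fmt"
	"sync"
)

// trippedGauge is a hypothetical stand-in for the "tripped" metric.
type trippedGauge struct {
	mu sync.Mutex
	n  int
}

func (g *trippedGauge) add(delta int) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.n += delta
}

func (g *trippedGauge) value() int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.n
}

// breaker is a hypothetical, heavily simplified per-replica circuit breaker.
type breaker struct {
	mu      sync.Mutex
	tripped bool
	gauge   *trippedGauge
}

// trip marks the breaker tripped and increments the gauge once.
func (b *breaker) trip() {
	b.mu.Lock()
	defer b.mu.Unlock()
	if !b.tripped {
		b.tripped = true
		b.gauge.add(1)
	}
}

// reset untrips the breaker and decrements the gauge once.
func (b *breaker) reset() {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tripped {
		b.tripped = false
		b.gauge.add(-1)
	}
}

// gcLeaky drops the breaker without untripping it, so the gauge never
// gets decremented for it.
func gcLeaky(m map[int]*breaker, id int) {
	delete(m, id)
}

// gcFixed untrips the breaker before dropping it.
func gcFixed(m map[int]*breaker, id int) {
	if b, ok := m[id]; ok {
		b.reset()
		delete(m, id)
	}
}

// trippedAfterGC trips one breaker, GCs it with gc, and returns the gauge.
func trippedAfterGC(gc func(map[int]*breaker, int)) int {
	g := &trippedGauge{}
	m := map[int]*breaker{1: {gauge: g}}
	m[1].trip()
	gc(m, 1)
	return g.value()
}

func main() {
	fmt.Println("leaky GC, tripped gauge:", trippedAfterGC(gcLeaky)) // 1
	fmt.Println("fixed GC, tripped gauge:", trippedAfterGC(gcFixed)) // 0
}
```

With the leaky variant the gauge stays at 1 even though no breaker exists any more; resetting before removal brings it back to 0.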

@erikgrinaker erikgrinaker self-assigned this Mar 26, 2024
@erikgrinaker erikgrinaker requested a review from a team as a code owner March 26, 2024 19:48

blathers-crl bot commented Mar 26, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity

This change is Reviewable


@nvanbenschoten nvanbenschoten left a comment

:lgtm: though it would be nice to have a test that exercises this logic.

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andrewbaptist and @erikgrinaker)


pkg/kv/kvclient/kvcoord/dist_sender_circuit_breaker.go line 361 at r1 (raw file):

	// closedC is closed when the circuit breaker has been GCed. This will shut
	// down a running probe, and prevent new probes from launching.
	closedC chan struct{}

Should we be allocating this channel in newReplicaCircuitBreaker?


pkg/kv/kvclient/kvcoord/dist_sender_circuit_breaker.go line 938 at r1 (raw file):

since it would also use an untripped breaker if it arrived after GC

👍


pkg/kv/kvclient/kvcoord/dist_sender_circuit_breaker.go line 941 at r1 (raw file):

	if r.isClosed() {
		if r.isTripped() {
			r.breaker.Reset()

It's ok for an EventHandler function to call back into Reset, right? This won't cause any deadlocks?

@erikgrinaker erikgrinaker force-pushed the distsender-circuit-breaker-metrics-leak branch from c91274a to 60db867 Compare March 26, 2024 20:42

@erikgrinaker erikgrinaker left a comment

For sure, all of this needs testing. I'm going to add some tomorrow, but will likely need a bit more time to add comprehensive testing.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist and @nvanbenschoten)


pkg/kv/kvclient/kvcoord/dist_sender_circuit_breaker.go line 361 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we be allocating this channel in newReplicaCircuitBreaker?

Yes, forgot to add that.
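
For context on why the allocation matters, here is a small sketch using hypothetical names (`replicaBreaker`, `newReplicaBreaker`, `closePanics`), not the real kvcoord types. It relies only on standard Go channel semantics: a receive from a nil channel never becomes ready, and `close` on a nil channel panics. A breaker built without the constructor therefore can never report itself as closed, and closing it panics.

```go
package main

import "fmt"

// replicaBreaker is a hypothetical, stripped-down breaker that only
// carries the close-signalling channel.
type replicaBreaker struct {
	// closedC is closed when the breaker has been GCed.
	closedC chan struct{}
}

// newReplicaBreaker allocates closedC so it can be closed and observed.
func newReplicaBreaker() *replicaBreaker {
	return &replicaBreaker{closedC: make(chan struct{})}
}

// isClosed reports whether closedC has been closed, without blocking.
// On a nil channel the receive case is never ready, so this is always false.
func (r *replicaBreaker) isClosed() bool {
	select {
	case <-r.closedC:
		return true
	default:
		return false
	}
}

// close signals that the breaker has been GCed.
func (r *replicaBreaker) close() {
	close(r.closedC)
}

// closePanics closes r and reports whether doing so panicked.
func closePanics(r *replicaBreaker) (panicked bool) {
	defer func() { panicked = recover() != nil }()
	r.close()
	return false
}

func main() {
	r := newReplicaBreaker()
	fmt.Println("allocated, closed before GC:", r.isClosed())  // false
	fmt.Println("allocated, close panics:", closePanics(r))    // false
	fmt.Println("allocated, closed after GC:", r.isClosed())   // true

	zero := &replicaBreaker{} // closedC left nil
	fmt.Println("nil channel, close panics:", closePanics(zero)) // true
	fmt.Println("nil channel, closed:", zero.isClosed())         // false
}
```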


pkg/kv/kvclient/kvcoord/dist_sender_circuit_breaker.go line 941 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It's ok for an EventHandler function to call back into Reset, right? This won't cause any deadlocks?

Yes, afaict. OnProbeDone is called without holding the circuit breaker mutex, before the probe is recorded as shut down. This shouldn't be any different from the probe itself calling report(nil).
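
The ordering argument can be sketched as follows. This is an illustrative model with hypothetical names (`breaker`, `finishProbe`, the `onProbeDone` field), not the API of the real circuit breaker package. It assumes, as stated above, that the handler runs before the breaker takes its own mutex to record the probe as shut down. Because Go's `sync.Mutex` is not reentrant, calling the handler while holding `mu` would deadlock as soon as the handler called `Reset`.

```go
package main

import (
	"fmt"
	"sync"
)

// breaker is a hypothetical model of a circuit breaker with a probe-done hook.
type breaker struct {
	mu          sync.Mutex
	tripped     bool
	probing     bool
	onProbeDone func(b *breaker)
}

// Reset untrips the breaker. It acquires mu, so it must not be called by
// code that already holds mu.
func (b *breaker) Reset() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.tripped = false
}

func (b *breaker) isTripped() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.tripped
}

// finishProbe invokes the handler first, without holding mu, and only then
// records the probe as shut down. The handler is free to call Reset.
func (b *breaker) finishProbe() {
	if b.onProbeDone != nil {
		b.onProbeDone(b)
	}
	b.mu.Lock()
	b.probing = false
	b.mu.Unlock()
}

func main() {
	b := &breaker{tripped: true, probing: true}
	// The handler mirrors the pattern under review: untrip on probe exit.
	b.onProbeDone = func(b *breaker) {
		if b.isTripped() {
			b.Reset()
		}
	}
	b.finishProbe()
	fmt.Println("tripped after probe exit:", b.isTripped()) // false
}
```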

@erikgrinaker erikgrinaker force-pushed the distsender-circuit-breaker-metrics-leak branch from 60db867 to 9ef08ce Compare March 26, 2024 21:11

@nvanbenschoten nvanbenschoten left a comment

:lgtm:

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andrewbaptist)

@erikgrinaker erikgrinaker force-pushed the distsender-circuit-breaker-metrics-leak branch from 9ef08ce to 90515f0 Compare March 26, 2024 23:17
@erikgrinaker

I've seen a couple of CI failures in TestTenantLogicCCL_crdb_internal_tenant, unclear if it's related. Doing another run.

@erikgrinaker

CI is passing now, going for it and catching up in the nightlies.

bors r+

@erikgrinaker erikgrinaker force-pushed the distsender-circuit-breaker-metrics-leak branch from 90515f0 to 91e7278 Compare March 27, 2024 00:49
@craig

craig bot commented Mar 27, 2024

Canceled.

@erikgrinaker

bors r+

@craig craig bot merged commit 6e8d2cd into cockroachdb:master Mar 27, 2024
22 checks passed