Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: leaked replica mu #106568

Closed
tbg opened this issue Jul 11, 2023 · 2 comments
Closed

kvserver: leaked replica mu #106568

tbg opened this issue Jul 11, 2023 · 2 comments
Assignees
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team

Comments

@tbg
Copy link
Member

tbg commented Jul 11, 2023

Describe the problem

In two instances, we saw a leaked replica mutex:

In both cases, the goroutine acquiring the mutex seems to have exited without releasing it. The second instance underwent an extensive attempt to reproduce the problem (~8k runs over a week) but the issue did not reoccur.

#106254 has a "deadlock detector" that also prints the stack trace of the mutex acquisition. However, it is likely too expensive to be always-on, especially for a hot mutex like Replica.mu.

Another angle could be #105366, i.e. proving through static analysis that all acquisitions are defer-unlocked (and thus in a deadlock scenario, the lock holder would still be around).

In my view we ought to only use a provably safe unlock pattern, so option 2 seems appealing, especially given the rarity of the bug.

Jira issue: CRDB-29618

@tbg tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team labels Jul 11, 2023
@tbg tbg added the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jul 11, 2023
@blathers-crl
Copy link

blathers-crl bot commented Jul 11, 2023

Hi @tbg, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@nicktrav nicktrav added the branch-master Failures and bugs on the master branch. label Jul 11, 2023
tbg added a commit to tbg/cockroach that referenced this issue Jul 12, 2023
This one was possible hit. Interestingly, it would be "accidentally" fixed by
cockroachdb#104657 as well, since after that
PR we would never see a nil `status` here.

Touches cockroachdb#106568
craig bot pushed a commit that referenced this issue Jul 13, 2023
106574: kvserver: avoid leaked replica mutex during delegated snapshot r=erikgrinaker a=tbg

Prompted by #106568, I went
through all callers to `repl.mu.Lock` and `repl.mu.RLock` trying to find places
where we're leaking the mutex on certain return paths.

I found some, including one that at least seems to possibly explain what we saw in #106568.

I also looked into the possibility of panics under a held lock being recovered from
at higher levels in the stack, in particular this code:

https://github.com/cockroachdb/cockroach/blob/81569be4aeb61620d2fd51c41b416f7d0986698e/pkg/sql/colexecerror/error.go#L27-L30

However, it seems sufficiently restricted to recovering only from errors it knows to
be safe to recover from, and in particular it doesn't look like it will ever recover from
any panic in kvserver. I will say that this looks a little brittle, though I don't see a much
better way of doing what this code is trying to achieve.

Release note: None
Epic: None

Co-authored-by: Tobias Grieger <[email protected]>
tbg added a commit to tbg/cockroach that referenced this issue Jul 14, 2023
This one was possible hit. Interestingly, it would be "accidentally" fixed by
cockroachdb#104657 as well, since after that
PR we would never see a nil `status` here.

Touches cockroachdb#106568
tbg added a commit to tbg/cockroach that referenced this issue Jul 14, 2023
This one was possible hit. Interestingly, it would be "accidentally" fixed by
cockroachdb#104657 as well, since after that
PR we would never see a nil `status` here.

Touches cockroachdb#106568
@tbg
Copy link
Member Author

tbg commented Jul 17, 2023

Fixed, presumably, in #106574.

@tbg tbg closed this as completed Jul 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team
Projects
None yet
Development

No branches or pull requests

2 participants