kvserver: leaked replica mu #106568
Labels
branch-master: Failures and bugs on the master branch.
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
release-blocker: Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
T-kv: KV Team

Comments
tbg added the C-bug and T-kv labels on Jul 11, 2023.
tbg added the release-blocker label on Jul 11, 2023.
Hi @tbg, please add branch-* labels to identify which branch(es) this release-blocker affects. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
tbg added a commit to tbg/cockroach that referenced this issue on Jul 12, 2023:

This one was a possible hit. Interestingly, it would be "accidentally" fixed by cockroachdb#104657 as well, since after that PR we would never see a nil `status` here. Touches cockroachdb#106568
craig bot pushed a commit that referenced this issue on Jul 13, 2023:

106574: kvserver: avoid leaked replica mutex during delegated snapshot r=erikgrinaker a=tbg

Prompted by #106568, I went through all callers to `repl.mu.Lock` and `repl.mu.RLock` trying to find places where we're leaking the mutex on certain return paths. I found some, including one that at least seems to possibly explain what we saw in #106568.

I also looked into the possibility of panics under a held lock being recovered from at higher levels in the stack, in particular this code: https://github.com/cockroachdb/cockroach/blob/81569be4aeb61620d2fd51c41b416f7d0986698e/pkg/sql/colexecerror/error.go#L27-L30

However, it seems sufficiently restricted to recovering only from errors it knows to be safe to recover from, and in particular it doesn't look like it will ever recover from any panic in kvserver. I will say that this looks a little brittle, though I don't see a much better way of doing what this code is trying to achieve.

Release note: None
Epic: None

Co-authored-by: Tobias Grieger <[email protected]>
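The return-path leak described in that commit message is easiest to see in a small sketch. The following is a minimal, hypothetical Go example (not the actual CockroachDB code; `replica`, `status`, and the method names are made up) showing how a single early return without an unlock leaves the mutex held forever, and how the defer pattern removes the problem:

```go
// Minimal sketch of a mutex leaked on an early return path (hypothetical code).
package main

import (
	"errors"
	"fmt"
	"sync"
)

type replica struct {
	mu struct {
		sync.Mutex
		status *string // hypothetical field protected by the mutex
	}
}

// buggySnapshot returns early when status is nil but forgets to unlock,
// so r.mu stays held forever and the next Lock() blocks indefinitely.
func (r *replica) buggySnapshot() (string, error) {
	r.mu.Lock()
	if r.mu.status == nil {
		return "", errors.New("no status") // BUG: mutex still held on this path
	}
	s := *r.mu.status
	r.mu.Unlock()
	return s, nil
}

// fixedSnapshot uses the defer pattern, so every return path unlocks.
func (r *replica) fixedSnapshot() (string, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.mu.status == nil {
		return "", errors.New("no status")
	}
	return *r.mu.status, nil
}

func main() {
	var r replica
	if _, err := r.fixedSnapshot(); err != nil {
		fmt.Println("fixed path handles nil status and releases the lock:", err)
	}
}
```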
tbg added two more commits to tbg/cockroach that referenced this issue on Jul 14, 2023, with the same commit message as above.
Fixed, presumably, in #106574.
Describe the problem
In two instances, we saw a leaked replica mutex:
In both cases, the goroutine that acquired the mutex seems to have exited without releasing it. The second instance underwent an extensive attempt to reproduce the problem (~8k runs over a week), but the issue did not reoccur.
#106254 has a "deadlock detector" that also prints the stack trace of the mutex acquisition. However, it is likely too expensive to be always-on, especially for a hot mutex like `Replica.mu`.
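To make that trade-off concrete, here is an illustrative-only Go sketch of the idea behind such a detector: a wrapper mutex that captures the acquisition stack so a watchdog can report the holder when a later acquisition stalls. This is not the #106254 implementation; `tracedMutex` and the timeout are hypothetical, and the `runtime.Stack` call on every `Lock()` is exactly the kind of overhead that makes this too expensive to leave on for a hot mutex:

```go
// Illustrative sketch only: a mutex that records who acquired it last.
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

type tracedMutex struct {
	mu    sync.Mutex
	stack []byte // stack captured at the most recent Lock()
}

func (m *tracedMutex) Lock() {
	m.mu.Lock()
	buf := make([]byte, 4096)
	m.stack = buf[:runtime.Stack(buf, false)] // expensive: why this can't be always-on
}

func (m *tracedMutex) Unlock() {
	m.stack = nil
	m.mu.Unlock()
}

func main() {
	var m tracedMutex
	m.Lock() // simulate a leaked lock: never unlocked

	// Watchdog: if a new acquisition stalls, dump the holder's acquisition stack.
	done := make(chan struct{})
	go func() { m.Lock(); close(done) }()
	select {
	case <-done:
	case <-time.After(100 * time.Millisecond):
		fmt.Printf("possible leaked mutex; holder acquired it at:\n%s\n", m.stack)
	}
}
```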
Another angle could be #105366, i.e. proving through static analysis that all acquisitions are defer-unlocked (and thus in a deadlock scenario, the lock holder would still be around).
In my view we ought to use only a provably safe unlock pattern, so the second option (static analysis) seems appealing, especially given the rarity of the bug.
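For illustration, a minimal sketch of the kind of pattern such a static check could accept: either a `defer Unlock()` immediately after every `Lock()`, or confining the critical section to a closure-taking helper so the pairing is structural. The `withLock` helper and the field names here are hypothetical, not CockroachDB APIs:

```go
// Sketch of a statically checkable locking discipline (hypothetical code).
package main

import "sync"

type replica struct {
	mu struct {
		sync.Mutex
		epoch int // hypothetical field protected by the mutex
	}
}

// withLock confines the critical section to a closure; the helper is the only
// place that locks, and its Lock/Unlock pair is trivially defer-matched.
func (r *replica) withLock(f func()) {
	r.mu.Lock()
	defer r.mu.Unlock()
	f()
}

func (r *replica) bumpEpoch() (e int) {
	r.withLock(func() {
		r.mu.epoch++
		e = r.mu.epoch
	})
	return e
}

func main() {
	var r replica
	_ = r.bumpEpoch()
}
```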
Jira issue: CRDB-29618