Add support for lock debugging #18796
Conversation
Being able to easily identify what lock has been allocated to a given Libpod object is only somewhat useful for debugging lock issues, but it's trivial to expose and I don't see any harm in doing so. Signed-off-by: Matt Heon <[email protected]>
This is a nice quality-of-life change that should help to debug situations where someone runs out of locks (usually when a bunch of unused volumes accumulate). Signed-off-by: Matt Heon <[email protected]>
This is a general debug command that identifies any lock conflicts that could lead to a deadlock. It's only intended for Libpod developers (while it does tell you if you need to run `podman system renumber`, you should never have to do that anyways, and the next commit will include a lot more technical info in the output that no one except a Libpod dev will want). Hence, hidden command, and only implemented for the local driver (recommend just running it by SSHing into a `podman machine` VM in the unlikely case it's needed by remote Podman). These conflicts should normally never happen, but having a command like this is useful for debugging deadlock conditions when they do occur. Signed-off-by: Matt Heon <[email protected]>
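To make concrete what a "lock conflict" means here, the sketch below (not the actual Libpod code; `lockHolder` and `findConflicts` are made-up names) builds the kind of mapping the check relies on: lock number to the objects that claim it, where any number claimed by more than one object is a conflict.

```go
package main

import "fmt"

// lockHolder pairs a Libpod object's type with its ID. Purely
// illustrative; the real report types live in libpod.
type lockHolder struct {
	kind string // "container", "pod", or "volume"
	id   string
}

// findConflicts returns only the lock numbers claimed by more than one
// object, which is the situation that can produce a deadlock and that
// `podman system renumber` exists to repair.
func findConflicts(holders map[uint32][]lockHolder) map[uint32][]lockHolder {
	conflicts := make(map[uint32][]lockHolder)
	for lockNum, objs := range holders {
		if len(objs) > 1 {
			conflicts[lockNum] = objs
		}
	}
	return conflicts
}

func main() {
	holders := map[uint32][]lockHolder{
		3: {{"container", "abc123"}},
		7: {{"container", "def456"}, {"volume", "myvol"}},
	}
	for lockNum, objs := range findConflicts(holders) {
		fmt.Printf("lock %d is shared by %d objects: %v\n", lockNum, len(objs), objs)
	}
}
```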
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: mheon. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Sample output from the new command is below. If there was an actual conflict, it'd be listed, and the output would note that `podman system renumber` is recommended.
$ id -u
1000
$ ./bin/podman system locks
No lock conflicts have been detected, system safe from deadlocks.
$ ./bin/podman info | grep -i FreeLocks
freeLocks: 1943
$ for i in {1..10000}; do podman volume create vol$i; done
[...]
Error: allocating lock for new volume: allocation failed; exceeded num_locks (2048)
$ ./bin/podman info | grep -i FreeLocks
freeLocks: 0
The branch was force-pushed from 451a962 to 40eed7c.
To debug a deadlock, we really want to know what lock is actually locked, so we can figure out what is using that lock. This PR adds support for this, using trylock to check if every lock on the system is free or in use. Will really need to be run a few times in quick succession to verify that it's not a transient lock and it's actually stuck, but that's not really a big deal. Signed-off-by: Matt Heon <[email protected]>
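The real check runs from C against the process-shared locks in SHM, but the trylock idea can be sketched in plain Go with `sync.Mutex.TryLock` (Go 1.18+). `probeLocks` below is a hypothetical helper, not Podman code: a successful trylock means the lock was free, a failed one means it is currently held.

```go
package main

import (
	"fmt"
	"sync"
)

// probeLocks reports the indexes of locks that are currently held. A
// successful TryLock means the lock was free, so it is released again
// immediately; a failed TryLock means someone else is holding it.
func probeLocks(locks []*sync.Mutex) (held []int) {
	for i, l := range locks {
		if l.TryLock() {
			l.Unlock()
			continue
		}
		held = append(held, i)
	}
	return held
}

func main() {
	locks := []*sync.Mutex{{}, {}, {}}
	locks[1].Lock() // simulate a lock stuck in the locked state

	// As the commit message notes, a single probe can also catch a lock
	// that is only transiently held, so running the check repeatedly is
	// what distinguishes "briefly busy" from "stuck".
	fmt.Println("locks currently held:", probeLocks(locks))
}
```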
cmd/podman/system/locks.go
Outdated
if len(report.LockConflicts) > 0 {
        fmt.Printf("\nLock conflicts have been detected. Recommend immediate use of `podman system renumber` to resolve.\n\n")
} else {
        fmt.Printf("\nNo lock conflicts have been detected, system safe from deadlocks.\n\n")
As much as I understand this wording, I would not claim "system safe from deadlocks"; this only checks for SHM lock conflicts. We can still have ABBA deadlocks, or any other deadlock between different kinds of locks such as mutexes, Go channels, WaitGroups, or even c/storage locks.
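To illustrate the reviewer's point with a standalone sketch (nothing here is Podman code): two goroutines that take the same two locks in opposite orders can deadlock even though no shared-memory lock is double-allocated, so a conflict-free report does not prove the whole system deadlock-free.

```go
package main

import "sync"

var a, b sync.Mutex

func main() {
	var wg sync.WaitGroup
	wg.Add(2)

	// Goroutine 1 takes A then B.
	go func() {
		defer wg.Done()
		a.Lock()
		defer a.Unlock()
		b.Lock()
		defer b.Unlock()
	}()

	// Goroutine 2 takes B then A: the classic ABBA ordering violation.
	go func() {
		defer wg.Done()
		b.Lock()
		defer b.Unlock()
		a.Lock()
		defer a.Unlock()
	}()

	// Depending on scheduling this either finishes instantly or hangs:
	// if each goroutine grabs its first lock before either grabs its
	// second, both block forever and the Go runtime reports
	// "all goroutines are asleep - deadlock!". No SHM lock is
	// double-allocated here, so a conflict check could not see it.
	wg.Wait()
}
```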
libpod/lock/shm/shm_lock.c
Outdated
for (i = 0; i < shm->num_bitmaps; i++) {
        // Short-circuit to catch fully-empty bitmaps quick.
        if (shm->locks[i].bitmap == 0) {
                free_locks += 32;
Should this be s/32/sizeof(bitmap_t)/? I mean, it works, but this would make it clearer where 32 is coming from.
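For context, the 32 is the number of bits in each bitmap word, one bit per lock. A rough Go rendering of the counting logic (a sketch, not the C code in shm_lock.c) makes that relationship explicit:

```go
package main

import (
	"fmt"
	"math/bits"
)

// bitsPerBitmap mirrors the C code's 32: each bitmap word tracks 32
// locks, one bit per lock, with a set bit meaning "allocated".
const bitsPerBitmap = 32

// countFreeLocks counts the unset bits across all bitmap words, with
// the same short-circuit for fully empty words as the loop under review.
func countFreeLocks(bitmaps []uint32) uint32 {
	var free uint32
	for _, bm := range bitmaps {
		if bm == 0 {
			free += bitsPerBitmap
			continue
		}
		free += bitsPerBitmap - uint32(bits.OnesCount32(bm))
	}
	return free
}

func main() {
	// 0xF has 4 bits set, so that word has 28 free locks; the zero word
	// contributes a full 32, for 60 in total.
	fmt.Println(countFreeLocks([]uint32{0x0, 0xF}))
}
```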
libpod/lock/shm/shm_lock.c
Outdated
                count++;
        }

        free_locks += 32 - count;
same here
libpod/runtime.go
Outdated
locksArr, ok := locksInUse[lockNum]
if ok {
        locksInUse[lockNum] = append(locksArr, ctrString)
} else {
        locksInUse[lockNum] = []string{ctrString}
}
This can really just be simplified to `locksInUse[lockNum] = append(locksInUse[lockNum], ctrString)`; same for the other two branches for pods/volumes.
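The simplification works because indexing a Go map with a missing key yields the zero value, a nil slice, and `append` on a nil slice allocates a fresh one. A tiny standalone check (illustrative names and values only, not the Libpod code):

```go
package main

import "fmt"

func main() {
	locksInUse := make(map[uint32][]string)

	// No presence check needed: a missing key yields a nil slice, and
	// append on a nil slice allocates a new backing array.
	locksInUse[42] = append(locksInUse[42], "container abc123")
	locksInUse[42] = append(locksInUse[42], "volume myvol")

	fmt.Println(locksInUse[42]) // [container abc123 volume myvol]
}
```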
The inspect format for `.LockNumber` needed to be documented. Signed-off-by: Matt Heon <[email protected]>
Restarted two tests that looked like flakes. Green otherwise. @containers/podman-maintainers PTAL
LGTM
/lgtm
Deadlocks are among the most annoying parts of Podman to debug (second only to the intricacies of the REST attach endpoints, in my opinion). Part of this is that there wasn't really a good way to tell what was going on: all our locks are in-memory, so an strace just shows `futex` calls on inscrutable memory addresses (which blend into the `futex` calls that the Go runtime is doing on its own, making things extra fun). This PR attempts to remedy this in 3 ways:
1. Lock numbers are now shown in `podman inspect` - allows for quick spot checks to verify that numbers assigned look sane
2. The free lock count is now shown in `podman info` - so we can easily see if the system has run entirely out of locks (usually because folks forget to prune volumes)
3. A new command, `podman system locks`, to check for potential deadlocks due to duplicate lock assignment and identify any locks that are currently in use. This is hidden because the output isn't really meant for users, but for developers attempting to debug a deadlock.
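Putting the three pieces together, a debugging session might look roughly like the transcript below; the commands come from this PR, but the container name and the specific values shown (lock number, free-lock count) are illustrative.
$ podman inspect --format '{{.LockNumber}}' mycontainer
14
$ podman info | grep -i freelocks
freeLocks: 2035
$ podman system locks
No lock conflicts have been detected, system safe from deadlocks.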