Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for lock debugging #18796

Merged
merged 5 commits into from
Jun 7, 2023

Conversation

mheon
Copy link
Member

@mheon mheon commented Jun 5, 2023

Deadlocks are among the most annoying parts of Podman to debug (second only to the intricacies of the REST attach endpoints, in my opinion). Part of this is that there wasn't really a good way to tell what was going on - all our locks are in-memory, so an strace just shows futex calls on inscrutable memory addresses (which blend into the futex calls that the Go runtime is doing on its own, making things extra fun).

This PR attempts to remedy this in 3 ways:

  • Add the number of the lock used by a container/pod/volume to the output of podman inspect - allows for quick spot checks to verify that numbers assigned look sane
  • Add the number of available locks to podman info - so we can easily see if the system has run entirely out of locks (usually because folks forget to prune volumes)
  • Add a new, hidden debug command, podman system locks, to check for potential deadlocks due to duplicate lock assignment and identify any locks that are currently in use. This is hidden because the output isn't really meant for users, but for developers attempting to debug a deadlock.
The number of free locks available is now displayed in `podman info` to help debug lock exhaustion scenarios.

mheon added 3 commits June 5, 2023 12:28
Being able to easily identify what lock has been allocated to a
given Libpod object is only somewhat useful for debugging lock
issues, but it's trivial to expose and I don't see any harm in
doing so.

Signed-off-by: Matt Heon <[email protected]>
This is a nice quality-of-life change that should help to debug
situations where someone runs out of locks (usually when a bunch
of unused volumes accumulate).

Signed-off-by: Matt Heon <[email protected]>
This is a general debug command that identifies any lock
conflicts that could lead to a deadlock. It's only intended for
Libpod developers (while it does tell you if you need to run
`podman system renumber`, you should never have to do that
anyways, and the next commit will include a lot more technical
info in the output that no one except a Libpod dev will want).
Hence, hidden command, and only implemented for the local driver
(recommend just running it by SSHing into a `podman machine` VM
in the unlikely case it's needed by remote Podman).

These conflicts should normally never happen, but having a
command like this is useful for debugging deadlock conditions
when they do occur.

Signed-off-by: Matt Heon <[email protected]>
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 5, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mheon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 5, 2023
@mheon
Copy link
Member Author

mheon commented Jun 5, 2023

Sample output from podman system locks on a system that has nothing wrong, but is creating/removing a lot of containers:

No lock conflicts have been detected, system safe from deadlocks.

Lock 0 is presently being held

If there was an actual conflict, it's be listed, and the output would note that podman system renumber should be used. And any additional locks being held would be listed below the Lock 0 line.

@robbmanes
Copy link
Contributor

😍❤️

$ id -u
1000

$ ./bin/podman system locks
No lock conflicts have been detected, system safe from deadlocks.

$ ./bin/podman info | grep -i FreeLocks
  freeLocks: 1943

$ for i in {1..10000}; do podman volume create vol$i; done
[...]
Error: allocating lock for new volume: allocation failed; exceeded num_locks (2048)

$ ./bin/podman info | grep -i FreeLocks
  freeLocks: 0

@mheon mheon force-pushed the lock_debugging branch 2 times, most recently from 451a962 to 40eed7c Compare June 5, 2023 20:52
To debug a deadlock, we really want to know what lock is actually
locked, so we can figure out what is using that lock. This PR
adds support for this, using trylock to check if every lock on
the system is free or in use. Will really need to be run a few
times in quick succession to verify that it's not a transient
lock and it's actually stuck, but that's not really a big deal.

Signed-off-by: Matt Heon <[email protected]>
@mheon mheon force-pushed the lock_debugging branch from 40eed7c to 4fda793 Compare June 5, 2023 23:34
if len(report.LockConflicts) > 0 {
fmt.Printf("\nLock conflicts have been detected. Recommend immediate use of `podman system renumber` to resolve.\n\n")
} else {
fmt.Printf("\nNo lock conflicts have been detected, system safe from deadlocks.\n\n")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As much as I understand this wording I would not claim system safe from deadlocks, this only checks for shm lock conflicts. We can still have ABBA deadlocks or any other deadlock between different kind of locks such as mutex, go channels, WaitGroups or even c/storage locks.

for (i = 0; i < shm->num_bitmaps; i++) {
// Short-circuit to catch fully-empty bitmaps quick.
if (shm->locks[i].bitmap == 0) {
free_locks += 32;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be s/32/sizeof(bitmap_t)/? I mean it works but this would make it more clear where 32 is coming from.

count++;
}

free_locks += 32 - count;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Comment on lines 1209 to 1214
locksArr, ok := locksInUse[lockNum]
if ok {
locksInUse[lockNum] = append(locksArr, ctrString)
} else {
locksInUse[lockNum] = []string{ctrString}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can really just be simplified to locksInUse[lockNum] = append(locksInUse[lockNum], ctrString), same for the other two branches for pods/volumes

test/e2e/info_test.go Show resolved Hide resolved
The inspect format for `.LockNumber` needed to be documented.

Signed-off-by: Matt Heon <[email protected]>
@mheon mheon force-pushed the lock_debugging branch from 987bd66 to 944673c Compare June 6, 2023 15:05
@mheon
Copy link
Member Author

mheon commented Jun 6, 2023

Restarted two tests that looked like flakes. Green otherwise.

@containers/podman-maintainers PTAL

Copy link
Member

@Luap99 Luap99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rhatdan
Copy link
Member

rhatdan commented Jun 7, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 7, 2023
@openshift-merge-robot openshift-merge-robot merged commit 76f4571 into containers:main Jun 7, 2023
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 6, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 6, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. release-note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants