Deadlock between multiple container creates with similar volumes #20313

Closed
UniversalSuperBox opened this issue Oct 9, 2023 · 14 comments · Fixed by #20329
Labels: kind/bug · locked - please file new issue/PR

UniversalSuperBox commented Oct 9, 2023

Don't be fooled by this issue description! The actual issue was setupContainer deadlocking with itself across multiple simultaneous container creates that used the same volumes in different orders on the command line.

Issue Description

Podman can deadlock when two (or more) containers are being set up with the same volume at the same time. When this condition is triggered, most podman commands hang, including ps, info, volumes, and basically anything else of any use. The deadlock can be cleared by ending the hung podman processes, but of course this is not an ideal solution.

For example, podman ps hangs with the following last lines:

DEBU[0000] Using OCI runtime "/usr/bin/crun"            
INFO[0000] Setting parallel job count to 13

(full log of that run can be found at https://gist.github.com/UniversalSuperBox/162ef8e6c76bee4b8d3cbae1709efa9e)

Upstream logging indicates that the first container is stuck in the following loop:

// Request a mount of all named volumes
for _, v := range c.config.NamedVolumes {
	vol, err := c.mountNamedVolume(v, mountPoint)
	if err != nil {
		return "", err
	}
	defer func() {
		if deferredErr == nil {
			return
		}
		vol.lock.Lock()
		if err := vol.unmount(false); err != nil {
			logrus.Errorf("Unmounting volume %s after error mounting container %s: %v", vol.Name(), c.ID(), err)
		}
		vol.lock.Unlock()
	}()

Meanwhile, the second container is stuck in the following loop. I know this from my added logging, where AAAA done setting up shm comes directly before the indicated lines and AAAA locked named volumes comes directly after:

// Lock all named volumes we are adding ourself to, to ensure we can't
// use a volume being removed.
volsLocked := make(map[string]bool)
for _, namedVol := range ctrNamedVolumes {
	toLock := namedVol
	// Ensure that we don't double-lock a named volume that is used
	// more than once.
	if volsLocked[namedVol.Name()] {
		continue
	}
	volsLocked[namedVol.Name()] = true
	toLock.lock.Lock()
	defer toLock.lock.Unlock()
}

The latter code has been around since e563f41 in 2019, and while it did get a small update in 0f637e0 to prevent this deadlock in a single-container scenario, there is no accounting for multiple podman binaries (or maybe even multiple requests to the same daemon binary?) trying to take the lock at the same time.

This seems like a classic deadlock. The first container is probably waiting on the second container to release a lock on one of the named volumes, and the second is waiting on the first. My first reaction is "only one process should be able to mutate any named volume at a time," but I'm hoping that someone with more experience with the code has a better idea.
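
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch (plain Go, not Podman code) of the lock-ordering ("ABBA") deadlock suspected here: two goroutines take the same two locks in opposite orders.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var vol1, vol2 sync.Mutex // stand-ins for two named-volume locks

	// "Container A": locks vol1, then vol2.
	go func() {
		vol1.Lock()
		time.Sleep(100 * time.Millisecond) // widen the race window
		vol2.Lock()                        // blocks forever once B holds vol2
		defer vol1.Unlock()
		defer vol2.Unlock()
	}()

	// "Container B": locks vol2, then vol1 -- the opposite order.
	go func() {
		vol2.Lock()
		time.Sleep(100 * time.Millisecond)
		vol1.Lock() // blocks forever once A holds vol1
		defer vol2.Unlock()
		defer vol1.Unlock()
	}()

	time.Sleep(time.Second)
	fmt.Println("both goroutines are almost certainly stuck waiting on each other by now")
}
```

Since Podman's object locks are shared across processes, the same pattern can in principle play out between two separate podman invocations, which matches the "multiple podman binaries" suspicion above.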

I know I'm using podman 4.6.0, but I don't believe any changes between 4.6.0 and the current 4.7.1 release would mitigate the issue.

Steps to reproduce the issue

  1. Run (that is, podman run) two containers that use the same named volume at the same time
  2. Repeat step 1 until the stars align

Describe the results you received

Neither container ends up running. With debug logging (and a bit more logging added by yours truly), one container's log ends at:

Oct 08 13:30:08 [hostname] scheduled-repo-updates[1038557]: time="2023-10-08T13:30:08-07:00" level=debug msg="Mounted container \"f28c1ebae6bebfa844ce9a02b990f176e26fb59a99a1c021b1fd10fd3f9828c7\" at \"/opt/aptly_containers/.local/share/containers/storage/vfs/dir/73386034cc0d7c37ac203c2ccb17303ef46839fd3b558d81015cdaefeadb0423\""
Oct 08 13:30:08 [hostname] scheduled-repo-updates[1038557]: time="2023-10-08T13:30:08-07:00" level=debug msg="Going to mount named volume aptly_bulk"

(full log is SideA.log in this gist)

The other container's log ends:

Oct 08 13:35:00 [hostname] britney2[1039485]: time="2023-10-08T13:35:00-07:00" level=debug msg="Container \"ba9264d67f0c3d7fb91c3facdf510d9ce05a718ec9bb770b22fc72c197289369\" has run directory \"/tmp/containers-user-2002/containers/vfs-containers/ba9264d67f0c3d7fb91c3facdf510d9ce05a718ec9bb770b22fc72c197289369/userdata\""
[... more of my own logs...]
Oct 08 13:35:00 [hostname] britney2[1039485]: time="2023-10-08T13:35:00-07:00" level=debug msg="AAAA done setting up logging"
Oct 08 13:35:00 [hostname] britney2[1039485]: time="2023-10-08T13:35:00-07:00" level=debug msg="AAAA done setting up shm"

(full log is SideB.log in this gist)

Describe the results you expected

Both containers run to completion.

podman info output

$ podman info
host:
  arch: amd64
  buildahVersion: 1.31.0
  cgroupControllers: []
  cgroupManager: cgroupfs
  cgroupVersion: v1
  conmon:
    package: conmon_100:2.1.7-1_amd64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.7, commit: f633919178f6c8ee4fb41b848a056ec33f8d707d'
  cpuUtilization:
    idlePercent: 98.81
    systemPercent: 0.39
    userPercent: 0.8
  cpus: 4
  databaseBackend: boltdb
  distribution:
    codename: focal
    distribution: ubuntu
    version: "20.04"
  eventLogger: file
  freeLocks: 2005
  hostname: [hostname]
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 2002
      size: 1
    - container_id: 1
      host_id: 362144
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 2002
      size: 1
    - container_id: 1
      host_id: 362144
      size: 65536
  kernel: 5.4.0-163-generic
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 24980148224
  memTotal: 67426156544
  networkBackend: cni
  networkBackendInfo:
    backend: cni
    dns: {}
    package: containernetworking-plugins_0.8.5-1_amd64
    path: /usr/lib/cni
  ociRuntime:
    name: crun
    package: crun_100:1.8.6-1_amd64
    path: /usr/bin/crun
    version: |-
      crun version 1.8.5
      commit: b6f80f766c9a89eb7b1440c0a70ab287434b17ed
      rundir: /run/user/2002/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  pasta:
    executable: ""
    package: ""
    version: ""
  remoteSocket:
    exists: true
    path: /run/user/2002/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_AUDIT_WRITE,CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_MKNOD,CAP_NET_BIND_SERVICE,CAP_NET_RAW,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns_0.4.3-1_amd64
    version: |-
      slirp4netns version 0.4.3
      commit: 2244b9b6461afeccad1678fac3d6e478c28b4ad6
  swapFree: 8589930496
  swapTotal: 8589930496
  uptime: 306h 54m 14.00s (Approximately 12.75 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  docker.io:
    Blocked: false
    Insecure: false
    Location: docker.io
    MirrorByDigestOnly: false
    Mirrors:
    - Insecure: false
      Location: [elided]
      PullFromMirror: ""
    - Insecure: false
      Location: [elided]
      PullFromMirror: ""
    Prefix: docker.io
    PullFromMirror: ""
  search:
  - docker.io
store:
  configFile: /opt/aptly_containers/.config/containers/storage.conf
  containerStore:
    number: 11
    paused: 0
    running: 3
    stopped: 8
  graphDriverName: vfs
  graphOptions: {}
  graphRoot: /opt/aptly_containers/.local/share/containers/storage
  graphRootAllocated: 2162308448256
  graphRootUsed: 421985132544
  graphStatus: {}
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 22
  runRoot: /tmp/containers-user-2002/containers
  transientStore: false
  volumePath: /opt/aptly_containers/.local/share/containers/storage/volumes
version:
  APIVersion: 4.6.0
  Built: 0
  BuiltTime: Wed Dec 31 16:00:00 1969
  GitCommit: ""
  GoVersion: go1.21.1
  Os: linux
  OsArch: linux/amd64
  Version: 4.6.0

Podman in a container

No

Privileged Or Rootless

Rootless

Upstream Latest Release

No

Additional environment details

No response

Additional information

No response

UniversalSuperBox added the kind/bug label Oct 9, 2023
rhatdan (Member) commented Oct 9, 2023

@mheon PTAL

mheon (Member) commented Oct 9, 2023

This is strange. mountNamedVolume locks one volume at a time, releasing the lock immediately after mounting, while container creation locks every volume the container is attached to. I would not expect a deadlock in this scenario: mountNamedVolume can't hold more than one lock at a time, so it will finish mounting a volume, unlock, try to lock the next volume, and hang until container creation is done, and creation should then grab the lock that mountNamedVolume was holding and proceed without issue. We've seen similar issues before around pod removal, but only when one process locked every container and the other locked more than one container (there, it was an ordering issue: one process locks A first then tries B, the other locks B then tries A). I can't see that happening here.

@vrothberg @Luap99 Mind taking a look here, in case my brain is failing me?

UniversalSuperBox (Author) commented

It’s always possible that I’ve misidentified the cause of the hang, but it does still hang somewhere near the listed code. I can add further logging if you like or run a further debug build, but catching this problem is maddeningly rare.

Spitballing: does 'defer' cause all of the deferred functions to be called upon leaving the nearest closure? Does each for-loop iteration count as leaving the closure, or do all the deferred funcs run as quickly as possible after leaving the loop (in separate threads? Eh? Eh?)
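
To the defer question: deferred functions run when the surrounding function returns, not at the end of a loop iteration or block, and they run on the same goroutine in last-in-first-out order. A quick illustration of plain Go semantics (nothing Podman-specific):

```go
package main

import "fmt"

func demo() {
	for i := 0; i < 3; i++ {
		defer fmt.Println("deferred", i) // queued here, but not run at the end of the iteration
		fmt.Println("iteration", i)
	}
	fmt.Println("loop done")
	// All three deferred calls run here, after the function body finishes,
	// in reverse order: 2, 1, 0.
}

func main() { demo() }
```

So a defer toLock.lock.Unlock() inside the loop keeps every acquired lock held until the whole function returns, which is why the acquisition order across concurrent creates matters.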

vrothberg (Member) commented

Thanks for reporting, @UniversalSuperBox!

Can you share the stack traces of the deadlocked processes? You can get them by sending SIGABRT to the podman process. This will dump the stack trace.

I currently do not see how an ABBA deadlock can happen between the two. But I am on my first mug of coffee and may be missing something. Stack traces will help prevent speculation.

Luap99 (Member) commented Oct 10, 2023

One big problem, even if it isn't what you see in your logs, is that we can definitely deadlock in setupContainer() when locking the volumes. It does a simple loop over an (AFAICT unsorted!) array while locking each volume; I think the order is whatever was given on the cli. So when two containers are created at the same time with a different order of volumes, this can deadlock if both processes end up in this loop at the same time. IMO, locking like this in a loop is, with rare exceptions, never a good idea, as it easily ends up deadlocking.

However, I think there is something else going on in your case, so the stack traces of both processes should help to see exactly where the hang is.
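
For reference, the usual mitigation for this class of bug is to acquire the locks in one canonical order (for example, sorted by volume name) so that two processes can never hold them in opposite orders. This is only a hedged sketch of that idea with made-up types, not the fix that was ultimately merged (which removes the locks instead, see below):

```go
package main

import (
	"sort"
	"sync"
)

// volume is a minimal stand-in for a libpod named volume with its own lock;
// the real type, fields, and lock implementation differ.
type volume struct {
	name string
	lock sync.Mutex
}

// lockVolumesInOrder locks each distinct volume exactly once, in sorted-name
// order, and returns a function that releases them all.
func lockVolumesInOrder(vols []*volume) (unlock func()) {
	// Deduplicate (a volume may be used by the container more than once)
	// and collect the names for sorting.
	byName := make(map[string]*volume)
	for _, v := range vols {
		byName[v.name] = v
	}
	names := make([]string, 0, len(byName))
	for name := range byName {
		names = append(names, name)
	}
	sort.Strings(names) // the canonical acquisition order

	locked := make([]*volume, 0, len(names))
	for _, name := range names {
		byName[name].lock.Lock()
		locked = append(locked, byName[name])
	}
	return func() {
		for i := len(locked) - 1; i >= 0; i-- {
			locked[i].lock.Unlock()
		}
	}
}

func main() {
	a, b := &volume{name: "vol1"}, &volume{name: "vol2"}

	// Whatever order the volumes arrive in, the locks are taken in the
	// same (sorted) order, so two concurrent callers cannot deadlock.
	unlock := lockVolumesInOrder([]*volume{b, a, a})
	defer unlock()
}
```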

mheon (Member) commented Oct 10, 2023

I am starting to wonder if locking all volumes on container create is strictly necessary. The database is robust enough to ensure that we cannot add to a volume that is already gone, so I am thinking we can drop the volume locks entirely and rely on the database to ensure safety in that case.
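
As an illustration of "rely on the database," here is a hedged, self-contained sketch using a toy SQLite schema (not Podman's actual schema or code, and it assumes the github.com/mattn/go-sqlite3 driver is available): once a foreign-key relationship ties container-volume rows to the volumes table, adding a container that references a just-removed volume fails at the database layer without any volume lock being held.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3" // assumed driver for this sketch
)

func main() {
	db, err := sql.Open("sqlite3", ":memory:")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	db.SetMaxOpenConns(1) // keep the in-memory DB and the PRAGMA on one connection

	// Toy schema: container_volumes rows must reference an existing volume.
	stmts := []string{
		`PRAGMA foreign_keys = ON;`,
		`CREATE TABLE volumes (name TEXT PRIMARY KEY);`,
		`CREATE TABLE container_volumes (
			container_id TEXT NOT NULL,
			volume_name  TEXT NOT NULL REFERENCES volumes(name)
		);`,
		`INSERT INTO volumes (name) VALUES ('vol1');`,
	}
	for _, s := range stmts {
		if _, err := db.Exec(s); err != nil {
			log.Fatal(err)
		}
	}

	// Simulate the race: the volume is removed between the "does it exist?"
	// check in the create path and the attempt to register the container.
	if _, err := db.Exec(`DELETE FROM volumes WHERE name = 'vol1'`); err != nil {
		log.Fatal(err)
	}

	// The insert is rejected by the constraint -- no per-volume lock required.
	_, err = db.Exec(`INSERT INTO container_volumes (container_id, volume_name)
	                  VALUES ('ctr1', 'vol1')`)
	fmt.Println("adding container volume:", err) // FOREIGN KEY constraint failed
}
```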

UniversalSuperBox (Author) commented Oct 10, 2023

The plot thickens... I haven't been able to reproduce this problem with two instances of Podman, only when three are running. This could be because I'm super unlucky, but I've had two instances running for some twenty minutes without hitting the deadlock.

I've attempted to retrieve stack traces by running killall -s STOP podman (thanks @Luap99) and then killall -s ABRT podman. However, it seems that the deadlock was cleared and execution continued before Go acted on the signal (you can see that after the point where I had to type fg to resume and wait on the processes). I might need to figure out how to attach a debugger to all of the processes to get their in situ stack traces.

https://gist.github.com/UniversalSuperBox/75564012d098d0f4c79bcf7d5dc878f6

Worth noting that I'm running Podman 4.7.0 inside podman machine for this test, so the podman info block in the OP isn't accurate. The issue seems the same, though, so we have that going for us.

Luap99 (Member) commented Oct 10, 2023

I'll try to take a look at the stack traces tomorrow.

As for a debugger, I use https://github.com/go-delve/delve: just dlv attach <pid>, then type grs -t as a command in the debugger. That should produce nice, readable stack traces.

UniversalSuperBox (Author) commented Oct 10, 2023

It was a little fiddly to get Podman's debug symbols and delve installed in podman-machine, but I got there in the end.

Stack traces, and the commands used to get the system to deadlock, are here: https://gist.github.com/UniversalSuperBox/1e855b07bdec0f4d79f4c117c09da353

Potentially worth noting: "Aside", which is the "odd one out", is in libpod.(*Container).prepare. It does not respond to any input or SIGINT. The other two respond to a SIGINT with INFO[0880] Received shutdown signal "interrupt", terminating!, but they hang again afterward and do not respond to further SIGINTs.

mheon (Member) commented Oct 10, 2023

After thinking about it more, I'm now confident that we can remove the volume locks from runtime_ctr.go without a serious issue. The worst scenario I can come up with is a podman volume rm racing against a podman run -v that uses the volume being removed: the podman volume rm -f can fail because the container was added to the volume after all containers using the volume had already been removed. That's a very narrow race.

Luap99 (Member) commented Oct 11, 2023

Given your reproducer and stack traces, I think it is what I said in #20313 (comment).

The two processes are stuck in setupContainer() while locking the volumes; the third one is simply waiting in the prepare call, as it just wants access to its volume but never gets it due to the deadlock between the other two.

I agree with @mheon here; the locking is unnecessary in this place. The DB logic should prevent this from happening, and even if the locks were really needed, they are already mostly useless because the volume can be removed before it is locked.

Luap99 self-assigned this Oct 11, 2023
vrothberg (Member) commented

> I agree with @mheon here; the locking is unnecessary in this place. The DB logic should prevent this from happening, and even if the locks were really needed, they are already mostly useless because the volume can be removed before it is locked.

Agreed. Removing such objects in parallel to trying to use them is always racy and not something Podman can prevent. +1 to removing the locking there.

Luap99 added a commit to Luap99/libpod that referenced this issue Oct 11, 2023
When containers are created with a named volume it can deadlock, because
the create logic tried to lock all volumes in a loop. This is fine if
only a single container is ever created at any given time. However,
because multiple containers can be created at the same time, they can
deadlock on the volumes. This is because the order of the loop is not
stable; in fact, it is based on the order in which the volumes were
specified on the cli.

So if you create two containers at the same time, one with
`-v vol1:/dir1 -v vol2:/dir2` and the other with
`-v vol2:/dir2 -v vol1:/dir1`, then there is a chance of a deadlock.

Now, one solution could be to order the volumes to prevent the issue,
but the reason for holding the lock is dubious. The goal was to prevent
the volume from being removed in the meantime. However, that could
still have happened before we acquired the lock, so it didn't protect
against that.

Both boltdb and sqlite already prevent us from adding a container with
volumes that do not exist, due to their internal consistency checks.
Sqlite even uses FOREIGN KEY relationships, so the schema will prevent
us from doing anything wrong.

The create code currently first checks if the volume exists and, if
not, creates it. I have checked that the db will guarantee that this
will not work:
Boltdb: `no volume with name test2 found in database when adding container xxx: no such volume`
Sqlite: `adding container volume test2 to database: FOREIGN KEY constraint failed`

Keep in mind that this error is normally not seen; only if the volume
is removed between the volume-exists check and adding the container to
the db will this message be seen, which is an acceptable race and a
pre-existing condition anyway.

[NO NEW TESTS NEEDED] Race condition, hard to test in CI.

Fixes containers#20313

Signed-off-by: Paul Holzinger <[email protected]>
Luap99 (Member) commented Oct 11, 2023

#20329 should fix it.

UniversalSuperBox changed the title from "Deadlock between setupContainer and lockNamedVolume" to "Deadlock between multiple container creates happening simultaneously" Oct 11, 2023
UniversalSuperBox changed the title from "Deadlock between multiple container creates happening simultaneously" to "Deadlock between multiple container creates with similar volumes" Oct 11, 2023
UniversalSuperBox (Author) commented

I've updated the issue title and added a note to the description to help future travelers. For context on how we found this and why I was confused: we have a server that runs jobs, triggered by systemd timers, which include podman run calls. We usually had two to three containers starting up at a time, and only the third container had its volumes in a different order from the other two. I failed to notice that the deadlock only occurred in production when the container with the other volume order ran.

Thanks to everyone for your quick response!

mheon pushed a commit to mheon/libpod that referenced this issue Jan 9, 2024
[Same commit message as the Luap99 commit above.]

Signed-off-by: Paul Holzinger <[email protected]>

MH: Backported to v4.6.1-rhel per RH Jira RHEL-20910 and RHEL-20911.

Signed-off-by: Matt Heon <[email protected]>
github-actions bot added the locked - please file new issue/PR label Jan 10, 2024
github-actions bot locked the issue as resolved and limited conversation to collaborators Jan 10, 2024
TomSweeneyRedHat pushed a commit to TomSweeneyRedHat/podman that referenced this issue Mar 15, 2024
Cherry pick from containers#20329
Addresses: https://issues.redhat.com/browse/RHEL-14744 and https://issues.redhat.com/browse/RHEL-14743

[Same commit message as the Luap99 commit above.]

Signed-off-by: Paul Holzinger <[email protected]>
Signed-off-by: tomsweeneyredhat <[email protected]>