Fix podman rm -fa with dependencies #18507

Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: mheon.
Force-pushed from 165f62e to 0e5c3d4.
Can you squash the lint fixes into the correct commits, please?
Force-pushed from e7d1a55 to 9a5b581.
Wow, I didn't even expect the stderr check to fail, but this is exactly why it is needed:
Because you have to repush anyway: during the demo I also saw 9bd833b, feel free to revert that as well.
In case this isn't clear: this makes me think of #18452, except it happens consistently in your PR, so it is not a flake caused by your change. To me this sounds like we are trying to stop a container which is already stopped.
Since both runtimes are doing it I can't complain too hard, but this definitely feels like unnecessary error logging that is just going to cause problems... There is no way I can see to fully eliminate the TOCTOU around stopping containers, so the best thing for Podman to do is to just go ahead and kill the container, and then handle errors from the runtime as necessary (usually by ignoring them - Podman exits cleanly here because it knows that nothing went wrong: the container is stopped, just not by us). As to why my change made it 100% consistent... I have some ideas (parallel removal of both the dependency and the dependent container in two threads, with the container with dependencies trying to recurse to the dependent, introduces a lot of potentially interesting race conditions), but again, it shouldn't matter - Podman did the right thing and ate the error; the only issue is the OCI runtimes insisting on printing unnecessary warnings.
I agree there is always a race and we have little control over that. But as a user I really do not want to see an error reported; it just makes no sense. Problem 1 is that we pass the plain stdout/stderr streams down to the OCI runtime, which means we have zero control over the output it prints.
My current thinking is as follows: to remove a container, we have to stop it, which unlocks the container while we wait for it to successfully terminate. Given that there are two threads trying to remove the first container (the original thread, which presumably gets in first and successfully stops and removes it, and the second thread, which started removing the other container, tried to recurse to its dependency container, found it locked, but was able to grab the lock when stop unlocked it), we probably have two stopContainer operations overlapping.
Re-pushed with the extra commit reverted, and changed removeContainer to accept a struct. I'll keep thinking about the OCI runtime thing. I'm leaning towards not printing runtime STDOUT/STDERR unless …
I think we might find that the runtime is responsible for some fatal errors where it's good to have users see the messages; like this, for a very quick example: podman/test/system/410-selinux.bats, lines 202 to 213 at 287a419.
Also, @mheon, tests are failing with a panic. Looks like an arg mismatch or a deref of an uninitialized pointer.
Yes, we use …
I don't know what kind of errors the runtimes report, but there could be a lot of ones we actually want to see, as @edsantiago shows. And hiding them at debug level will make flake catching/debugging much harder in CI.
Absolutely. In case of an error, we need to think of a way to add the stdout and stderr logs to the error message. In case there is no error but there are some logs on stdout/stderr, then I think they should go into the debug log.
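A hedged sketch of what folding the captured output into the error could look like (the helper name and shape are invented for illustration, not the actual libpod change): capture the runtime's stdout/stderr in buffers instead of passing our own streams through, attach stderr to the returned error on failure, and demote it to a debug log otherwise.

```go
package ocisketch

import (
	"bytes"
	"fmt"
	"os/exec"
	"strings"

	"github.com/sirupsen/logrus"
)

// runRuntimeCmd is illustrative only: it runs the OCI runtime with captured
// output so Podman, not the runtime, decides what the user sees.
func runRuntimeCmd(runtimePath string, args ...string) error {
	var stdout, stderr bytes.Buffer
	cmd := exec.Command(runtimePath, args...)
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr

	if err := cmd.Run(); err != nil {
		// On failure, surface the runtime's stderr as part of the error message.
		return fmt.Errorf("%s %s: %w (stderr: %s)",
			runtimePath, strings.Join(args, " "), err, strings.TrimSpace(stderr.String()))
	}
	// On success, any leftover chatter from the runtime goes to the debug log.
	if stderr.Len() > 0 {
		logrus.Debugf("%s stderr: %s", runtimePath, strings.TrimSpace(stderr.String()))
	}
	return nil
}
```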
Agreed, but this is not something this PR should do, because it would hide the issue: this patch is currently broken. It tries to kill the same container twice, which is not something we should do. It is trivial to reproduce locally; the flake happens in more than 50% of the tries:

$ bin/podman run --name t1 -d alpine top && bin/podman run --name t2 --network container:t1 -d alpine top && bin/podman rm -fa
23cf575c1dcfcc865ac8332957eaf1e111acc475ff9f6a225636968aae8d1de1
6b7a43da9f040230a53ede1b04dc1a907ebc0a3a7bf194fc6372b7229af035f6
ERRO[0000] Sending signal 15 to container 6b7a43da9f040230a53ede1b04dc1a907ebc0a3a7bf194fc6372b7229af035f6
ERRO[0000] Sending signal 15 to container 6b7a43da9f040230a53ede1b04dc1a907ebc0a3a7bf194fc6372b7229af035f6
2023-05-10T08:40:02.420152Z: open pidfd: No such process
ERRO[0000] Sending signal 15 to container 23cf575c1dcfcc865ac8332957eaf1e111acc475ff9f6a225636968aae8d1de1
6b7a43da9f040230a53ede1b04dc1a907ebc0a3a7bf194fc6372b7229af035f6
23cf575c1dcfcc865ac8332957eaf1e111acc475ff9f6a225636968aae8d1de1

Note: I patched in the logrus.Error to prove that podman is actually trying to kill it twice:

diff --git a/libpod/oci_conmon_common.go b/libpod/oci_conmon_common.go
index d803276b1..99a6dccc5 100644
--- a/libpod/oci_conmon_common.go
+++ b/libpod/oci_conmon_common.go
@@ -356,7 +356,7 @@ func generateResourceFile(res *spec.LinuxResources) (string, []string, error) {
 // If all is set, send to all PIDs in the container.
 // All is only supported if the container created cgroups.
 func (r *ConmonOCIRuntime) KillContainer(ctr *Container, signal uint, all bool) error {
-	logrus.Debugf("Sending signal %d to container %s", signal, ctr.ID())
+	logrus.Errorf("Sending signal %d to container %s", signal, ctr.ID())
 	runtimeDir, err := util.GetRuntimeDir()
 	if err != nil {
 		return err
I fully expected this, and honestly I don't see it as much of a problem modulo the OCI runtime error - there are no real consequences for the double kill beyond a few wasted syscalls. If you want it fixed I will look into it, but it will be a substantial change.
Maybe I am missing something, but to me this looks like our locking is just not working correctly. The problem here is clearly two threads trying to stop in parallel. So why do we allow two threads to change the container state at the same time? If locking worked as I expect, the second thread should be blocked, and when it gets the lock it must check for stopped/removed and not try to do the same thing again.
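A minimal sketch of the check-after-lock pattern described here (invented types, not libpod's; the mid-stop unlock that complicates the real code is left out): whoever acquires the lock second must re-read the state and bail out instead of repeating the stop.

```go
package main

import (
	"fmt"
	"sync"
)

// ctrState and container are stand-ins for libpod's real types.
type ctrState int

const (
	stateRunning ctrState = iota
	stateStopped
	stateRemoved
)

type container struct {
	mu    sync.Mutex
	state ctrState
}

// stop re-checks the state after taking the lock, so a caller that was
// blocked on the lock becomes a no-op instead of sending a second kill.
func (c *container) stop() {
	c.mu.Lock()
	defer c.mu.Unlock()

	if c.state == stateStopped || c.state == stateRemoved {
		// Someone else stopped/removed it while we waited; nothing to do.
		return
	}
	// ... deliver the signal here ...
	c.state = stateStopped
	fmt.Println("stopped the container")
}

func main() {
	c := &container{state: stateRunning}
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.stop() // only one of the two goroutines actually stops it
		}()
	}
	wg.Wait()
}
```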
Not sure whether that applies to this very symptom, but we decided to allow stopping a container even if it's in the …
My current tracing supports what @vrothberg is saying. Stopping the container deliberately unlocks (to ensure …
Added the second, in-memory container lock we were talking about; let's see if the tests start deadlocking.
No joy on the tests, @mheon.
This reverts commit 9bd833b. With the fix for `podman rm -fa` merged, we no longer require this patch. Signed-off-by: Matthew Heon <[email protected]>
We had something like 6 different boolean options (removing a container turns out to be rather complicated, because there are a million-odd things that want to do it), and the function signature was getting unreasonably large. Change to a struct to clean things up. Signed-off-by: Matthew Heon <[email protected]>
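An illustrative sketch of the refactor this commit describes, with made-up field names rather than the real libpod signature: the pile of positional booleans becomes a single options struct, so call sites stay readable and new behaviors don't require touching every caller.

```go
package libpodsketch

import "context"

// ctrRmOpts is a hypothetical options struct standing in for the real one.
type ctrRmOpts struct {
	Force        bool  // remove even if the container is running
	RemoveVolume bool  // also remove anonymous volumes
	RemoveDeps   bool  // recurse into dependent containers
	IgnoreDeps   bool  // skip dependency checks entirely
	Timeout      *uint // optional stop-timeout override
}

// removeContainer shows the new shape of the call: one struct instead of a
// long list of bools whose order every caller had to get right.
func removeContainer(ctx context.Context, ctrID string, opts ctrRmOpts) error {
	_, _, _ = ctx, ctrID, opts
	// ... the actual removal logic lives in libpod ...
	return nil
}

// Example call site:
//
//	err := removeContainer(ctx, id, ctrRmOpts{Force: true, RemoveDeps: true})
```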
libpod/runtime_ctr.go (outdated)

	c.lock.Lock()
	defer c.lock.Unlock()
	c.perProcessLock.Lock()
Thinking about this more, couldn't that cause a deadlock?

| thread 1 | thread 2 |
| --- | --- |
| c lock() | |
| perProcess lock() | |
| # now trying to stop | |
| c unlock() | |
| | c lock() (lock taken) |
| | perProcess lock() # hangs on this lock |
| c lock() after stop is now hanging | |

---> possible ABBA deadlock

Or maybe I am missing something here; the CI logs do not show any deadlock but instead still show the "no such process" runtime logs, so I assume the lock isn't working right now.
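To make the interleaving in the table concrete, here is the same ABBA pattern reduced to a runnable Go sketch (two plain mutexes standing in for the container lock and the per-process lock); with the timing shown, both goroutines end up blocked on the lock the other one holds.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var ctrLock, perProcessLock sync.Mutex

	go func() { // thread 1
		ctrLock.Lock()
		perProcessLock.Lock()
		ctrLock.Unlock() // released while waiting for the container to stop
		time.Sleep(10 * time.Millisecond)
		ctrLock.Lock() // hangs: thread 2 holds ctrLock and is waiting on perProcessLock
		ctrLock.Unlock()
		perProcessLock.Unlock()
	}()

	go func() { // thread 2
		time.Sleep(time.Millisecond)
		ctrLock.Lock()        // grabbed as soon as thread 1 released it
		perProcessLock.Lock() // hangs: thread 1 still holds it -> ABBA deadlock
		perProcessLock.Unlock()
		ctrLock.Unlock()
	}()

	time.Sleep(time.Second)
	fmt.Println("both goroutines are still blocked: the deadlock the table predicts")
}
```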
I spent a few days digging into this, and the results are deeply concerning. Findings are summarized in the commit message for 7f43cf5ef0137ccefd35e27e733b16a4009d36b5 in this PR. In short, I think it's fixed, but locks have been very broken within the same process for a very long time, and it seems that sync.Mutex also does not behave as we expect.
I am considering removing the per-process lock given this, but I'd like a signoff from @vrothberg, since I recall he felt it was needed.
Things are still failing, but it looks like it's back to the original issue (unlocking during stop -> second thread stopping container locks and continues) and no longer due to defective Libpod locks. So it's turned from a 100% thing into a race, albeit a very common race.
The good news is that the last commit does appear to have resolved our in-memory locks not actually providing mutual exclusion within the same thread. The bad news is that the sync.Mutex is, indeed, not doing its job and it seems that it's still possible for multiple threads to stop the container simultaneously. I am at a loss for why that might be. Will investigate further.
I understand now. We have multiple copies of the Container struct, so multiple copies of the sync.Mutex, so it's not protecting anything. Looking into making it a pointer and having the DB share it between all containers.
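A minimal sketch of this failure mode: when the struct (and the sync.Mutex inside it) is copied, each goroutine locks its own copy, so nothing is serialized. go vet's copylocks check flags exactly this pattern, and sharing the struct or the mutex by pointer is the fix described above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type container struct {
	mu sync.Mutex
	id string
}

// kill takes the container BY VALUE, so the mutex is copied along with it
// and locking it never excludes anyone else (go vet: "passes lock by value").
func kill(c container) {
	c.mu.Lock()
	defer c.mu.Unlock()
	fmt.Println("killing", c.id)
	time.Sleep(10 * time.Millisecond) // both goroutines sit here concurrently
}

func main() {
	ctr := container{id: "abc123"}

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			kill(ctr) // each call receives a fresh copy of struct + mutex
		}()
	}
	wg.Wait()
	// Fix: pass *container (or keep the sync.Mutex behind a pointer shared by
	// all copies loaded from the DB) so every caller locks the same instance.
}
```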
Yes, but I don't think it is worth it, as I really believe it will cause deadlocks - see my graph above.
Hm. Can we solve that by taking the per-process lock after the main container lock, but retaining the behavior where it is released second? This might also merit discussing during the watercooler; I think we need some sort of fix here, because the race is pretty nasty in CI.
Noting the decision for posterity: we will remove the extra lock, and just ignore the error messages from the OCI runtime specific to killing an already-dead container on runc/crun (though we can still print them as debug logs).
@@ -136,6 +136,14 @@ shm_struct_t *setup_lock_shm(char *path, uint32_t num_locks, int *error_code) {
     goto CLEANUP_UNMAP;
   }

+  // Ensure that recursive locking of a mutex by the same OS thread (which may
+  // refer to numerous goroutines) blocks.
+  ret_code = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_NORMAL);
PTHREAD_MUTEX_DEFAULT
Attempting to recursively lock a mutex of this type results in undefined behavior. Attempting to unlock a mutex of this type which was not locked by the calling thread results in undefined behavior. Attempting to unlock a mutex of this type which is not locked results in undefined behavior. An implementation may map this mutex to one of the other mutex types.
What a lovely default.
Note that the shm lock code already does the required runtime.LockOSThread() on lock, so the OS thread will not be reused until we call runtime.UnlockOSThread() in the unlock call. Therefore it should be impossible for Go to schedule two goroutines on this thread, and we should never run into the situation where we actually need this. Also see TestLockAndUnlockTwoSemaphore.
Also, on my Fedora laptop PTHREAD_MUTEX_DEFAULT already maps to PTHREAD_MUTEX_NORMAL, so this shouldn't make a difference, at least on current Fedora. Of course it is still required to set it explicitly, given that the default is not specified anywhere and other C libraries may behave differently.
You'd think that lock/unlock OS thread would do that, but I'm actually seeing that, when we start a parallel container removal (5 containers, and 48 jobs), every single one of them starts with the same PID. I looked to see if we were holding a lock that might force every part of Podman onto the same OS thread (and basically eliminate threading within Podman), but that didn't seem to be the case?
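A small debugging sketch (not libpod code) of how that observation could be checked: each goroutine locks its OS thread and reports its Linux thread ID via syscall.Gettid while all of them are locked concurrently; per the Go docs they should report distinct TIDs, so identical TIDs would match the surprising behavior described above.

```go
//go:build linux

package main

import (
	"fmt"
	"runtime"
	"sync"
	"syscall"
	"time"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			runtime.LockOSThread()
			defer runtime.UnlockOSThread()
			// Sleep so all five goroutines hold LockOSThread at the same time;
			// they must then be wired to five distinct OS threads.
			time.Sleep(100 * time.Millisecond)
			fmt.Printf("goroutine %d on OS thread %d\n", n, syscall.Gettid())
		}(i)
	}
	wg.Wait()
}
```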
Regardless, it seems that Golang is still capable of swapping between goroutines at will on a given thread, even after LockOSThread() is run. It just guarantees that the goroutine in question will only start on the thread in question, not that it won't schedule others there as well.
That goes directly against the Go docs for this function:
LockOSThread wires the calling goroutine to its current operating system thread. The calling goroutine will always execute in that thread, and no other goroutine will execute in it, until the calling goroutine has made as many calls to UnlockOSThread as to LockOSThread. If the calling goroutine exits without unlocking the thread, the thread will be terminated.
If it actually still executes other goroutines on that thread, we have a major issue. We also use this to switch namespaces, set SELinux labels, etc. If this doesn't work right now, it needs to be fixed in golang.
This fixes a lint issue, but I'm keeping it in its own commit so it can be reverted independently if necessary; I don't know what side effects this may have. I don't *think* there are any issues, but I'm not sure why it wasn't a pointer in the first place, so there may have been a reason. Signed-off-by: Matthew Heon <[email protected]>
This was causing some CI flakes. I'm pretty sure that the pods being removed already isn't a bug, but just the result of another container in the pod removing it first - so no reason not to ignore the errors. Signed-off-by: Matthew Heon <[email protected]>
If the first container to get the pod lock is the infra container it's going to want to remove the entire pod, which will also remove every other container in the pod. Subsequent containers will get the pod lock and try to access the pod, only to realize it no longer exists - and that, actually, the container being removed also no longer exists. Signed-off-by: Matthew Heon <[email protected]>
Signed-off-by: Matthew Heon <[email protected]>
We use shared-memory pthread mutexes to handle mutual exclusion in Libpod. It turns out that these have configurable options for how to handle a recursive lock (IE, a thread trying to lock a lock that the same thread had previously locked). The mutex can either deadlock, or allow the duplicate lock without deadlocking. Default behavior is, helpfully, unspecified, so if not explicitly set there is no clear indication of which of these behaviors will be seen. Unfortunately, today is the first I learned of this, so our initial implementation did *not* explicitly set our preferred behavior. This turns out to be a major problem with a language like Golang, where multiple goroutines can (and often do) use the same OS thread. So we can have two goroutines trying to stop the same container, and if the no-deadlock mutex behavior is in use, both threads will successfully acquire the lock because the C library, not knowing about Go's lightweight threads, sees the same PID trying to lock a mutex twice, and allows it without question. It appears that, at least on Fedora/RHEL/Debian libc, the default (unspecified) behavior of the locks is the non-deadlocking version - so, effectively, our locks have been of questionable utility within the same Podman process for the last four years. This is somewhat concerning. What's even more concerning is that the Golang-native sync.Mutex that was also in use did nothing to prevent the duplicate locking (I don't know if I like the implications of this). Anyways, this resolves the major issue of our locks not working correctly by explicitly setting the correct pthread mutex behavior. Signed-off-by: Matthew Heon <[email protected]>
Force-pushed from 40620ff to ec7c6c1.
There are certain messages logged by OCI runtimes when killing a container that has already stopped that we really do not care about when stopping a container. Due to our architecture, there are inherent races around stopping containers, and so we cannot guarantee that *we* are the people to kill it - but that doesn't matter because Podman only cares that the container has stopped, not who delivered the fatal signal. Unfortunately, the OCI runtimes don't understand this, and log various warning messages when the `kill` command is invoked on a container that was already dead. These cause our tests to fail, as we now check for clean STDERR when running Podman. To work around this, capture STDERR for the OCI runtime in a buffer only for stopping containers, and go through and discard any of the warnings we identified as spurious. Signed-off-by: Matthew Heon <[email protected]>
Tests passing, this is ready now. @Luap99 PTAL
	// Before handling error from KillContainer, convert STDERR to a []string
	// (one string per line of output) and print it, ignoring known OCI runtime
	// errors that we don't care about
	stderrLines := strings.Split(stderr.String(), "\n")
	for _, line := range stderrLines {
		if line == "" {
			continue
		}
		if strings.Contains(line, "container not running") || strings.Contains(line, "open pidfd: No such process") {
			logrus.Debugf("Failure to kill container (already stopped?): logged %s", line)
			continue
		}
		fmt.Fprintf(os.Stderr, "%s\n", line)
Should we do this only in the err != nil case?
I am not happy with the string matches either, but I cannot offer anything better, so I am OK with it. I guess CI will tell us quickly if these strings ever change.
I was thinking we'd still want to log warnings even in the err == nil case, but I don't feel that strongly and would be happy to downgrade them to debug logs if there is no error.
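For reference, a sketch of the variant under discussion (same string matches, hedged - not necessarily what was merged): known "already stopped" noise always goes to the debug log, and anything else is forwarded to our stderr only when the kill itself returned an error.

```go
package ocisketch

import (
	"fmt"
	"os"
	"strings"

	"github.com/sirupsen/logrus"
)

// filterKillStderr is illustrative only: stderrOutput is the captured buffer
// from the runtime's kill invocation, killErr the error (if any) it returned.
func filterKillStderr(stderrOutput string, killErr error) {
	for _, line := range strings.Split(stderrOutput, "\n") {
		if line == "" {
			continue
		}
		if strings.Contains(line, "container not running") ||
			strings.Contains(line, "open pidfd: No such process") {
			logrus.Debugf("Failure to kill container (already stopped?): logged %s", line)
			continue
		}
		if killErr != nil {
			// Only surface unexpected runtime output when the kill actually failed.
			fmt.Fprintf(os.Stderr, "%s\n", line)
			continue
		}
		logrus.Debugf("OCI runtime stderr (kill succeeded): %s", line)
	}
}
```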
Yeah, honestly I'm not sure either; I am good with this as well.
The PR is green in this state, which is a very good sign.
LGTM
@containers/podman-maintainers PTAL
LGTM
libpod/runtime_ctr.go (outdated)

	}

	logrus.Infof("Removing pod %s (dependency of container %s)", pod.ID(), c.ID())
	if err := r.removePod(ctx, pod, true, force, timeout); err != nil {
nit: This should just have returned r.removePod...
No need to hold up merge.
/lgtm
This is a major consolidation of the logic for removing dependency containers.

Previously, the logic was contained in a separate function which ran prior to actually removing the container. This didn't work properly, and was honestly confusing in its flow - we had two completely separate paths for removing containers based on whether they had, or did not have, a dependency. This consolidates all removal logic into the existing container removal functions. As part of this, we needed to make some signature changes, as removeContainer can now remove one or more containers and zero or more pods, and we should probably let the user know what containers and pods were removed as part of this. Pod removal also needed to be changed to report what containers were removed through it (this was also plumbed through the remote API, but is not yet exposed via the podman pod rm CLI, because I feel it would be a breaking change to do so).

Fixes #18180
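A hedged sketch of the reporting shape described above (names invented, not libpod's actual API): a single removal call returns every container and pod that went away, which is the kind of information that then gets plumbed through the remote API.

```go
package libpodsketch

import "context"

// removeReport is illustrative: removing one container can cascade into
// removing several containers and pods, and callers want to know which.
type removeReport struct {
	RemovedCtrs map[string]error // container ID -> removal error (nil on success)
	RemovedPods map[string]error // pod ID -> removal error (nil on success)
}

// removeContainerAndDeps sketches the consolidated entry point: one call, one
// report of everything removed. Hypothetical signature, not the real one.
func removeContainerAndDeps(ctx context.Context, ctrID string, force bool) (removeReport, error) {
	report := removeReport{
		RemovedCtrs: map[string]error{},
		RemovedPods: map[string]error{},
	}
	_, _ = ctx, force
	// ... recurse through dependent containers, removing pods where required,
	// recording each removal in the report ...
	report.RemovedCtrs[ctrID] = nil
	return report, nil
}
```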
Does this PR introduce a user-facing change?