Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pasta: IPv4 large transfer loopback: socat waitpid no child process #23482

Open
edsantiago opened this issue Aug 1, 2024 · 7 comments
Open
Labels
flakes Flakes from Continuous Integration pasta pasta(1) bugs or features

Comments

@edsantiago
Copy link
Member

edsantiago commented Aug 1, 2024

         not ok 187 |505| UDP%2FIPv4 large transfer, loopback in 3178ms
...
<+371ms> # $ podman run --rm --name=c-socat-t187-j170t3a5 --net=pasta -p [127.0.0.1]:5298:5298/udp quay.io/libpod/testimage:20240123 sh -c for port in $(seq 5298 5298); do                              socat -u UDP4-LISTEN:${port},bind=[127.0.0.1],null-eof EXEC:md5sum &                          done; wait
<+2.11s> # 70aef73d20b2fa83e5f3f6b04419c2c6  -
         # 2024/08/01 17:02:29 socat[3] W waitpid(0, 0x7ffc317a676c, 0): No child process
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #|     FAIL: Mismatch between data sent and received
         # #| expected: = 70aef73d20b2fa83e5f3f6b04419c2c6  -
         # #|   actual:   70aef73d20b2fa83e5f3f6b04419c2c6  -
         # #|         >   2024/08/01 17:02:29 socat\[3\] W waitpid\(0\, 0x7ffc317a676c\, 0\): No child process
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Only in my parallel-systest PR

x x x x x x
sys(2) podman(2) fedora-39(1) rootless(2) host(2) boltdb(1)
fedora-40(1) sqlite(1)
@edsantiago edsantiago added flakes Flakes from Continuous Integration pasta pasta(1) bugs or features labels Aug 1, 2024
@sbrivio-rh
Copy link
Collaborator

sbrivio-rh commented Aug 2, 2024

The test actually succeeds, but we include socat's stderr in the output string. That waitpid() doing nothing appears harmless to me, so I wonder if we should just drop socat's stderr here.

Redirecting socat's stderr to /dev/null is not great, though, because if there's an actual error we would have a hard time finding it.

I wonder if we should use another variable other than run_podman()'s $output, or tweak run_podman() to avoid merging stderr and stdout (my understanding of it is that it does that, at the moment).

@Luap99
Copy link
Member

Luap99 commented Aug 2, 2024

the merging of $output is done by bats by default although newer bats version have --separate-stderr for run that can be used which would need to be added to run_podman

But in general the question is why do we get this warning in the first place? Is this a bug in socat?

@sbrivio-rh
Copy link
Collaborator

But in general the question is why do we get this warning in the first place? Is this a bug in socat?

I didn't reproduce this, so it's all a bit tentative, but I think xioshutdown() (other waitpid() warnings are printed differently) is called "too late", once the receiving child process (in the container) already terminated.

Which is fine, as xioshutdown() just needs to ensure that the child process terminates... but yes, it prints a warning. Perhaps we could fix this in socat by omitting that warning if the error is ECHILD, say:

diff --git a/xioshutdown.c b/xioshutdown.c
index c9fef8d..9290be6 100644
--- a/xioshutdown.c
+++ b/xioshutdown.c
@@ -134,8 +134,10 @@ int xioshutdown(xiofile_t *sock, int how) {
 	 Alarm(1 /*! sock->stream.para.exec.waitdie */);
 #endif /* !HAVE_SETITIMER */
 	 if (Waitpid(sock->stream.para.exec.pid, &status, 0) < 0) {
-	    Warn3("waitpid("F_pid", %p, 0): %s",
-		  sock->stream.para.exec.pid, &status, strerror(errno));
+	    if (errno != ECHILD) {
+	       Warn3("waitpid("F_pid", %p, 0): %s",
+	             sock->stream.para.exec.pid, &status, strerror(errno));
+            }
 	 }
 	 Alarm(0);
       }

but I would need to find out how to reproduce this, before being confident enough that the patch is correct.

@Luap99
Copy link
Member

Luap99 commented Aug 2, 2024

right but in order to fail with ECHILD it means socat must have called waitpid more often than it has child processes which sounds like the real bug to me. OF course hiding the warning on ECHILD is simple.

As far as reproducing goes check out PR #23275 and run
hack/bats --rootless --tag=ci:parallel -T 505

I see #23471 (maybe directly related?) on basically every run locally on my 12 thread laptop. I think I might have seen this issue as well but not in the few runs I tried today so this one might be a bit more difficult to trigger.

And @edsantiago I am also seeing a ton of Failed to bind port 5328 (Address already in use) flakes so I don't think the random_free_port logic based on job slot is working...

@dgibson
Copy link
Collaborator

dgibson commented Aug 6, 2024

As far as reproducing goes check out PR #23275 and run hack/bats --rootless --tag=ci:parallel -T 505

I see #23471 (maybe directly related?) on basically every run locally on my 12 thread laptop. I think I might have seen this issue as well but not in the few runs I tried today so this one might be a bit more difficult to trigger.

Tried this (on commit 67ebec1ae19141f56b390df334a2f0b104283b35 specifically), but was not able to reproduce (on my 20 thread laptop). I did once see what I think is not the original bug from #23471, but the TCP problem described later.

So, I don't really enough information to guess what's going wrong here.

@Luap99 any chance you could try to get an strace -f of the failing testcase? There's a pretty high chance that will change the timing enough that it won't reproduce any more, but it's worth a shot.

@edsantiago
Copy link
Member Author

Still happening with pasta 08-21

@Luap99
Copy link
Member

Luap99 commented Aug 23, 2024

Yeah I don't think this has anything to do with the podman or pasta version. To me this sounds like a bug in socat where it tries to reap the child process twice. Looking at the socat code there is both a signal handler for SIGCHLD that reaps childs and then the waitpid call on shutdown for the exec pid which reaps again. Thus I would assume if the handler runs before the waipid call on shutdown we see this warning. So I agree with @sbrivio-rh patch above as this is harmless and we likely should just work around that for now in the tests.

edsantiago added a commit to edsantiago/libpod that referenced this issue Aug 27, 2024
Workaround (NOT A FIX) for pasta issue containers#23482, wherein
podman logs includes a waitpid: ESRCH warning. Consensus
seems to be that this is a bug in socat.

Signed-off-by: Ed Santiago <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flakes Flakes from Continuous Integration pasta pasta(1) bugs or features
Projects
None yet
Development

No branches or pull requests

4 participants