pasta: IPv4 large transfer loopback: socat waitpid no child process #23482

edsantiago · 2024-08-01T17:56:24Z

         not ok 187 |505| UDP%2FIPv4 large transfer, loopback in 3178ms
...
<+371ms> # $ podman run --rm --name=c-socat-t187-j170t3a5 --net=pasta -p [127.0.0.1]:5298:5298/udp quay.io/libpod/testimage:20240123 sh -c for port in $(seq 5298 5298); do                              socat -u UDP4-LISTEN:${port},bind=[127.0.0.1],null-eof EXEC:md5sum &                          done; wait
<+2.11s> # 70aef73d20b2fa83e5f3f6b04419c2c6  -
         # 2024/08/01 17:02:29 socat[3] W waitpid(0, 0x7ffc317a676c, 0): No child process
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #|     FAIL: Mismatch between data sent and received
         # #| expected: = 70aef73d20b2fa83e5f3f6b04419c2c6  -
         # #|   actual:   70aef73d20b2fa83e5f3f6b04419c2c6  -
         # #|         >   2024/08/01 17:02:29 socat\[3\] W waitpid\(0\, 0x7ffc317a676c\, 0\): No child process
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Only in my parallel-systest PR

fedora-39 : sys podman fedora-39 rootless host boltdb
- PR WIP: system test parallelization: two-pass approach #23275
  - 08-01 13:03 in [sys] |505| UDP%2FIPv4 large transfer, loopback
fedora-40 : sys podman fedora-40 rootless host sqlite
- PR WIP: system test parallelization: two-pass approach #23275
  - 07-29 10:31 in [sys] |505| UDP%2FIPv4 large transfer, loopback

x	x	x	x	x	x
sys(2)	podman(2)	fedora-39(1)	rootless(2)	host(2)	boltdb(1)
		fedora-40(1)			sqlite(1)

The text was updated successfully, but these errors were encountered:

sbrivio-rh · 2024-08-02T06:49:55Z

The test actually succeeds, but we include socat's stderr in the output string. That waitpid() doing nothing appears harmless to me, so I wonder if we should just drop socat's stderr here.

Redirecting socat's stderr to /dev/null is not great, though, because if there's an actual error we would have a hard time finding it.

I wonder if we should use another variable other than run_podman()'s $output, or tweak run_podman() to avoid merging stderr and stdout (my understanding of it is that it does that, at the moment).

Luap99 · 2024-08-02T08:47:34Z

the merging of $output is done by bats by default although newer bats version have --separate-stderr for run that can be used which would need to be added to run_podman

But in general the question is why do we get this warning in the first place? Is this a bug in socat?

sbrivio-rh · 2024-08-02T09:06:51Z

But in general the question is why do we get this warning in the first place? Is this a bug in socat?

I didn't reproduce this, so it's all a bit tentative, but I think xioshutdown() (other waitpid() warnings are printed differently) is called "too late", once the receiving child process (in the container) already terminated.

Which is fine, as xioshutdown() just needs to ensure that the child process terminates... but yes, it prints a warning. Perhaps we could fix this in socat by omitting that warning if the error is ECHILD, say:

diff --git a/xioshutdown.c b/xioshutdown.c
index c9fef8d..9290be6 100644
--- a/xioshutdown.c
+++ b/xioshutdown.c
@@ -134,8 +134,10 @@ int xioshutdown(xiofile_t *sock, int how) {
 	 Alarm(1 /*! sock->stream.para.exec.waitdie */);
 #endif /* !HAVE_SETITIMER */
 	 if (Waitpid(sock->stream.para.exec.pid, &status, 0) < 0) {
-	    Warn3("waitpid("F_pid", %p, 0): %s",
-		  sock->stream.para.exec.pid, &status, strerror(errno));
+	    if (errno != ECHILD) {
+	       Warn3("waitpid("F_pid", %p, 0): %s",
+	             sock->stream.para.exec.pid, &status, strerror(errno));
+            }
 	 }
 	 Alarm(0);
       }

but I would need to find out how to reproduce this, before being confident enough that the patch is correct.

Luap99 · 2024-08-02T09:32:25Z

right but in order to fail with ECHILD it means socat must have called waitpid more often than it has child processes which sounds like the real bug to me. OF course hiding the warning on ECHILD is simple.

As far as reproducing goes check out PR #23275 and run
hack/bats --rootless --tag=ci:parallel -T 505

I see #23471 (maybe directly related?) on basically every run locally on my 12 thread laptop. I think I might have seen this issue as well but not in the few runs I tried today so this one might be a bit more difficult to trigger.

And @edsantiago I am also seeing a ton of Failed to bind port 5328 (Address already in use) flakes so I don't think the random_free_port logic based on job slot is working...

dgibson · 2024-08-06T02:45:40Z

As far as reproducing goes check out PR #23275 and run hack/bats --rootless --tag=ci:parallel -T 505

I see #23471 (maybe directly related?) on basically every run locally on my 12 thread laptop. I think I might have seen this issue as well but not in the few runs I tried today so this one might be a bit more difficult to trigger.

Tried this (on commit 67ebec1ae19141f56b390df334a2f0b104283b35 specifically), but was not able to reproduce (on my 20 thread laptop). I did once see what I think is not the original bug from #23471, but the TCP problem described later.

So, I don't really enough information to guess what's going wrong here.

@Luap99 any chance you could try to get an strace -f of the failing testcase? There's a pretty high chance that will change the timing enough that it won't reproduce any more, but it's worth a shot.

edsantiago · 2024-08-23T15:27:27Z

Still happening with pasta 08-21

Luap99 · 2024-08-23T16:25:19Z

Yeah I don't think this has anything to do with the podman or pasta version. To me this sounds like a bug in socat where it tries to reap the child process twice. Looking at the socat code there is both a signal handler for SIGCHLD that reaps childs and then the waitpid call on shutdown for the exec pid which reaps again. Thus I would assume if the handler runs before the waipid call on shutdown we see this warning. So I agree with @sbrivio-rh patch above as this is harmless and we likely should just work around that for now in the tests.

Workaround (NOT A FIX) for pasta issue containers#23482, wherein podman logs includes a waitpid: ESRCH warning. Consensus seems to be that this is a bug in socat. Signed-off-by: Ed Santiago <[email protected]>

edsantiago added flakes Flakes from Continuous Integration pasta pasta(1) bugs or features labels Aug 1, 2024

sbrivio-rh mentioned this issue Aug 2, 2024

pasta, parallel: test command timeout #23471

Closed

Luap99 mentioned this issue Aug 12, 2024

test/system: pasta_test_do add explicit port check #23561

Merged

edsantiago mentioned this issue Aug 27, 2024

CI: flake workaround: ignore socat waitpid warnings #23777

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pasta: IPv4 large transfer loopback: socat waitpid no child process #23482

pasta: IPv4 large transfer loopback: socat waitpid no child process #23482

edsantiago commented Aug 1, 2024 •

edited

Loading

sbrivio-rh commented Aug 2, 2024 •

edited

Loading

Luap99 commented Aug 2, 2024

sbrivio-rh commented Aug 2, 2024

Luap99 commented Aug 2, 2024

dgibson commented Aug 6, 2024

edsantiago commented Aug 23, 2024

Luap99 commented Aug 23, 2024

pasta: IPv4 large transfer loopback: socat waitpid no child process #23482

pasta: IPv4 large transfer loopback: socat waitpid no child process #23482

Comments

edsantiago commented Aug 1, 2024 • edited Loading

sbrivio-rh commented Aug 2, 2024 • edited Loading

Luap99 commented Aug 2, 2024

sbrivio-rh commented Aug 2, 2024

Luap99 commented Aug 2, 2024

dgibson commented Aug 6, 2024

edsantiago commented Aug 23, 2024

Luap99 commented Aug 23, 2024

edsantiago commented Aug 1, 2024 •

edited

Loading

sbrivio-rh commented Aug 2, 2024 •

edited

Loading