timeout in sctp? forward expose? #14331

Closed
edsantiago opened this issue May 23, 2022 · 16 comments
Labels: flakes, locked - please file new issue/PR

Comments

@edsantiago
Member

There's a flake that seems to have cascading effects. I think, but am not sure, that the problem begins with the podman run forward sctp protocol test:

podman run forward sctp protocol
           /var/tmp/go/src/github.com/containers/podman/test/e2e/run_networking_test.go:542
         
         [BeforeEach] Podman run networking
           /var/tmp/go/src/github.com/containers/podman/test/e2e/run_networking_test.go:28
         [It] podman run forward sctp protocol
           /var/tmp/go/src/github.com/containers/podman/test/e2e/run_networking_test.go:542
         # podman [options] --log-level=info run --name=test -p 80/sctp -p 81/sctp quay.io/libpod/alpine:latest
         time="2022-05-23T11:18:18-05:00" level=info msg="/var/tmp/go/src/github.com/containers/podman/bin/podman filtering at log level info"
         time="2022-05-23T11:18:18-05:00" level=info msg="Setting parallel job count to 7"
         [AfterEach] Podman run networking
           /var/tmp/go/src/github.com/containers/podman/test/e2e/run_networking_test.go:37
         # podman [options] pod rm -fa -t 0
         # podman [options] rm -fa -t 0
         time="2022-05-23T11:20:24-05:00" level=info msg="Port reservation for SCTP is not supported"
         time="2022-05-23T11:20:24-05:00" level=info msg="Got Conmon PID as 347928"
         time="2022-05-23T11:20:24-05:00" level=info msg="Received shutdown.Stop(), terminating!" PID=347560
         c0b3e2b742ed097e53f23d3bde2379386a2c959f9c739feb573c3a1e5556f4ee
         output: c0b3e2b742ed097e53f23d3bde2379386a2c959f9c739feb573c3a1e5556f4ee
         
         

          • Failure [126.689 seconds]
         Podman run networking
         /var/tmp/go/src/github.com/containers/podman/test/e2e/run_networking_test.go:20
           podman run forward sctp protocol [It]
            /var/tmp/go/src/github.com/containers/podman/test/e2e/run_networking_test.go:542
         
           Timed out after 90.001s.
           Expected process to exit.  It did not.
         
           /var/tmp/go/src/github.com/containers/podman/test/utils/utils.go:368
         
           Full Stack Trace
           github.com/containers/podman/v4/test/utils.(*PodmanSession).WaitWithTimeout(0xc001b6a408, 0x5a)
           	/var/tmp/go/src/github.com/containers/podman/test/utils/utils.go:368 +0x10b
           github.com/containers/podman/v4/test/utils.(*PodmanSession).WaitWithDefaultTimeout(...)
            	/var/tmp/go/src/github.com/containers/podman/test/utils/utils.go:363
           github.com/containers/podman/v4/test/e2e.glob..func84.41()
            	/var/tmp/go/src/github.com/containers/podman/test/e2e/run_networking_test.go:545 +0x157
           github.com/containers/podman/v4/test/e2e.TestLibpod(0xc000106c00)
            	/var/tmp/go/src/github.com/containers/podman/test/e2e/common_test.go:102 +0xc5
           testing.tRunner(0xc000106c00, 0x15e4f20)
           	/usr/lib/golang/src/testing/testing.go:1203 +0xe5
           created by testing.(*T).Run
           	/usr/lib/golang/src/testing/testing.go:1248 +0x2b3
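
For anyone who wants to poke at this outside the e2e harness, here is a minimal standalone sketch of the pattern the test exercises: run the container publishing the two SCTP ports and fail if podman does not exit within the suite's 90-second default timeout. The podman command line and the timeout are taken from the log above; the Go wrapper itself is illustrative and is not the actual Ginkgo test code.

    // sctp_repro.go: hypothetical standalone reproducer, not part of the test suite.
    package main

    import (
        "context"
        "fmt"
        "os"
        "os/exec"
        "time"
    )

    func main() {
        // 90s mirrors the e2e suite's default timeout seen in the stack trace.
        ctx, cancel := context.WithTimeout(context.Background(), 90*time.Second)
        defer cancel()

        // Same command line as the failing test: publish two SCTP ports.
        cmd := exec.CommandContext(ctx, "podman", "run", "--name=test",
            "-p", "80/sctp", "-p", "81/sctp", "quay.io/libpod/alpine:latest")
        out, err := cmd.CombinedOutput()
        if ctx.Err() == context.DeadlineExceeded {
            fmt.Fprintln(os.Stderr, "flake reproduced: podman did not exit within 90s")
            os.Exit(1)
        }
        if err != nil {
            fmt.Fprintf(os.Stderr, "podman failed: %v\n%s", err, out)
            os.Exit(1)
        }
        fmt.Printf("container: %s", out)
    }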

When this happens, it seems to trigger flakes in the "podman run network expose host port 80 to container port" and "podman run network expose duplicate host port results in error" tests.

I'm just going to link to the sctp failures here. In each of the logs below, the sctp failure is not alone: there are always associated expose failures. I don't know which causes which.

Podman run networking [It] podman run forward sctp protocol

@edsantiago added the flakes label on May 23, 2022
@Luap99
Member

Luap99 commented May 24, 2022

Is this only on Fedora 35? If so, it uses CNI and not netavark.

I agree that it looks like the sctp test is causing issues, but I do not understand why: the test never actually checks the connection. It looks like a deadlock on the network backend lock, which would explain why all the other tests start timing out when they have to use the network.
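
To make the hypothesis concrete: if one podman process never releases an exclusive lock, every other process that needs it simply blocks, and from the test suite's point of view that looks like unrelated tests all timing out at once. The toy program below (my own illustration, not podman's actual locking code; the /tmp/netlock-demo path is made up) shows the effect with a plain flock.

    // lockdemo.go: toy illustration of one lock holder starving all other callers.
    package main

    import (
        "fmt"
        "os"
        "syscall"
        "time"
    )

    func main() {
        f, err := os.OpenFile("/tmp/netlock-demo", os.O_CREATE|os.O_RDWR, 0o600)
        if err != nil {
            panic(err)
        }
        defer f.Close()

        if len(os.Args) > 1 && os.Args[1] == "hold" {
            // Simulates the stuck process: take the lock and never release it.
            syscall.Flock(int(f.Fd()), syscall.LOCK_EX)
            fmt.Println("holding lock forever")
            select {}
        }

        // Simulates every other test: blocks here until the holder goes away,
        // which the e2e suite would report as a 90s timeout.
        start := time.Now()
        syscall.Flock(int(f.Fd()), syscall.LOCK_EX)
        fmt.Printf("got lock after %v\n", time.Since(start))
    }

Run "go run lockdemo.go hold" in one terminal and "go run lockdemo.go" in another: the second invocation hangs until the first is killed.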

@Luap99
Member

Luap99 commented May 24, 2022

OK, so this is weird. I see many "error opening database" errors in all the logs.

@mheon Any ideas?

@mheon
Member

mheon commented May 24, 2022

I don't think it's Podman deleting the DB, so the test harness seems a likely culprit - but I'm not sure exactly why it would be doing so. The cleanup code could be responsible, but that should only run after all instructions have completed. Is there an internal Ginkgo timeout that could fire the cleanup code if the test itself takes too long, maybe?

@edsantiago
Member Author

So far it seems to be f35 only. Here's yet another one.

I'm seeing a lot of yellow SLOW TEST warnings, but I haven't actually checked other CI runs to see if those are just normal.

@Luap99
Member

Luap99 commented May 24, 2022

Failures seem to happen all at the same time, so I think one process is holding the network lock, which then causes all other processes to block and time out.

I also found this message in one of the logs:

Execing /usr/sbin/iptables -t nat -L
         
Another app is currently holding the xtables lock. Perhaps you want to use the -w option?

So maybe the iptables lock is held for some reason, which in turn keeps the network lock blocked?

Are these the only failed logs? If so, the question is: what happened on or before May 15th?
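
If the xtables lock really is the bottleneck, the error message already hints at the mitigation: iptables' -w/--wait flag makes it wait for the lock instead of failing immediately. A small probe along these lines (my own sketch, not something the suite currently runs; needs root) would help distinguish "lock briefly contended" from "lock held indefinitely":

    // xtablesprobe.go: hypothetical probe for xtables lock contention.
    package main

    import (
        "fmt"
        "os/exec"
        "time"
    )

    func main() {
        start := time.Now()
        // -w 5: wait up to 5 seconds for the xtables lock before giving up.
        out, err := exec.Command("iptables", "-w", "5", "-t", "nat", "-L", "-n").CombinedOutput()
        if err != nil {
            fmt.Printf("still failing after %v: %v\n%s", time.Since(start), err, out)
            return
        }
        fmt.Printf("listed the nat table after %v (lock was free or only briefly held)\n", time.Since(start))
    }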

@Luap99
Member

Luap99 commented May 24, 2022

Looks like there was an image update merged on May 10th. #14016

@cevich Any chance you could get the package version differences between the old and new images? The interesting ones would be kernel, iptables, and containernetworking-plugins.

@cevich
Member

cevich commented May 24, 2022

Ya, that's within the 30-day window so the old images should still be around...

@cevich
Member

cevich commented May 24, 2022

...looking back in the git history, PR #14178 merged just before mine, and logs from that job are still available. Pick one of the platforms you're interested in, and (in the Cirrus-CI WebUI) there will be a package_versions section. For example, this F35 one: https://cirrus-ci.com/task/4571388802498560?logs=package_versions#L1

@edsantiago
Member Author

Here's a failure in just the expose and hostname tests, without sctp. Again, f35.

@cevich
Member

cevich commented May 24, 2022

From the current main:

  • kernel 5.17.4-200.fc35
  • iptables-legacy 1.8.7-13.fc35, nftables 1.0.0-1.fc35
  • containernetworking-plugins 1.1.0-1.fc35

From PR #14178:

  • kernel 5.16.20-200.fc35
  • iptables-legacy 1.8.7-13.fc35, nftables 1:1.0.0-1.fc35
  • containernetworking-plugins 1.1.0-1.fc35

From the f35 Ed just posted:

  • kernel 5.17.4-200.fc35
  • iptables-legacy 1.8.7-13.fc35, nftables 1.0.0-1.fc35
  • containernetworking-plugins 1.1.0-1.fc35

@edsantiago
Member Author

This is our number one flake right now, and it's happening multiple times per PR.

@Luap99
Member

Luap99 commented May 25, 2022

Thanks @cevich. Can I just replace the image id in cirrus.yaml and run the get_ci_vm script to get the old VM image?

@cevich
Member

cevich commented May 25, 2022

Can I just replace the image id in cirrus.yaml

Sometimes 😜 It depends on whether there were script changes associated with the image update. The safest thing to do is to check out the commit that has the image you want and use the script from there. Also, if you just need to poke around in the VM and don't intend to execute ./contrib/cirrus/runner.sh, you can speed things up a bit by hitting ctrl-c when it starts to compile podman...unless of course you need a compiled podman 😃

@Luap99
Member

Luap99 commented May 25, 2022

Thanks, I think I am fine; I was able to repro locally. Hopefully #14361 fixed the issue. We will know for sure in a couple of days.

@Luap99
Member

Luap99 commented May 30, 2022

@edsantiago Is this fixed?

@edsantiago
Member Author

The last instance I see is on May 25, so I'll treat it as fixed. Thank you!

@github-actions bot added the locked - please file new issue/PR label on Sep 20, 2023
@github-actions bot locked as resolved and limited conversation to collaborators on Sep 20, 2023