timeout in sctp? forward expose? #14331

Closed
edsantiago opened this issue May 23, 2022 · 16 comments
Labels: flakes, locked - please file new issue/PR

Comments

@edsantiago
Member

There's a flake that seems to have cascading effects. I think, but am not sure, that the problem begins with the podman run forward sctp protocol test:

podman run forward sctp protocol
           /var/tmp/go/src/github.com/containers/podman/test/e2e/run_networking_test.go:542
         
         [BeforeEach] Podman run networking
           /var/tmp/go/src/github.com/containers/podman/test/e2e/run_networking_test.go:28
         [It] podman run forward sctp protocol
           /var/tmp/go/src/github.com/containers/podman/test/e2e/run_networking_test.go:542
         # podman [options] --log-level=info run --name=test -p 80/sctp -p 81/sctp quay.io/libpod/alpine:latest
         time="2022-05-23T11:18:18-05:00" level=info msg="/var/tmp/go/src/github.com/containers/podman/bin/podman filtering at log level info"
         time="2022-05-23T11:18:18-05:00" level=info msg="Setting parallel job count to 7"
         [AfterEach] Podman run networking
           /var/tmp/go/src/github.com/containers/podman/test/e2e/run_networking_test.go:37
         # podman [options] pod rm -fa -t 0
         # podman [options] rm -fa -t 0
         time="2022-05-23T11:20:24-05:00" level=info msg="Port reservation for SCTP is not supported"
         time="2022-05-23T11:20:24-05:00" level=info msg="Got Conmon PID as 347928"
         time="2022-05-23T11:20:24-05:00" level=info msg="Received shutdown.Stop(), terminating!" PID=347560
         c0b3e2b742ed097e53f23d3bde2379386a2c959f9c739feb573c3a1e5556f4ee
         output: c0b3e2b742ed097e53f23d3bde2379386a2c959f9c739feb573c3a1e5556f4ee
         
         

          • Failure [126.689 seconds]
         Podman run networking
         /var/tmp/go/src/github.com/containers/podman/test/e2e/run_networking_test.go:20
           podman run forward sctp protocol [It]
            /var/tmp/go/src/github.com/containers/podman/test/e2e/run_networking_test.go:542
         
           Timed out after 90.001s.
           Expected process to exit.  It did not.
         
           /var/tmp/go/src/github.com/containers/podman/test/utils/utils.go:368
         
           Full Stack Trace
           github.com/containers/podman/v4/test/utils.(*PodmanSession).WaitWithTimeout(0xc001b6a408, 0x5a)
           	/var/tmp/go/src/github.com/containers/podman/test/utils/utils.go:368 +0x10b
           github.com/containers/podman/v4/test/utils.(*PodmanSession).WaitWithDefaultTimeout(...)
            	/var/tmp/go/src/github.com/containers/podman/test/utils/utils.go:363
           github.com/containers/podman/v4/test/e2e.glob..func84.41()
            	/var/tmp/go/src/github.com/containers/podman/test/e2e/run_networking_test.go:545 +0x157
           github.com/containers/podman/v4/test/e2e.TestLibpod(0xc000106c00)
            	/var/tmp/go/src/github.com/containers/podman/test/e2e/common_test.go:102 +0xc5
           testing.tRunner(0xc000106c00, 0x15e4f20)
           	/usr/lib/golang/src/testing/testing.go:1203 +0xe5
           created by testing.(*T).Run
           	/usr/lib/golang/src/testing/testing.go:1248 +0x2b3
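
For anyone who wants to poke at this outside the e2e harness, here is a minimal standalone sketch of the pattern the test exercises: run the container publishing the two SCTP ports and fail if podman does not exit within the suite's 90-second default timeout. The podman command line and the timeout are taken from the log above; the Go wrapper itself is illustrative and is not the actual Ginkgo test code.

    // sctp_repro.go: hypothetical standalone reproducer, not part of the test suite.
    package main

    import (
        "context"
        "fmt"
        "os"
        "os/exec"
        "time"
    )

    func main() {
        // 90s mirrors the e2e suite's default timeout seen in the stack trace.
        ctx, cancel := context.WithTimeout(context.Background(), 90*time.Second)
        defer cancel()

        // Same command line as the failing test: publish two SCTP ports.
        cmd := exec.CommandContext(ctx, "podman", "run", "--name=test",
            "-p", "80/sctp", "-p", "81/sctp", "quay.io/libpod/alpine:latest")
        out, err := cmd.CombinedOutput()
        if ctx.Err() == context.DeadlineExceeded {
            fmt.Fprintln(os.Stderr, "flake reproduced: podman did not exit within 90s")
            os.Exit(1)
        }
        if err != nil {
            fmt.Fprintf(os.Stderr, "podman failed: %v\n%s", err, out)
            os.Exit(1)
        }
        fmt.Printf("container: %s", out)
    }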

When this happens, it seems to trigger flakes in the "podman run network expose host port 80 to container port" and "podman run network expose duplicate host port results in error" tests.

I'm just going to link to the sctp failures here. In each of the logs below, the sctp failure is not alone: there are always associated expose failures. I don't know which causes which.

Podman run networking [It] podman run forward sctp protocol

@edsantiago added the flakes label on May 23, 2022
@Luap99
Member

Luap99 commented May 24, 2022

Is this only on Fedora 35? If so, it uses CNI and not netavark.

I agree that it looks like the sctp test is causing issues, but I do not understand why: the test never actually checks the connection. It looks like a deadlock on the network backend lock, which would explain why all the other tests start timing out when they have to use the network.
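
To make the hypothesis concrete: if one podman process never releases an exclusive lock, every other process that needs it simply blocks, and from the test suite's point of view that looks like unrelated tests all timing out at once. The toy program below (my own illustration, not podman's actual locking code; the /tmp/netlock-demo path is made up) shows the effect with a plain flock.

    // lockdemo.go: toy illustration of one lock holder starving all other callers.
    package main

    import (
        "fmt"
        "os"
        "syscall"
        "time"
    )

    func main() {
        f, err := os.OpenFile("/tmp/netlock-demo", os.O_CREATE|os.O_RDWR, 0o600)
        if err != nil {
            panic(err)
        }
        defer f.Close()

        if len(os.Args) > 1 && os.Args[1] == "hold" {
            // Simulates the stuck process: take the lock and never release it.
            syscall.Flock(int(f.Fd()), syscall.LOCK_EX)
            fmt.Println("holding lock forever")
            select {}
        }

        // Simulates every other test: blocks here until the holder goes away,
        // which the e2e suite would report as a 90s timeout.
        start := time.Now()
        syscall.Flock(int(f.Fd()), syscall.LOCK_EX)
        fmt.Printf("got lock after %v\n", time.Since(start))
    }

Run "go run lockdemo.go hold" in one terminal and "go run lockdemo.go" in another: the second invocation hangs until the first is killed.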

@Luap99
Member

Luap99 commented May 24, 2022

OK, so this is weird. I see many "error opening database" errors in all the logs.

@mheon Any ideas?

@mheon
Member

mheon commented May 24, 2022

I don't think it's Podman deleting the DB, so the test harness seems a likely culprit - but I'm not sure exactly why it would be doing so. The cleanup code could be responsible, but that should only run after all instructions have completed. Is there an internal Ginkgo timeout that could fire the cleanup code if the test itself takes too long, maybe?

@edsantiago
Member Author

So far it seems to be f35 only. Here's yet another one.

I'm seeing a lot of yellow SLOW TEST warnings, but I haven't actually checked other CI runs to see if those are just normal.

@Luap99
Member

Luap99 commented May 24, 2022

Failures seem to happen all at the same time, so I think one process is holding the network lock, which then causes all other processes to block and time out.

I also found this message in one of the logs:

Execing /usr/sbin/iptables -t nat -L
         
Another app is currently holding the xtables lock. Perhaps you want to use the -w option?

So maybe the iptables lock is held for some reason, which in turn keeps the network lock blocked?

Are these the only failed logs? If so, the question is: what happened on or before May 15th?
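
If the xtables lock really is the bottleneck, the error message already hints at the mitigation: iptables' -w/--wait flag makes it wait for the lock instead of failing immediately. A small probe along these lines (my own sketch, not something the suite currently runs; needs root) would help distinguish "lock briefly contended" from "lock held indefinitely":

    // xtablesprobe.go: hypothetical probe for xtables lock contention.
    package main

    import (
        "fmt"
        "os/exec"
        "time"
    )

    func main() {
        start := time.Now()
        // -w 5: wait up to 5 seconds for the xtables lock before giving up.
        out, err := exec.Command("iptables", "-w", "5", "-t", "nat", "-L", "-n").CombinedOutput()
        if err != nil {
            fmt.Printf("still failing after %v: %v\n%s", time.Since(start), err, out)
            return
        }
        fmt.Printf("listed the nat table after %v (lock was free or only briefly held)\n", time.Since(start))
    }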

@Luap99
Member

Luap99 commented May 24, 2022

Looks like there was an image update merged on May 10th. #14016

@cevich Any chance you could get the package version differences between the old and new images? The interesting ones would be kernel, iptables, and containernetworking-plugins.

@cevich
Member

cevich commented May 24, 2022

Ya, that's within the 30-day window so the old images should still be around...

@cevich
Member

cevich commented May 24, 2022

...looking back in the git history, PR #14178 merged just before mine, and logs from that job are still available. Pick one of the platforms you're interested in, and (in the Cirrus-CI WebUI) there will be a package_versions section. For example, this F35 one: https://cirrus-ci.com/task/4571388802498560?logs=package_versions#L1

@edsantiago
Member Author

Here's a failure in just the expose and hostname tests, without sctp. Again, f35.

@cevich
Member

cevich commented May 24, 2022

From the current main:

  • kernel 5.17.4-200.fc35
  • iptables-legacy 1.8.7-13.fc35, nftables 1.0.0-1.fc35
  • containernetworking-plugins 1.1.0-1.fc35

From PR #14178:

  • kernel 5.16.20-200.fc35
  • iptables-legacy 1.8.7-13.fc35, nftables 1:1.0.0-1.fc35
  • containernetworking-plugins 1.1.0-1.fc35

From the f35 Ed just posted:

  • kernel 5.17.4-200.fc35
  • iptables-legacy 1.8.7-13.fc35, nftables 1.0.0-1.fc35
  • containernetworking-plugins 1.1.0-1.fc35

@edsantiago
Member Author

This is our number one flake right now, and it's happening multiple times per PR.

@Luap99
Member

Luap99 commented May 25, 2022

Thanks @cevich. Can I just replace the image id in cirrus.yaml and run the get_ci_vm script to get the old VM image?

@cevich
Member

cevich commented May 25, 2022

Can I just replace the image id in cirrus.yaml

Sometimes 😜 It depends on whether there were script changes associated with the image update. The safest thing to do is to check out the commit that has the image you want and use the script from there. Also, if you just need to poke around in the VM and don't intend to execute ./contrib/cirrus/runner.sh, you can speed things up a bit by hitting ctrl-c when it starts to compile podman...unless of course you need a compiled podman 😃

@Luap99
Member

Luap99 commented May 25, 2022

Thanks, I think I am fine; I was able to repro locally. Hopefully #14361 fixed the issue. We will know for sure in a couple of days.

@Luap99
Member

Luap99 commented May 30, 2022

@edsantiago Is this fixed?

@edsantiago
Member Author

The last instance I see is on May 25, so I'll treat it as fixed. Thank you!

@github-actions bot added the locked - please file new issue/PR label on Sep 20, 2023
@github-actions bot locked as resolved and limited conversation to collaborators on Sep 20, 2023