checkpoint tests time out under $CONTAINER #15015

Closed
edsantiago opened this issue Jul 21, 2022 · 9 comments · Fixed by #19449
Labels
kind/bug: Categorizes issue or PR as related to a bug.
locked - please file new issue/PR: Assist humans wanting to comment on an old issue or PR with locked comments.

Comments

@edsantiago
Member

All checkpoint-related tests are failing in the containerized environment in CI (note: that is not a colorized/hyperlinked log; it is impossible to read without my greasemonkey extension).

Command timed out after 90s.   (basically, every test that uses podman checkpoint/restore)
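For context, a minimal sketch of the kind of checkpoint/restore sequence these tests exercise; the container name and image below are illustrative, not taken from the test code:

```sh
# Start a long-running container, checkpoint it, then restore it.
podman run -d --name ckpt-demo quay.io/libpod/alpine:latest top
podman container checkpoint ckpt-demo   # the step that hangs past the 90s timeout in containerized CI
podman container restore ckpt-demo
```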
edsantiago added the kind/bug label on Jul 21, 2022
edsantiago added a commit to edsantiago/libpod that referenced this issue Jul 22, 2022
...and enable the at-test-time confirmation, the one that
double-checks that if CI requests runc we actually use runc.
This exposed a nasty surprise in our setup: there are steps to
define $OCI_RUNTIME, but that's actually a total fakeout!
OCI_RUNTIME is used only in e2e tests; it has no effect
whatsoever on podman itself as invoked via the command
line, such as in system tests. Solution: use containers.conf
(sketched below, after this commit message).

Given how fragile all this runtime stuff is, I've also added
new tests (e2e and system) that will check $CI_DESIRED_RUNTIME.

Image source: containers/automation_images#146

Since we haven't actually been testing with runc, we need
to fix a few tests:

  - handle an error-message change (make it work in both crun and runc)
  - skip one system test, "survive service stop", that doesn't
    work with runc and that I don't think we care about.

...and skip a bunch, filing issues for each:

  - containers#15013 pod create --share-parent
  - containers#15014 timeout in dd
  - containers#15015 checkpoint tests time out under $CONTAINER
  - containers#15017 networking timeout with registry
  - containers#15018 restore --pod gripes about missing --pod
  - containers#15025 run --uidmap broken
  - containers#15027 pod inspect cgrouppath broken
  - ...and a bunch more ("podman pause") that probably don't
    even merit filing an issue.

Also, use /dev/urandom in one test (was: /dev/random) because
the test is timing out and /dev/urandom does not block. (But
the test is still timing out anyway, even with this change)

Also, as part of the VM switch we are now using go 1.18 (up
from 1.17) and this broke the gitlab tests. Thanks to @Luap99
for a quick fix.

Also, a slight tweak to containers#15021: include the timeout value, and
reword the message so the command string is at the end.

Also, fixed a misspelling in a test name.

Fixes: containers#14833

Signed-off-by: Ed Santiago <[email protected]>
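On the containers.conf point in the commit message above: unlike $OCI_RUNTIME, which only the e2e suite reads, a runtime set in containers.conf applies to every podman invocation, including those made by the system tests. A minimal sketch, assuming a scratch config selected via $CONTAINERS_CONF (the actual CI setup may write the system-wide /etc/containers/containers.conf instead):

```sh
# Point podman at a scratch containers.conf that pins the OCI runtime.
cat > /tmp/ci-containers.conf <<'EOF'
[engine]
runtime = "runc"
EOF
export CONTAINERS_CONF=/tmp/ci-containers.conf

# Every podman command now uses runc, not just the e2e tests.
podman info --format '{{.Host.OCIRuntime.Name}}'   # expected output: runc
```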
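On the /dev/urandom change above: reads from /dev/random can block when the kernel entropy pool is not ready, while /dev/urandom always returns data, so a dd-based test cannot stall on the read itself. An illustration only; the size and path are placeholders, not taken from the actual test:

```sh
# Non-blocking source of random bytes; /dev/random could block here.
dd if=/dev/urandom of=/tmp/rand.bin bs=1M count=1
```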
@vrothberg
Member

@edsantiago did the tests pass at some point before?

@edsantiago
Member Author

Yes, they all worked fine prior to #14972. That is the recent PR that did the VM switcheroo in CI. Here's an example of a PR that ran before 14972. Search in-page for " checkpoint" (space-checkpoint, to eliminate other checkpoint strings from podman info).

14972 is a huge monster, so it's impossible to know what changed, but the likely culprit is something different in the f36 image. Maybe criu, maybe the kernel; I really can't begin to guess.

@vrothberg
Member

Thanks, @edsantiago !

@rst0git
Contributor

rst0git commented Aug 3, 2022

@edsantiago Is this problem with checkpoint/restore tests still present?

@edsantiago
Member Author

@rst0git I assume so. The tests are completely disabled, so there's no way to find out except to reenable them. Since the VM images are unchanged, I don't think that would give us any information we don't already have.

@github-actions

github-actions bot commented Sep 3, 2022

A friendly reminder that this issue had no activity for 30 days.

@github-actions

github-actions bot commented Oct 6, 2022

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Jul 29, 2023

@edsantiago should we reenable the test, to see if a miracle happened?

@edsantiago
Member Author

I hate relying on miracles... but both f37 and f38 had a successful CI run in my hammer-sqlite PR, with questionable but valid timings (42:18 and 37:13 respectively, compared to 37:23 / 33:03 on a PR in main). An extra five minutes seems a little concerning, but I guess checkpointing is expensive?

I verified that the tests ran by grepping for 'checkpoint' in the summary lines and eyeballing the results, and by comparing the "N tests skipped" count in the bottom summary against another PR without the skips removed. About 35 tests that were being skipped no longer are, which is consistent with grep -wc It test/e2e/checkpoint_{,image_}test.go.
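For reference, the brace expansion in that last command spells out to the two checkpoint e2e files; counting whole-word "It" occurrences approximates the number of Ginkgo test cases:

```sh
# Same check as above, with the brace expansion written out.
grep -wc It test/e2e/checkpoint_test.go test/e2e/checkpoint_image_test.go
```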

Real PR now in the works.

edsantiago added a commit to edsantiago/libpod that referenced this issue Jul 31, 2023
And lo, a miracle occurred. Containerized checkpoint tests are
no longer hanging. Reenable them.

(Followup miracle: tests are still passing, after a year of not
running!)

Closes: containers#15015

Signed-off-by: Ed Santiago <[email protected]>
github-actions bot added the locked - please file new issue/PR label on Oct 30, 2023
github-actions bot locked as resolved and limited conversation to collaborators on Oct 30, 2023