
[main] Cirrus: Update VM Images for 4.0 release #13055

Merged
merged 2 commits into containers:main from new_python_images
Feb 21, 2022

Conversation

cevich
Member

@cevich cevich commented Jan 27, 2022

This updates the VM images used by CI to ensure they contain the
intended dependency versions to support the podman 4.0 release.

Ref: containers/automation_images#114

@cevich cevich marked this pull request as draft January 27, 2022 21:45
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 27, 2022
@cevich cevich changed the title Cirrus: Use updated VM images [WIP] Cirrus: Use updated VM images Jan 27, 2022
@cevich cevich force-pushed the new_python_images branch 6 times, most recently from 6b31cf0 to f7e2eee on February 8, 2022 16:04
@cevich
Member Author

cevich commented Feb 8, 2022

I'm trying to update our VM images, but for some reason the checkpoint tests (among others) aren't happy, and only on F34. @adrianreber any idea what's breaking in the checkpoint tests? Do we need package updates in F34 maybe?

@cevich cevich force-pushed the new_python_images branch from f7e2eee to 7b50354 on February 8, 2022 19:20
@cevich
Member Author

cevich commented Feb 8, 2022

@adrianreber un-ping. I think the problems may have been caused by the premature/unintentional introduction of the netavark & aardvark-dns packages on F35 for all tests.

@cevich
Member Author

cevich commented Feb 10, 2022

@adrianreber re-ping 😢 F34 criu is breaking with the latest updates. @edsantiago IIRC you touched these tests recently (disabled/enabled them), no? Do any of the errors in those logs ring any bells?

Since this is only affecting F34, I'm thinking it must be due to some necessary package that's available in F35 (possibly from updates-testing) but not in F34.
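
(For reference, a minimal sketch of how such a package-set comparison might be done; the file paths are illustrative, not the actual CI tooling:)

    # On each VM flavor, dump the installed package set:
    rpm -qa --qf '%{NAME}-%{VERSION}-%{RELEASE}\n' | sort > /tmp/f34-pkgs.txt   # run on the F34 VM
    rpm -qa --qf '%{NAME}-%{VERSION}-%{RELEASE}\n' | sort > /tmp/f35-pkgs.txt   # run on the F35 VM
    # Then diff the two lists, filtering for the suspect packages:
    diff /tmp/f34-pkgs.txt /tmp/f35-pkgs.txt | grep -E 'criu|crun|runc|netavark|aardvark'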

@edsantiago
Member

Hm. The recent criu errors were due to rawhide kernel new-stuff. The errors I see here are all:

Error: cannot define the pod as the cgroup parent at the same time as joining the infra container's cgroupNS: invalid argument

...and they're happening in container create, before even thinking of checkpoint. This is not a criu problem.

Link to annotated log, which is infinitely better than the cirrus page.

@cevich
Member Author

cevich commented Feb 10, 2022

The recent criu errors were due to rawhide kernel new-stuff

Oh okay, so this is something different then.

before even thinking of checkpoint

There are some instances which do fail after criu:

Error: pod XYZ does not share the <something> namespace

But that sounds like it could be related to the error you pointed out. The packages on both new VM flavors are basically the same: crun-1.4.2-1 and criu-3.16.1-2. Hmmm 😕

Edit: F35 annotated log ref.

@edsantiago
Member

The packages on both new VM flavors are basically the same: crun-1.4.2-1 and criu-3.16.1-2.

This is f34, so runc not crun, right? I see one runc error in the log, and we've been having runc-related failures in RHEL recently, so I think there's something in runc that broke recently. Can you compare if this runc == previous runc?

@cevich
Member Author

cevich commented Feb 10, 2022

This is f34, so runc not crun, right?

Oh right, I completely forgot, thanks!

I see one runc error in the log, and we've been having runc-related failures in RHEL recently, so I think there's something in runc that broke recently. Can you compare if this runc == previous runc?

Grabbing a run from main ("old" VM images), I see runc-1.0.2-2.fc34.x86_64. The new F34 image has runc-1.1.0-1.fc34.x86_64.
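
(A sketch of the version check being described, assuming shell access to each VM; these are not the actual CI commands:)

    # Query the installed runc NVR on each image:
    rpm -q runc
    # old image: runc-1.0.2-2.fc34.x86_64
    # new image: runc-1.1.0-1.fc34.x86_64
    # The package changelog can hint at what moved between 1.0.2 and 1.1.0:
    rpm -q --changelog runc | head -n 20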

@edsantiago
Member

Well, that's really interesting because some of the discussion on the RHEL 8.6 gating-test failures is zeroing in on the move from runc 1.0 to 1.1.

Are you able to rebuild the VM using runc 1.0, and just sweep 1.1 under the carpet, la la la?

@adrianreber
Collaborator

The broken checkpoint/restore tests all seem to be about checkpointing out of one pod and restoring into another. That seems broken with runc 1.1.

There is also something wrong with setting up a pod that shares a cgroup namespace.

         # podman [options] --network-backend cni --storage-driver vfs pod create --share cgroup,ipc,net,uts,pid
         Error: cannot define the pod as the cgroup parent at the same time as joining the infra container's cgroupNS: invalid argument

Seems like this triggers a couple of different errors.
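
(A minimal reproduction sketch based on the failing command above, with the CI-specific options elided; the workaround shown is an assumption for illustration, not a fix:)

    # Fails on the new images (runc 1.1):
    podman pod create --share cgroup,ipc,net,uts,pid
    # Error: cannot define the pod as the cgroup parent at the same time as
    # joining the infra container's cgroupNS: invalid argument

    # Dropping cgroup from --share avoids joining the infra container's
    # cgroup namespace, which sidesteps the conflict:
    podman pod create --share ipc,net,uts,pid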

@adrianreber
Collaborator

Overall none of these checkpoint/restore errors seems to be a CRIU problem, but all of them seem to be triggered by changes in Podman and runc as far as I can tell. Strange that CI did not catch it earlier.

@adrianreber
Collaborator

The cgroup error is related to #12930. But CI was using a runc version which did not support certain CRIU features and so the tests which break here were never run. Similar for the network namespace related errors.

#13214 has fixes for the checkpoint/restore related errors in this PR.

@cevich
Member Author

cevich commented Feb 11, 2022

Thanks for all the analysis and PR-work, @adrianreber; I sincerely appreciate it. I can't tell from your comments, but hopefully these efforts will help ensure stability downstream for runc users.

@cevich
Member Author

cevich commented Feb 11, 2022

Rebased on #13214 and force-pushed.


@cevich
Member Author

cevich commented Feb 11, 2022

Are you able to rebuild the VM using runc 1.0, and just sweep 1.1 under the carpet, la la la?

Yes, of course, @edsantiago, if that's what's needed, but I'm not sure I'd call it a solution: we'd not be testing in an environment representative of what users are actually running. The "proper" way to handle it, for both tests and users, is to (somehow) roll back the version released in Fedora (which I believe takes weeks 😖). Of course, I am assuming that the issue also affects the F34-stock podman 3.4, which may or may not be true.
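
(If the sweep-under-the-carpet route were taken, a hypothetical image-build sketch for pinning runc 1.0; the exact NVR and the use of the dnf versionlock plugin are assumptions, not what the CI images actually do:)

    # During VM image construction, downgrade and hold runc at 1.0.x:
    sudo dnf install -y 'dnf-command(versionlock)'
    sudo dnf downgrade -y runc-1.0.2-2.fc34   # assumed NVR, per the comparison above
    sudo dnf versionlock add runc             # lock the currently installed version
    rpm -q runc                               # expect runc-1.0.2-2.fc34.x86_64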

@cevich cevich force-pushed the new_python_images branch 2 times, most recently from 3b81078 to 2590deb on February 16, 2022 16:03
@cevich
Member Author

cevich commented Feb 16, 2022

Rebased/force-pushed.

@cevich cevich changed the title [WIP] Cirrus: Use updated VM images Cirrus: Use updated VM images Feb 16, 2022
@cevich cevich changed the title Cirrus: Use updated VM images Cirrus: Update VM Images for 4.0 release Feb 16, 2022
@cevich cevich changed the title Cirrus: Update VM Images for 4.0 release [main] Cirrus: Update VM Images for 4.0 release Feb 16, 2022
Mainly this is to confirm that some changes needed for the podman-py CI
setup don't disrupt operations here. Ref:

containers/automation_images#111

Also includes a minor fix WRT setting up for the test-rpm build.

Signed-off-by: Chris Evich <[email protected]>
Podman 4.0 will never be supported in F34, and the use of F35 in CI is
temporary until F36 is brought up to speed.  Rather than fight with
testing issues that will never be fixed/supported, simply disable it.
This commit may be reverted at a future date when F36 VM support is
added.

Signed-off-by: Chris Evich <[email protected]>
@cevich cevich marked this pull request as ready for review February 17, 2022 21:39
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 17, 2022
@cevich
Member Author

cevich commented Feb 21, 2022

This should be ready to go (the v4.0 PR with the same images has already merged).

@edsantiago
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 21, 2022
@rhatdan
Member

rhatdan commented Feb 21, 2022

/approve

@openshift-ci
Contributor

openshift-ci bot commented Feb 21, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cevich, rhatdan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 21, 2022
@openshift-merge-robot openshift-merge-robot merged commit 62ff040 into containers:main Feb 21, 2022
@edsantiago
Member

New task map, for completeness's sake: cirrus-map-pr13055

@cevich cevich deleted the new_python_images branch April 18, 2023 14:46
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Aug 30, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 30, 2023