
Cirrus: Bump Fedora to release 35 #11795

Merged: 8 commits from update_to_f35 into containers:main on Nov 18, 2021

Conversation

@cevich (Member) commented Sep 29, 2021

What this PR does / why we need it:

Update the "Fedora" VM used in testing to version 35 (beta), and the "Prior Fedora" version to 34. Stop testing with Fedora 33. Also resolve a TODO in setup - the needed packages are now bundled into the images.

NOTE: The Fedora 35 image is now based on a BTRFS root filesystem.

How to verify it

CI will pass

Which issue(s) this PR fixes:

None

Special notes for your reviewer:

Depends on: #11955, #11979, #12110, #12121 (or #12120), #12060, #12162, #12342, #12343, and containers/automation_images#93

cevich force-pushed the update_to_f35 branch 2 times, most recently from 24d631a to 8d852c0 on September 29, 2021 20:52
cevich marked this pull request as draft September 30, 2021 13:46
openshift-ci bot added the do-not-merge/work-in-progress label Sep 30, 2021
cevich changed the title from "Cirrus: Bump Fedora to release 35 (beta)" to "WIP: Cirrus: Bump Fedora to release 35 (beta)" Sep 30, 2021
containers deleted a comment from rhatdan Sep 30, 2021
@cevich (Member Author) commented Sep 30, 2021

Sorry, I forgot to mark this as WIP/Draft. There are some major challenges that need solving; namely, the kernel shouldn't panic while compiling podman 😢

cevich force-pushed the update_to_f35 branch 3 times, most recently from 1887c4d to 4ed90cb on October 1, 2021 14:46
@cevich (Member Author) commented Oct 1, 2021

@lsm5 I'm getting test-rpm build errors on the new F35 VM. Do you know how to fix these?

@cevich (Member Author) commented Oct 4, 2021

For the int podman fedora-35 root and rootless tasks, the tests are hanging. Both appear to be getting stuck on a checkpoint test (log).

Edit: My mistake, not "rootless" but "container" integration.

@cevich (Member Author) commented Oct 4, 2021

@adrianreber ping - any insights on the checkpoint hang?

@adrianreber (Collaborator)

@adrianreber ping - any insights on the checkpoint hang?

Rootless tests should not run any checkpoint tests at all. The root test hangs while trying to establish a TCP connection to the container. I can run it locally on f35 tomorrow to see if I can reproduce it.

Not sure why it should hang with f35 right now:

                // Open a network connection to the redis server via initial port mapping
                conn, err := net.Dial("tcp", "localhost:1234")
                if err != nil {
                        os.Exit(1)
                }
                conn.Close()

Most checkpoint tests seem to be running. Just as an FYI, if all checkpoint tests start to fail: CRIU needs something like chattr +C /var/lib/containers to work with BTRFS.

@cevich (Member Author) commented Oct 4, 2021

CRIU needs something like chattr +C /var/lib/containers to work with BTRFS.

BTRFS will be the stock/default for F35 VMs, so should we preemptively change that at the packaging level? Or is that just a caution / might happen warning?

@lsm5 (Member) commented Oct 4, 2021

@lsm5 I'm getting test-rpm build errors on the new F35 VM. Do you know how to fix these?

Disable debuginfo in the specfile: set %global debug_package %{nil}. I thought this was already disabled in a prior PR from @Luap99, but maybe not?

@adrianreber (Collaborator)

CRIU needs something like chattr +C /var/lib/containers to work with BTRFS.

BTRFS will be the stock/default for F35 VMs, so should we preemptively change that at the packaging level? Or is that just a caution / might happen warning?

Just that it might happen. If it works without the chattr, you can leave it as it is.

@Luap99 (Member) commented Oct 4, 2021

disable debuginfo in specfile, set %global debug_package %{nil} . I thought this was disabled already in a prior PR from @Luap99 . But maybe not?

My PR is not merged yet (#11565).

@cevich (Member Author) commented Oct 4, 2021

Okay, great. Thanks @lsm5 and @Luap99. No rush merging that; there are plenty of other issues to address in this PR in the meantime.

@cevich (Member Author) commented Oct 6, 2021

I can run it locally on f35 tomorrow to see if I can reproduce it.

@adrianreber were you able to reproduce the hang at all?

@adrianreber (Collaborator)

No, unfortunately not. It just works in my VM.

@cevich (Member Author) commented Oct 6, 2021

Darn.

@cevich (Member Author) commented Oct 11, 2021

New images: I see a problem on F34, where the / disk space is not being expanded as it should be. Oddly, it appears that the F35 BTRFS / is being expanded. Both are experiencing test failures/hangs 😢

@cevich (Member Author) commented Oct 12, 2021

Second time in a row, the podman checkpoint and restore container with different port mappings test hangs on F35. I'm going to try to instrument the code to see where the hang is happening.

cevich force-pushed the update_to_f35 branch 2 times, most recently from cd39642 to 9744f90 on October 12, 2021 18:09
@cevich (Member Author) commented Oct 12, 2021

@adrianreber I instrumented the test with a bunch of fmt.Printf() and ran it manually. It seems to be hanging on the conn, err := net.Dial("tcp", "localhost:1234"). Should that be wrapped in some kind of timeout, and shouldn't it be connecting to port 6379?

@adrianreber (Collaborator)

@adrianreber I instrumented the test with a bunch of fmt.Printf() and ran it manually. It seems to be hanging on the conn, err := net.Dial("tcp", "localhost:1234"). Should that be wrapped in some kind of timeout, and shouldn't it be connecting to port 6379?

The timeout would be a good idea. Port 1234 is correct, because that test exercises different port mappings after restore. I am confused, however: I do not see the broken test being run, yet there is still a timeout, and I am not sure why.

Can you also try a change like this (in addition to your timeout change):

                conn, err := net.Dial("tcp", "localhost:1234")
-               if err != nil {
-                       os.Exit(1)
-               }
+               Expect(err).To(BeNil())

The Exit(1) is probably a bad idea anyway. I also had a look at the audit log but it does not seem to be a SELinux denial.
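
For reference, a minimal sketch of what the combined change under discussion might look like: a bounded dial instead of a plain net.Dial, plus a Gomega assertion instead of the os.Exit(1). It assumes the surrounding e2e test file already imports net and time and dot-imports Gomega; the 30-second value is only an illustrative choice, not something decided in this PR.

                // Open a network connection to the redis server via the
                // remapped port, but give up after a bounded time instead
                // of hanging the entire test suite.
                conn, err := net.DialTimeout("tcp", "localhost:1234", 30*time.Second)
                Expect(err).To(BeNil())
                conn.Close()

Failing the assertion would surface the hang as an ordinary test failure instead of a silent exit.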

@edsantiago (Member)

They're flakes. One of the common ones. Every so often registry.redhat.io goes into a mode where it demands a login even for search. It's usually fixed in a few hours.

@cevich (Member Author) commented Nov 18, 2021

They're flakes. One of the common ones. Every so often registry.redhat.io goes into a mode where it demands a login even for search. It's usually fixed in a few hours.

Yeah I've seen that before too, and guessed it's what we were seeing. Thanks for confirming.

Hey, on that topic: what's your opinion on me adding a check to the Ext. services task (which runs very early) to verify this? It would save people from wasting time re-running due to a known issue we literally can't do anything about (AFAIK). I'm on the fence about it vs. getting "some" useful testing completed.

Commits (messages as authored):

Signed-off-by: Chris Evich <[email protected]>
These tasks run earlier on, so it's useful to have more detail about the
test VM (in general) in case something goes terribly wrong.

Signed-off-by: Chris Evich <[email protected]>
During initial testing of Fedora 35beta VM images in CI, the bindings
task was timing out.  In order to allow time for collection of system
details (logs), execution needs to timeout earlier than the task.
Under normal conditions, the bindings test finishes in about 10 minutes.
Use the ginkgo timeout option to limit execution, so it times out after
30 minutes.

Also add the `-progress` option so the output more closely resembles how
ginkgo runs the integration tests.

Signed-off-by: Chris Evich <[email protected]>
Massive thanks to @edsantiago for tracking this down.

Ref: containers#12175

Signed-off-by: Chris Evich <[email protected]>
In F35 the hard-coded default (from
containers-common-1-32.fc35.noarch) is 'journald' despite
the upstream repository having this line commented-out.
Containerized integration tests cannot run with 'journald'
as there is no daemon/process there to receive them.

Signed-off-by: Chris Evich <[email protected]>
This reverts commit f35d7f4.

Signed-off-by: Chris Evich <[email protected]>
VM Images created as of this commit contain the new/required version.
Remove the `--force` install, but retain the hack script's ability to
support this in the future.

Signed-off-by: Chris Evich <[email protected]>
The Fedora 35 cloud images have switched to UEFI boot with a GPT
partition. Formerly, all Fedora images included support for runtime
re-partitioning. However, the requirement to test alternate storage
has since been dropped/removed. Rather than maintain a disused feature
and its supporting scripts, these Fedora VM images have reverted
to the default: automatically resize to 100% on boot.

Signed-off-by: Chris Evich <[email protected]>
@edsantiago (Member)

Well, it doesn't fit into the host/port model, so some hackery would be needed to try "podman search", which gets us into chicken/egg territory.

Thinking on it some more, though, I rarely see all int tests failing because of this. It's more typical to see just one or two. I don't know what that means about the nature of the flake, but from the perspective of CI, I think an early check would not be accurate enough: it could easily false-negative as well as false-positive.

@cevich (Member Author) commented Nov 18, 2021

Y'okay! Rebased/force-pushed "once last time" (note: this may not be the real "last time") 😁

it could easily false-negative as well as false-positive.

Yeah...that sounds anti-simple to deal with, esp. in a "Required by everything" check.

@vrothberg (Member)

Fingers crossed 🤞 Thanks for your patience and your long breath pushing this forward, @cevich!

@cevich (Member Author) commented Nov 18, 2021

Thanks for your patience and your long breath pushing this forward

I absolutely could not have done this without support from...basically everybody. So thank YOU!

I pray we never have to upgrade any OS, ever again...or at least for another 6 months 😭

Seriously though, I've been wondering recently if all these fresh new gray hairs are worth it: concentrating all the pain into one PR for the stability of all/most other PRs. Or maybe we should just run with Rawhide and continuous runtime updates to spread the instability "love" to every PR - but with rapid "fixes" available (in theory).

Seems to me, no matter how we do CI...everyone still always hates it 😆

@cevich (Member Author) commented Nov 18, 2021

Ugh, two Ubuntu failures. Naively assuming they're flakes and re-running.

@edsantiago (Member)

/lgtm
/hold

Quick, before CI changes its mind! (Tests are green as of the moment I'm writing this. I would not be in the least bit surprised if they turn red somehow in the next few minutes, just out of orneriness).

openshift-ci bot added the do-not-merge/hold label Nov 18, 2021
openshift-ci bot added the lgtm label Nov 18, 2021
@rhatdan (Member) commented Nov 18, 2021

/approve
/hold cancel

openshift-ci bot removed the do-not-merge/hold label Nov 18, 2021
@openshift-ci bot (Contributor) commented Nov 18, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cevich, rhatdan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label Nov 18, 2021
openshift-merge-robot merged commit de39241 into containers:main Nov 18, 2021
@cevich (Member Author) commented Nov 19, 2021

just out of orneriness

They are imbued with the pure essential nature of "ornery" and will require every magic spell to be set right 😞

cevich deleted the update_to_f35 branch April 18, 2023 14:47
github-actions bot added the locked - please file new issue/PR label Aug 30, 2023
github-actions bot locked as resolved and limited conversation to collaborators Aug 30, 2023