[DO NOT MERGE] fix aarch64 cgroup flake #15179
/hold |
not sure if this will fix it but this is definitely the culprit |
different error... promising! @cevich any idea if cpuset is enabled on aarch? |
The fact that this test is a flake, meaning, it passes sometimes, suggests to me that, whatever "cpuset" is, is enabled or at least enabled-enough to work sometimes. |
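For anyone wanting to check this directly on one of the VMs: on a cgroup v2 host the kernel lists the available controllers in `cgroup.controllers` and the ones delegated to child cgroups in `cgroup.subtree_control`. A minimal sketch, assuming cgroup v2 and the default `/sys/fs/cgroup` mount point; the helper names are hypothetical and not part of podman:

```python
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")  # assumption: default cgroup v2 mount point


def parse_controllers(text: str) -> set[str]:
    """Parse the space-separated controller list used by cgroup v2 files."""
    return set(text.split())


def controller_enabled(name: str, cgroup_dir: Path = CGROUP_ROOT) -> dict[str, bool]:
    """Report whether a controller is listed in each cgroup v2 control file."""
    result = {}
    for filename in ("cgroup.controllers", "cgroup.subtree_control"):
        path = cgroup_dir / filename
        try:
            result[filename] = name in parse_controllers(path.read_text())
        except OSError:
            # Missing file (e.g. a cgroup v1 host) or a permission error
            result[filename] = False
    return result


if __name__ == "__main__":
    print(controller_enabled("cpuset"))
```

A controller can be available at the root yet not delegated further down the tree, so running the same check against the cgroup directory the test actually uses may be more telling than checking the root alone.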
good point. I think I figured out the issue with Chris; I just need to patch up a bunch of small holes all over the code that break under parallel usage |
Oh goodie, so it was parallel podman runs and test toe-stepping causing this? I too am mostly cgroup-ignorant, but this wouldn't be the first time I've seen a shared resource cause a race. It's good CI caught this before a user found it 😄 |
@cevich I am sure this will keep failing and I'll have to keep squashing bugs of the same sort throughout the code... but getting there! |
I feel really stupid asking, but, where is the parallel coming from? System tests are single-threaded. |
@edsantiago honestly not sure, @cevich just said that there is a scenario in which parallel happens with these tests |
Oh maybe this is my bad. I thought it was the e2e tests where this was flaking. Those run 3-wide, but Ed's right, the system-tests run one by one. |
The fact that these patches still have positive impacts makes me think something weird is happening with the test env. It seems like there is some cleanup process deleting everything in /sys/fs/cgroup before it can be used. |
During the build process we shut down and disable a bunch of systemd services; it's possible a new one was added and/or we missed one: https://github.com/containers/automation_images/blob/main/systemd_banish.sh#L14 |
Hit a wall here. The permission denied error makes no sense. @giuseppe, any ideas why this would be happening?
|
@cdoern FWIW, we do capture the Though it's possible for them to be suppressed by the policy, it's not normally the case. So I have a bit less faith this is caused by SELinux related issues, but it's still worth trying since we need to wait for Giuseppe anyway. The image build is nearly done... |
@cevich these system tests have been running for a few hours... |
I never got a reply on IRC from @giuseppe on this. Summary of where we're at on the aarch64 system test flakes:
Summary of changes in aaef571:
So in summary, I'm not super-duper confident in either the SELinux or package updates as "Fixes" for this failure. I can't explain the lack of audit.log entries, but also don't have any idea what the |
(I just restarted the aarch64 build task, it was holding up the system-test task) |
Hrmmm, it seems the aarch64 build task is hanging on re-run. https://cirrus-ci.com/task/5667829994225664 |
...ya, damn, my change broke it. Status shows:
Ugh. |
Oof, I don't know how to fix the failure of the service to start; it's like the policy won't let this service be unconfined. The unit-file modification is correct AFAIK:
### File generated during VM Image build by fedora_base-setup.sh
[Unit]
Description=Execute cloud user/final scripts
After=network-online.target cloud-config.service rc-local.service
Wants=network-online.target cloud-config.service
[Service]
Type=oneshot
SELinuxContext=unconfined_u:unconfined_r:unconfined_t:s0
ExecStart=/usr/bin/cloud-init modules --mode=final
RemainAfterExit=yes
TimeoutSec=0
KillMode=process
TasksMax=infinity
# Output needs to appear in instance console output
StandardOutput=journal+console
[Install]
WantedBy=cloud-init.target
Anyway... maybe this isn't the cause of the test flake. Ed's point about "Why isn't this causing other SELinux tests to fail" I still cannot explain. |
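One cheap sanity check is to parse the unit text and read the `[Service]` keys back, to confirm the override landed where expected. A minimal sketch using Python's `configparser`; this is an assumption-laden shortcut, since `configparser` handles simple unit files like this one but is not a full systemd parser, and `read_unit` is a hypothetical helper. It only confirms what the file says, not whether the SELinux policy permits the transition.

```python
import configparser


def read_unit(text: str) -> configparser.ConfigParser:
    """Parse simple systemd unit text (not a complete systemd parser)."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep key case; systemd keys are case-sensitive
    parser.read_string(text)
    return parser


# Excerpt of the unit file quoted above
UNIT = """\
[Unit]
Description=Execute cloud user/final scripts

[Service]
Type=oneshot
SELinuxContext=unconfined_u:unconfined_r:unconfined_t:s0
ExecStart=/usr/bin/cloud-init modules --mode=final
"""

unit = read_unit(UNIT)
print(unit["Service"].get("SELinuxContext"))
```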
I played around on these VMs some more, and it seems if I disable SELinux, I can |
Whew! I see one of the EC2 VMs successfully booted and ran tests...so we're on a better track this time hopefully 😅 |
@cdoern is this the flake you're chasing?:
? |
Yes, #15074 |
Okay, so then for sure the SELinux fix had no impact, which isn't all that shocking. So is the Why the heck do we only see this on EC2 and/or aarch64 VMs 😖 We have an x86_64 image in EC2, built substantially the same way. Is it worth rigging up the task to run with that to try and rule out the architecture vs EC2 as the contributing factor? |
Re: "Problem does not reproduce under This is even more perplexing now with the SELinux execution context ruled out. About the only difference now is executing within an interactive login session ( Random idea (long-shot): Maybe we should run that |
Does anybody know if this problem reproduces on This would be an interesting data-point, as those tests all run through ssh (so they have a session) just like |
Is the important part of your question the remote part? If so, this test is N/A on remote, so it's skipped. If the important part is something else, can you please highlight it? |
Yeah the remote part. Damn. Right I vaguely remember now (in the e2e tests), we do skip these don't we. It's really bugging me that this only seems to reproduce under Cirrus-CI. I believe there are very few differences (besides a bunch of Hmmmm. I'm about out of ideas, including the hare-brained ones 😩 |
This may be significant: Poking around under |
A friendly reminder that this PR had no activity for 30 days. |
@cdoern What's up with this one? |
A friendly reminder that this PR had no activity for 30 days. |
Flake is #15367, the everything-hosed one. I'm no longer restarting that one, because when I do, nobody sees it, and nobody sees how often it's triggering, so nobody is incentivized to fix it. We really, really need that one fixed. |
No action since prior to September. Time to close this? |
@cdoern could you rebase and repush so we can see how the latest CI images behave? |
Sure @edsantiago will do that tonight |
Signed-off-by: Charlie Doern <[email protected]>
Thanks. That answers that:
The bug is still present. |
I am going to close the PR to get it off @cdoern's back. |
test aarch64 pod cgroups with main
Signed-off-by: Charlie Doern [email protected]