CI-Flake: bud-multiple-platform-no-run fails: error changing to intended-new-root directory #3710
Comments
@edsantiago, Mr. Holmes, do you have data points that would suggest when this flake started? |
Oh wow, this has been around for quite a while. I thought it was a new thing. Thanks for the data Ed. |
Just now in Integration fedora-34 w/ overlay in #3761 |
Last night on 'main' in cirrus-cron - also F34 + overlay. |
Seeing a persistent flake in the podman build-under-bud tests, same test, different error message:
[bud] 248 bud-multiple-platform-no-run
|
This continues to flake fairly regularly in daily branch runs. Latest example. |
I think there is a race here:
diff --git a/imagebuildah/build.go b/imagebuildah/build.go
index 77d8b6d5..a71e58aa 100644
--- a/imagebuildah/build.go
+++ b/imagebuildah/build.go
@@ -243,12 +243,12 @@ func BuildDockerfiles(ctx context.Context, store storage.Store, options define.B
logPrefix = "[" + platforms.Format(platformSpec) + "] "
}
builds.Go(func() error {
+ instancesLock.Lock()
thisID, thisRef, err := buildDockerfilesOnce(ctx, store, logger, logPrefix, platformOptions, paths, files)
if err != nil {
return err
}
id, ref = thisID, thisRef
- instancesLock.Lock()
instances = append(instances, instance{
ID: thisID,
Platform: platformSpec, |
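As a side note, the shared-state pattern the diff above is concerned with can be reduced to a small standalone Go sketch. The names below (buildResult, buildOne) are hypothetical stand-ins, not buildah's actual code; only the errgroup-plus-mutex shape under discussion is mirrored:

```go
package main

import (
	"fmt"
	"sync"

	"golang.org/x/sync/errgroup"
)

type buildResult struct {
	ID       string
	Platform string
}

// buildOne is a stand-in for buildDockerfilesOnce: pretend we built an image.
func buildOne(platform string) (buildResult, error) {
	return buildResult{ID: "sha256:" + platform, Platform: platform}, nil
}

func main() {
	var (
		mu      sync.Mutex    // plays the role of instancesLock
		results []buildResult // plays the role of instances
		group   errgroup.Group
	)
	for _, p := range []string{"linux/amd64", "linux/arm64", "linux/s390x"} {
		p := p // capture the loop variable for the goroutine
		group.Go(func() error {
			res, err := buildOne(p)
			if err != nil {
				return err
			}
			mu.Lock() // without the lock, concurrent appends race on the shared slice
			results = append(results, res)
			mu.Unlock()
			return nil
		})
	}
	if err := group.Wait(); err != nil {
		fmt.Println("build failed:", err)
		return
	}
	fmt.Println(results)
}
```

Whether the lock is taken before or after the per-platform build (the diff above proposes taking it before) changes how much of the work ends up serialized, which is part of what is being debated here.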
I'm pretty sure the race here happens when we need to cancel a build because one stage failed, but others are still running. The |
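The cancellation scenario described above can also be sketched in isolation. This is an assumed, minimal example (not buildah code): errgroup.WithContext cancels the shared context as soon as one goroutine fails, while sibling goroutines are still in flight and have to race their cleanup against that cancellation:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"golang.org/x/sync/errgroup"
)

func main() {
	group, ctx := errgroup.WithContext(context.Background())

	// One "stage" fails quickly...
	group.Go(func() error {
		return errors.New("stage failed")
	})

	// ...while another is still running and observes the cancellation via ctx.
	group.Go(func() error {
		select {
		case <-time.After(5 * time.Second):
			return nil // normal completion; won't happen in this sketch
		case <-ctx.Done():
			// Cleanup that runs here can race against whatever the failing
			// goroutine (or the caller) is tearing down concurrently.
			return ctx.Err()
		}
	})

	// Wait returns the first non-nil error: "stage failed".
	fmt.Println("result:", group.Wait())
}
```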
@nalind I'm not sure about that, but for me it happens even when that isn't the case. I also hope the issue is at the buildah layer; then it would be much easier to get a permanent fix. |
A friendly reminder that this issue had no activity for 30 days. |
Very funny, github-actions. This is still an enormous annoyance. |
@edsantiago Thanks for the reminder, I'll start looking at this again; hopefully I'll catch the race this time. |
@nalind I think we have three different races happening here:
- This typically happens because most of our pulled-image resolution happens by name in build.
- Mostly I think it happens between https://github.com/containers/common/blob/main/libimage/pull.go#L54 and https://github.com/containers/common/blob/main/libimage/pull.go#L164; we can have a global runtime lock to prevent this.

All three of these race scenarios look unrelated to each other at first glance. |
Although these races are legit, it should not flake in CI if we are already expecting things to be sequential in CI with |
I don't know the code, but that sounds like it might prevent parallel pulling, no? IMHO, pulling in parallel is almost universally desirable since bandwidth is vastly cheaper than engineer-time 😀 |
@cevich We can make it granular so it only locks pulls with the same name, so other pulls can still run in parallel. Anyway, the problem could be entirely different. @edsantiago @cevich I have a small question: have we had this flake since the inception of |
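The "granular" locking idea could look something like the sketch below: a map of per-name mutexes, so two pulls of the same image serialize while pulls of different images still run in parallel. This is purely illustrative; nameLocks and pullImage are made-up names, not libimage's API:

```go
package main

import "sync"

// nameLocks hands out one mutex per image name.
type nameLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func (n *nameLocks) forName(name string) *sync.Mutex {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.locks == nil {
		n.locks = make(map[string]*sync.Mutex)
	}
	if n.locks[name] == nil {
		n.locks[name] = &sync.Mutex{}
	}
	return n.locks[name]
}

var pulls nameLocks

// pullImage is a stand-in for the resolve-then-pull path discussed above
// (libimage/pull.go); only pulls of the same name block each other.
func pullImage(name string) error {
	l := pulls.forName(name)
	l.Lock()
	defer l.Unlock()
	// ... resolve the name and pull the image here ...
	return nil
}

func main() {
	_ = pullImage("quay.io/libpod/alpine:latest")
}
```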
The first instance I see was 2021-10-08 but unfortunately those logs are gone. |
Ref: containers#3710 Signed-off-by: Chris Evich <[email protected]>
The bud-multiple-platform-no-run test is flaking way too much. Disable it. See containers/buildah#3710 Signed-off-by: Ed Santiago <[email protected]>
A friendly reminder that this issue had no activity for 30 days. |
Dum de dum, still happening |
A friendly reminder that this issue had no activity for 30 days. |
Oddly, I haven't seen this pop up recently, @edsantiago and @lsm5, do you remember anything in podman-monitor for August? |
On podman, last seen Aug 2 |
Interesting, I remember this issue reproducing a lot more frequently. We saw it all the time (several times per week) in Buildah and Podman CI. Still, I s'pose maybe timings could have changed causing it to flake less 😞 . In any case, I also don't recall any recent activity on a fix either. @flouthoc where are we at with this, do you recall if anything else was done? |
@cevich I don't think this is fixed at all; we don't see it occurring as frequently now because we tweaked concurrency in tests for multi-platform builds, so it's just suppressed. Last time I was working on this I was unable to catch the race; maybe I'll give it a shot again. |
Seems even more difficult now, at least in "wild-CI". Though generally I agree that some kind of reproducer would be helpful here. Do you have a notion of where the problem comes from and perhaps can that code be instrumented to force it to occur more reliably/frequently? |
e.g. if you want to spawn a VM with more CPUs from |
New failure (yesterday), in actual buildah CI, not podman cron:
|
A friendly reminder that this issue had no activity for 30 days. |
A friendly reminder that this issue had no activity for 30 days. |
A friendly reminder that this issue had no activity for 30 days. |
IMHO we can close this. I haven't seen any occurrences happen in the daily jobs for quite a while. |
I wonder if this might have something to do with why we're not seeing the flake any more? Lines 5285 to 5288 in 4f8706b
|
Damn, I bet that's exactly the reason 😞 I would be in favor of removing that or possibly making it conditional on an env-var or something. |
Description

With fairly high frequency this test is failing, but sometimes it passes. This is affecting CI for all PRs for main and branch-testing itself.

Steps to reproduce the issue:
1. main branch
2. bud-multiple-platform-no-run test fails (example from PR 3706)

Describe the results you received:

Describe the results you expected:
[+0853s] ok 282 bud-multiple-platform-no-run

Output of rpm -q buildah or apt list buildah:
Buildah compiled at runtime for PR #3706

Output of buildah version:

Output of podman version if reporting a podman build issue:
N/A

Output of cat /etc/*release:
Fedora 35

Output of uname -a:
Fedora 35

Output of cat /etc/containers/storage.conf:
Defaults on Fedora 35