Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DO NOT MERGE Flake handling: cache and prefetch images #2036

Merged
merged 2 commits into from
Feb 7, 2020

Conversation

edsantiago
Copy link
Member

Show of hands: who here loves submitting a PR, then coming back
hours later to find one job failed, then spending time poring
over logs and finding a network error? Anyone? Anyone?

This is a lame attempt to minimize such flakes by caching
commonly-used images and restoring them on demand. We
introduce a new helper, _prefetch(), which podman-pulls
an image the first time, podman-saves it, then on
subsequent calls (for the same image) podman-loads it:

@test foo {
    _prefetch alpine busybox
    ...tests that run buildah-from either
}

This is an imperfect solution: it is incomplete and will
grow more so over time as new tests are added. It is
difficult to verify its coverage. I'm really unhappy
with it but if it works, the Total Sum Of Unhappiness
might decrease overall thanks to fewer flakes. If it
doesn't work, it's trivial to remove _prefetch calls
using a sed script. Shall we give it a chance?

Signed-off-by: Ed Santiago [email protected]

@edsantiago
Copy link
Member Author

This is hard to review: all the _prefetch insertions were added manually after many iterations of manual code inspection and testing. I would appreciate a careful eye to make sure that I didn't accidentally add a _prefetch where it doesn't belong.

Tested by instrumenting helpers.bash:run_buildah() to unshare -n before running buildah and adding _prefetch commands where I found necessary. Not everywhere, because some of the buildah from and buildah pull tests actually should test pulling from a registry. After many iterations, I've reduced the error count to 81. Many of these are noise and could be eliminated by setting up a custom network namespace allowing localhost (for fetches from local registry) and github.com. I spent far too long trying to make that work, unsuccessfully.

One question that has long troubled me and now bothers me even more: Is it absolutely necessary to use so many source images? E.g. tests/bud/from-as and tests/bud/use-args require centos and registry.centos.org/centos/centos:centos7; tests/bud/copy-from requires php:7.2 and composer; and a ton of tests use ubuntu. Is there any objection to my cleaning that up (in a separate PR) and sticking with alpine and busybox, or perhaps even scratch if possible?

  • See other FIXME comments as noted in github comments.

@@ -483,6 +507,7 @@ load helpers
}

@test "bud-http-context-dir-with-Dockerfile-post" {
# FIXME FIXME FIXME: this is 100% identical to the -pre test above.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you click on the expando (more context) above, you'll see that this test is identical to the -pre one. What was the intention here? Is it OK to collapse them both into one?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea. @TomSweeneyRedHat @nalind Any ideas?

@@ -1662,12 +1772,14 @@ load helpers

@test "bud using gitrepo and branch" {
target=gittarget
# FIXME: this test takes a really long time. Is it necessary to do twice?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

almost 4 minutes, because it uses fedora as base image and runs a dnf install. Is there some other simpler test we can do instead? Can we create a new module under github.com/containers just for testing this? Or even just a side branch of containers/BuildSourceImage that just does FROM scratch and RUN echo got here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about switching it to:
run_buildah bud --format docker --layers -f tests/bud/shell/Dockerfile -t test git://github.com/containers/buildah#master

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Runs in 3.5 seconds.

@edsantiago
Copy link
Member Author

podman: command not found. Yikes - I thought podman was a requirement for these tests? Like, how is the local registry run? [pause] OMG these tests are using docker. I assume that will need to be corrected pretty soon, so let's just leave this PR on hold until CI is fixed to use podman.

@rhatdan
Copy link
Member

rhatdan commented Dec 20, 2019

I don't see a problem with consolidating on a couple of images. I do think it is handy to make sure we still have tests with apt-get and dnf/yum. But getting this to the point that we test fewer distinct images would be fine.

@rh-atomic-bot
Copy link
Collaborator

☔ The latest upstream changes (presumably 45543bf) made this pull request unmergeable. Please resolve the merge conflicts.

@rhatdan
Copy link
Member

rhatdan commented Jan 7, 2020

@edsantiago whats the scoop on this one?

@edsantiago edsantiago changed the title Flake handling: cache and prefetch images DO NOT MERGE Flake handling: cache and prefetch images Jan 7, 2020
@edsantiago
Copy link
Member Author

@rhatdan right now I'm just playing with it, wanting to see if/how it runs in CI, if it saves any time, if it reduces flakes. I've changed the title accordingly to DO NOT MERGE. Right now CI is running tests using docker, which obviously needs to be fixed. Once that gets resolved, I'll revert the fixup commit and submit for real review.

@edsantiago
Copy link
Member Author

Never mind: buildah push/pull docker-archive: does not work the way I expected (pull loses names). This can wait until CI is fixed to use podman instead of docker.

@edsantiago
Copy link
Member Author

/hold

Show of hands: who here loves submitting a PR, then coming back
hours later to find one job failed, then spending time poring
over logs and finding a network error? Anyone? Anyone?

This is a lame attempt to minimize such flakes by caching
commonly-used images and restoring them on demand. We
introduce a new helper, _prefetch(), which podman-pulls
an image the first time, podman-saves it, then on
subsequent calls (for the same image) podman-loads it:

    @test foo {
        _prefetch alpine busybox
        ...tests that run buildah-from either
    }

This is an imperfect solution: it is incomplete and will
grow more so over time as new tests are added. It is
difficult to verify its coverage. I'm really unhappy
with it but if it works, the Total Sum Of Unhappiness
might decrease overall thanks to fewer flakes. If it
doesn't work, it's trivial to remove _prefetch calls
using a sed script. Shall we give it a chance?

Signed-off-by: Ed Santiago <[email protected]>
Also: images json test: rewrite to actually check for
keys instead of just number of lines. Reason: when using
older podman to prefetch (in f29), 'history' key is lost,
giving us 26 lines of output instead of 30.

Signed-off-by: Ed Santiago <[email protected]>
bors bot added a commit that referenced this pull request Feb 7, 2020
2115: Kill Travis and Enable Bors r=rhatdan a=cevich

***Depends on:***  #1848 #1971 #2036 #2121

Co-authored-by: Ed Santiago <[email protected]>
Co-authored-by: Chris Evich <[email protected]>
bors bot added a commit that referenced this pull request Feb 7, 2020
2115: Kill Travis and Enable Bors r=rhatdan a=cevich

***Depends on:***  #1848 #1971 #2036 #2121

Co-authored-by: Ed Santiago <[email protected]>
Co-authored-by: Chris Evich <[email protected]>
bors bot added a commit that referenced this pull request Feb 7, 2020
2115: Kill Travis and Enable Bors r=rhatdan a=cevich

***Depends on:***  #1848 #1971 #2036 #2121

Co-authored-by: Ed Santiago <[email protected]>
Co-authored-by: Chris Evich <[email protected]>
@bors bors bot merged commit f0b7958 into containers:master Feb 7, 2020
@edsantiago
Copy link
Member Author

Oh yikes. This was not supposed to happen: it has not been reviewed.

@edsantiago
Copy link
Member Author

@nalind @TomSweeneyRedHat @rhatdan I'm sorry to request this, but could you find time to post-review this and let me know if you find glaring problems? It got merged last week due to my misunderstanding about how bors works. If you find any issues with it, or if this is not wanted at all in the tests, I will submit PRs to fix and/or undo.

@rhatdan
Copy link
Member

rhatdan commented Feb 10, 2020

I am fine with the change, my only concern would be, are we still exercising buildah pulling images from a registry?

@edsantiago
Copy link
Member Author

Yes. None of the tests in pull.bats use _prefetch. Also, there are many instances in which I deliberately did not add _prefetch. (Can't remember any instances offhand, because it's been a while, but there were several tests where I thought it would be better to let buildah do the pulling).

@rhatdan
Copy link
Member

rhatdan commented Feb 10, 2020

Ok lets let it go for now, and we will see if we have any issues going forward.

edsantiago added a commit to edsantiago/buildah that referenced this pull request Apr 1, 2020
Fix two issues identified in containers#2036:

 - the 'gitrepo and branch' test was pulling from a place
   that took four minutes; change it to our own repo,
   suggested by Dan, which takes just a few seconds.

   -- also, remove what I think is an unnecessary dup.
      If buildah can pull from a branch, it can pull
      from master.

 - the httpd tests were really confusing, with lots of
   copy/pasted code differing in only small ways. Refactor
   to make the purpose of each test more apparent, and
   to make it easier to add new ones as needed.

Signed-off-by: Ed Santiago <[email protected]>
edsantiago added a commit to edsantiago/buildah that referenced this pull request Apr 1, 2020
Fix three issues identified in containers#2036:

 - the 'gitrepo and branch' test was pulling from a place
   that took four minutes; change it to our own repo,
   suggested by Dan, which takes just a few seconds.

   -- also, remove what I think is an unnecessary dup.
      If buildah can pull from a branch, it can pull
      from master.

 - the httpd tests were really confusing, with lots of
   copy/pasted code differing in only small ways. Refactor
   to make the purpose of each test more apparent, and
   to make it easier to add new ones as needed.

 - combine bud-http-context-dir-with-Dockerfile -pre and -post,
   since they were identical. (Context: they started off being
   different tests, with command-line options in different
   order, but as of containers#493 the -post form of options no longer
   works so the -post test is no longer relevant)

Signed-off-by: Ed Santiago <[email protected]>
bors bot added a commit that referenced this pull request Apr 1, 2020
2265: bud.bats - cleanup, refactoring r=rhatdan a=edsantiago

Fix two issues identified in #2036:

 - the 'gitrepo and branch' test was pulling from a place
   that took four minutes; change it to our own repo,
   suggested by Dan, which takes just a few seconds.

   -- also, remove what I think is an unnecessary dup.
      If buildah can pull from a branch, it can pull
      from master.

 - the httpd tests were really confusing, with lots of
   copy/pasted code differing in only small ways. Refactor
   to make the purpose of each test more apparent, and
   to make it easier to add new ones as needed.

Signed-off-by: Ed Santiago <[email protected]>

/kind cleanup

```release-note
no
```



Co-authored-by: Ed Santiago <[email protected]>
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 5, 2023
@edsantiago edsantiago deleted the bats_prefetch branch November 18, 2024 16:13
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants