load sometimes doesn't load #321

Closed
champo opened this issue Mar 26, 2021 · 14 comments · Fixed by docker/buildx#1927
Labels
kind/upstream Changes need to be made on upstream project

Comments

@champo

champo commented Mar 26, 2021

Behaviour

Trying to run a command with a just built image sometimes fails to find the image:

 $ docker run --rm -t -v "${GITHUB_WORKSPACE}:/src/android/apolloui/build/outputs/" muun_android:latest
Unable to find image 'muun_android:latest' locally
docker: Error response from daemon: pull access denied for muun_android, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.

The build step runs ok and has no notable differences in output between correct and failed runs.

Expected behaviour

The muun_android image should be found and run. In https://github.com/muun/apollo/runs/2203961523?check_suite_focus=true it succeeded (see the Inspect step, because the build failed due to something unrelated).

Configuration

name: pr
on: pull_request
jobs:
  pr:
    runs-on: ubuntu-20.04
    steps:
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@154c24e1f33dbb5865a021c99f1318cfebf27b32
        with:
          buildkitd-flags: --debug

      - name: Checkout
        uses: actions/checkout@5a4ac9002d0be2fb38bd78e4b4dbde5606d7042f

      - name: Build
        uses: docker/build-push-action@9379083e426e2e84abb80c8c091f5cdeb7d3fd7a
        with:
          load: true
          tags: muun_android:latest
          file: android/Dockerfile
          context: .

      - name: Inspect
        run: |
            docker images 
      - name: Build apollo
        run: |
          docker run --rm -t -v "${GITHUB_WORKSPACE}:/src/android/apolloui/build/outputs/" muun_android:latest
      - name: Upload APK
        uses: actions/upload-artifact@e448a9b857ee2131e752b06002bf0e093c65e571
        with:
          name: apk
          path: apk/prod/release/apolloui-prod-release-unsigned.apk

Logs

logs_8.zip

@bensalilijames

This is happening to us too. It's super weird because we have three identical workflows set up (with different image names) - two of them succeed but one of them is constantly failing with the above error.

The workflow file:
name: Docker

on:
  push:
    # Publish `staging` as Docker `latest` image.
    branches:
      - staging

    # Publish `v1.2.3` tags as releases.
    tags:
      - v*

env:
  IMAGE_NAME: ml-intents

jobs:
  # Push image to GitHub Packages.
  # See also https://docs.docker.com/docker-hub/builds/
  push:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v2

      # This is a separate action that sets up the buildx runner
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v1

      # So now we can use GitHub actions' own caching for Docker layers!
      - name: Cache Docker layers
        uses: actions/cache@v2
        with:
          path: /tmp/.buildx-cache
          key: ${{ runner.os }}-buildx-${{ env.IMAGE_NAME }}-${{ github.sha }}
          restore-keys: |
            ${{ runner.os }}-buildx-${{ env.IMAGE_NAME }}-

      - name: Build image
        uses: docker/build-push-action@v2
        with:
          builder: ${{ steps.buildx.outputs.name }}
          context: .
          file: intents/Dockerfile
          load: true
          tags: ${{ env.IMAGE_NAME }}:latest
          cache-from: type=local,src=/tmp/.buildx-cache
          cache-to: type=local,dest=/tmp/.buildx-cache-new

      - name: Login to GitHub Container Registry
        uses: docker/login-action@v1
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Push image to GitHub Container Registry
        run: |
          IMAGE_ID=ghcr.io/${{ github.repository_owner }}/$IMAGE_NAME

          # Change all uppercase to lowercase
          IMAGE_ID=$(echo $IMAGE_ID | tr '[A-Z]' '[a-z]')

          # Strip git ref prefix from version
          VERSION=$(echo "${{ github.ref }}" | sed -e 's,.*/\(.*\),\1,')

          # Strip "v" prefix from tag name
          [[ "${{ github.ref }}" == "refs/tags/"* ]] && VERSION=$(echo $VERSION | sed -e 's/^v//')

          # Use Docker `latest` tag convention
          [ "$VERSION" == "staging" ] && VERSION=latest

          echo IMAGE_ID=$IMAGE_ID
          echo VERSION=$VERSION

          echo Listing docker images...
          docker image ls

          echo Tagging image...
          docker tag $IMAGE_NAME:latest $IMAGE_ID:$VERSION
          echo Tagged image successfully!

          echo Pushing image...
          docker push $IMAGE_ID:$VERSION
          echo Pushed image successfully!

      - # Temp fix
        # https://github.com/docker/build-push-action/issues/252
        # https://github.com/moby/buildkit/issues/1896
        name: Move cache
        run: |
          rm -rf /tmp/.buildx-cache
          mv /tmp/.buildx-cache-new /tmp/.buildx-cache

The runner gets to the docker tag $IMAGE_NAME:latest $IMAGE_ID:$VERSION line and errors out with Error response from daemon: No such image: ml-intents:latest as above. docker image ls does not list the built image either.

The two successful workflows have much smaller images (500 MB and 2 GB), whereas the failing image is a lot bigger (5 GB). Could that be an influencing factor here?

@crazy-max
Member

@champo @benhjames Cannot repro locally or with GHA. Maybe it fails silently because of insufficient disk space:

Each virtual machine has the same hardware resources available.

  • 2-core CPU
  • 7 GB of RAM memory
  • 14 GB of SSD disk space

You have 14GB at your disposal on the runner (in practice I would say about 9GB once the pre-installed middleware is accounted for):

/dev/sdb1        14G  4.1G  9.0G  32% /mnt

Can you add this step at the end of your workflow (before Move cache for you @benhjames) and give me the output:

- name: Disk
  if: always()
  run: |
    df -h
    docker buildx du

@bensalilijames

Thanks for investigating @crazy-max! I first added that step and a separate step to list the Docker images, but it still didn't appear to be exported into Docker. The disk space on that run seemed to match yours:

/dev/sdb1        14G  4.1G  9.0G  32% /mnt

I then modified the workflow file to exactly match yours, and the same issue occurred.

Then I re-ran the same job, but this time it exported correctly. This was the first run where Docker had cache available (because the builds before the last one never got a chance to save their cache, as they errored on the push to GHCR).

I then went back to look at your first run (i.e. without build cache) and noticed that in that particular run it doesn't list the Docker images. So I have a feeling that if there is no build cache, then the export to Docker fails, but if there is build cache, like in your subsequent builds and my last build linked above, then it succeeds. Really weird. Hope that helps...?

@crazy-max
Member

crazy-max commented Apr 14, 2021

@benhjames Thanks for your feedback. Yes actually /var/lib/docker uses /dev/root fs which is 99% full on your runner so I presume that's the issue here:

/dev/root        84G   82G  1.2G  99% /

Can you add docker buildx du in the Disk step and give me the output please?
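
The check above comes down to how much space is left on the filesystem backing Docker's data directory. A minimal sketch of a pre-build guard along those lines follows; the helper names `avail_kib_from_df` and `has_free_gib` and the 10 GiB threshold are illustrative assumptions, not part of any action or of buildx.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the threshold are illustrative assumptions.

# Print the "Available" column (in KiB) from `df -Pk` output read on stdin.
avail_kib_from_df() {
  awk 'NR == 2 { print $4 }'
}

# Succeed if the KiB value in $1 is at least $2 GiB.
has_free_gib() {
  local avail_kib="$1" needed_gib="$2"
  [ "$avail_kib" -ge $(( needed_gib * 1024 * 1024 )) ]
}

# Possible use in a workflow step (requires a running Docker daemon):
#   root_dir="$(docker info --format '{{.DockerRootDir}}')"
#   avail="$(df -Pk "$root_dir" | avail_kib_from_df)"
#   has_free_gib "$avail" 10 || { echo "low disk space on $root_dir" >&2; exit 1; }
```

Asking Docker for its root directory instead of hard-coding /var/lib/docker keeps the check pointed at the filesystem that actually receives the loaded image.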

@bensalilijames

Thanks @crazy-max, I added that command to both Disk steps (and removed the cache action) and the results can be viewed here. It does indeed look like it runs out of disk space and then silently fails to load the image into Docker.

Is there anything that you think could be done about this to shrink the disk usage after the build step? I notice that docker buildx du without cache lists Reclaimable: 17.71GB which seems like a lot? How come building with the cache takes up much less space?

Sorry for the questions - would be great to find a solution to this somehow (without reverting back to the plain docker build without cache like I was previously doing before this!)

@crazy-max
Member

@benhjames

I notice that docker buildx du without cache lists Reclaimable: 17.71GB which seems like a lot? How come building with the cache takes up much less space?

These are the subsequent instructions cached by buildx for the current builder. You can get more info by using docker buildx du --verbose. If you use an external cache, only the last stage will be cached, so it takes less space and the image can be loaded.

Is there anything that you think could be done about this to shrink the disk usage after the build step?

You could use a self-hosted runner but in the near future you will be able to configure CPU cores, RAM, disk space for the runner (see github/roadmap#161).

Or, more drastically, remove some of the components pre-installed on the runner in your workflow, like dotnet (~23GB):

  - name: Remove dotnet
    run: sudo rm -rf /usr/share/dotnet

@bensalilijames

Thanks a lot @crazy-max for the help, that's really useful, much appreciated. 🙌

@cep21

cep21 commented Apr 22, 2021

Hi,

Thank you for this thread! I was running into the same issue. I would expect an error log of some kind when disk issues happen and the images cannot be loaded correctly with --load. I couldn't find a buildx issue for this. Is there an issue tracking this somewhere else, or is the error log there and I'm just not finding it?

Thanks!

@bensalilijames

Hey @cep21, the issue to track in buildx is docker/buildx#593!
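
Until the failure is reported upstream, a workflow can detect it itself by checking that the tag really is in Docker's image list after the build step. The following is a minimal sketch of that idea; `image_listed` is a hypothetical helper name, and the tag in the usage comment is the one from the original report.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if the exact "repository:tag" given as
# $1 appears on stdin (one image reference per line).
# The whole input is read (no -q) so the producer never sees a closed pipe,
# which matters when the step runs with `set -o pipefail`.
image_listed() {
  grep -Fx -- "$1" >/dev/null
}

# Possible use right after the build step (requires a Docker daemon):
#   docker image ls --format '{{.Repository}}:{{.Tag}}' \
#     | image_listed "muun_android:latest" \
#     || { echo "build succeeded but the image was not loaded" >&2; exit 1; }
```

Matching the whole line avoids false positives from tags that merely share a prefix with the expected one.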

@crazy-max crazy-max added kind/upstream Changes need to be made on upstream project and removed status/needs-investigation labels Apr 22, 2021
@champo
Author

champo commented Apr 22, 2021

❤️ Thanks for the deep look into this! I ended up changing the build approach for other reasons, which I guess accidentally reduced the image size, making the issue disappear.

@Nickersoft

Hey folks! I believe I'm also hitting this issue – is there currently any workaround other than trying to shrink your image size? I tried sudo rm -rf /usr/share/dotnet, but to no avail. I only have 86% of memory used, but load still isn't loading my image into Docker.

@crazy-max
Member

crazy-max commented Mar 20, 2023

@master-bob As discussed in #841, I made some tests using the docker driver and the docker-container driver:

FROM alpine
RUN dd if=/dev/zero of=/tmp/output.dat bs=2048M count=1
RUN dd if=/dev/zero of=/tmp/output2.dat bs=2048M count=1
RUN dd if=/dev/zero of=/tmp/output3.dat bs=2048M count=1
RUN uname -a

jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        driver:
          - docker
          - docker-container
    steps:
      -
        name: Checkout
        uses: actions/checkout@v3
      -
        name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
        with:
          driver: ${{ matrix.driver }}
          buildkitd-flags: --debug
      -
        name: Disk
        run: |
          df -h
      -
        name: Build and push
        uses: docker/build-push-action@master
        with:
          context: .
          file: ./fat.Dockerfile
          load: true
          tags: |
            foo
      -
        name: List images
        run: |
          docker image ls
      -
        name: Disk
        if: always()
        run: |
          df -h
          docker buildx du

docker driver

fs before build:

Filesystem      Size  Used Avail Use% Mounted on
/dev/root        84G   55G   29G  66% /

docker image ls:

Run docker image ls
REPOSITORY       TAG         IMAGE ID       CREATED          SIZE
foo              latest      1636a6843a99   20 seconds ago   6.45GB
node             18          37b4077cbd8a   11 days ago      997MB
...

fs after build:

Filesystem      Size  Used Avail Use% Mounted on
/dev/root        84G   61G   23G  73% /

docker-container driver

fs before build:

Filesystem      Size  Used Avail Use% Mounted on
/dev/root        84G   55G   29G  66% /

docker image ls:

Run docker image ls
REPOSITORY       TAG               IMAGE ID       CREATED              SIZE
foo              latest            50f49c8d6cd9   About a minute ago   6.45GB
node             18                37b4077cbd8a   11 days ago          997MB
...

fs after build:

Filesystem      Size  Used Avail Use% Mounted on
/dev/root        84G   67G   17G  80% /

As you can see, when building with a container builder, Buildx first creates an intermediate tarball and then loads the image into Docker. That would explain the issue, as it would require twice the space (~30GB) in your case.
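
As a rough rule of thumb drawn only from the two runs above (about 6G of disk growth with the docker driver and about 12G with docker-container, for a 6.45GB image), the space needed to load an image can be estimated as below. This is a sketch under that assumption; `load_space_needed_gb` is a hypothetical helper, it works on whole gigabytes only, and the real overhead will vary with the build.

```shell
#!/usr/bin/env bash
# Rough estimate only, based on the two measurements in this comment:
# the docker driver grew the disk by about 1x the image size,
# the docker-container driver by about 2x (intermediate tarball + image).
load_space_needed_gb() {
  local image_gb="$1" driver="$2"
  case "$driver" in
    docker-container) echo $(( image_gb * 2 )) ;;
    *)                echo "$image_gb" ;;
  esac
}

# Example: a 15 GB image loaded from a docker-container builder
#   load_space_needed_gb 15 docker-container
```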

I suggest using the docker driver in your workflow:

      -
        name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
        with:
          driver: docker

@tonistiigi @jedevc, I wonder if we could remove the intermediate tarball when the image is loaded to Docker. WDYT?

@master-bob

master-bob commented Mar 20, 2023

As you can see, when building with a container builder, Buildx first creates an intermediate tarball and then loads the image into Docker. That would explain the issue, as it would require twice the space (~30GB) in your case.

I suggest using the docker driver in your workflow:

Thank you for the in-depth analysis.

I do have a question. Without using that driver, my understanding is that when using subsequent build-push-actions it will use the cached version if it is available. By changing the driver would this functionality remain the same? Edit: yes, it appears functionality remains the same.

Edit: I think the dotnet location changed on ubuntu-22 as I didn't see any significant change in space usage when attempting to remove. So I opted to remove /usr/local/lib/android/sdk, ~14g, and /opt/hostedtoolcache, ~9g.

Abbreviated listing of /opt/hostedtoolcache on ubuntu:latest (22):

489M	/opt/hostedtoolcache/PyPy
1.6G	/opt/hostedtoolcache/go
5.4G	/opt/hostedtoolcache/CodeQL
16K	/opt/hostedtoolcache/Java_Temurin-Hotspot_jdk
378M	/opt/hostedtoolcache/node
62M	/opt/hostedtoolcache/Ruby
1.2G	/opt/hostedtoolcache/Python
9.1G	/opt/hostedtoolcache

Before removing android and the hostedtoolcache:

Filesystem      Size  Used Avail Use% Mounted on
/dev/root        84G   54G   30G  65% /

and after

Filesystem      Size  Used Avail Use% Mounted on
/dev/root        84G   31G   53G  37% /
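
The cleanup described above could be scripted roughly as follows. This is a sketch, not a tested runner recipe: `remove_dirs` is a hypothetical helper, and the paths in the usage comment are the ones mentioned in this comment, which may move between runner image versions (as happened with dotnet), hence the existence check.

```shell
#!/usr/bin/env bash
# Sketch: remove each directory given as an argument, skipping missing ones.
# Set SUDO=sudo on a hosted runner, where these paths are owned by root.
remove_dirs() {
  local dir
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      echo "removing $dir"
      ${SUDO:-} rm -rf -- "$dir"
    else
      echo "skipping $dir (not found)"
    fi
  done
}

# Possible use in a workflow step, before the build:
#   df -h /
#   SUDO=sudo remove_dirs /usr/local/lib/android/sdk /opt/hostedtoolcache
#   df -h /
```

Printing a line for skipped paths makes it obvious in the job log when a directory has moved and the step no longer frees anything.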

master-bob added a commit to master-bob/docker-android-build-box that referenced this issue Apr 1, 2023
Change to build the image using docker action, this should then allow
the docker action's cache to be used. Subsequently reducing build time
~50%, as the image will only need to be built once. Currently image is
built twice.

The default driver uses double the disk space, see
docker/build-push-action/issues/321 (in brief the image is built in the
build-push-action local cache and then transferred to the local docker).
This is a problem as this image is so large. Using the `docker` driver
will work around this.
mingchen added a commit to mingchen/docker-android-build-box that referenced this issue Apr 1, 2023
Change to build the image using docker action. Subsequently reducing build time
~50%, as the image will only need to be built once. Currently image is
built twice.

The default driver uses double the disk space, see
docker/build-push-action#321 (in brief, the image is built in the
build-push-action local cache, tarred, and then transferred to the local docker).
This is a problem as this image is so large. Using the docker driver
will work around this.
@saumets

saumets commented May 24, 2023

Just wanted to drop a note that I began experiencing this exact same issue today.

In my workflow I build three separate Docker images, all using the load: true parameter. I was also using caching for all of the builds, like so:

with:
  context: ./nginx
  load: true
  tags: ibp_nginx:latest
  cache-from: type=gha
  cache-to: type=gha,mode=max

Today, seemingly at random, one of the images was being built successfully, but adding a step to inspect docker images -a showed that it was never added to Docker's image list. I stumbled upon this thread while looking for a solution. We're also using a custom GHA runner with plenty of disk space available, but I tried some of the disk space proposals in this thread anyway, to no avail. I also tried deleting my entire GHA repository cache and starting the cache from scratch. No dice.

In the end I noticed this from @crazy-max up above:

As you can see, when building with a container builder, Buildx first creates an intermediate tarball and then loads the image into Docker. That would explain the issue, as it would require twice the space (~30GB) in your case.

I suggest using the docker driver in your workflow:

name: Set up Docker Buildx
uses: docker/setup-buildx-action@v2
with:
  driver: docker

Using setup-buildx-action@v2 with driver: docker resolved my issue and finally all the images are being built and available again via load: true. The downside to this of course is that this driver does not support caching from what I can tell.

abkfenris added a commit to abkfenris/jupyter-image that referenced this issue Jul 31, 2023
Additionally updates Pangeo-notebook, adds mamba, and removes tensorflow as it was likely responsible for exploding the build as the Docker image would not actually load

Closes oceanhackweek#71 oceanhackweek#72

Xref docker/build-push-action#321
gaborcsardi added a commit to r-hub/containers that referenced this issue Sep 10, 2023
So there is enough space for the container.
See docker/build-push-action#321