
Docker BuildKit caching w/ --cache-from fails (roughly 50% rate), even when using docker-container #2279

Open
bozaro opened this issue Jul 26, 2021 · 30 comments



bozaro commented Jul 26, 2021

Version information:

  • OS: Ubuntu 20.04.2 LTS, x86_64
  • Docker Server: 20.10.7
  • Docker Client: 20.10.7
  • docker buildx version: v0.5.1-docker
  • BuildKit: moby/buildkit:v0.9.0

Steps to reproduce

Create Dockerfile:

FROM busybox AS stage-1
RUN echo "Hello, world!!!"
COPY changed.txt /opt/changed.txt

FROM busybox
COPY --from=stage-1 /opt/changed.txt /opt/changed.txt

Run a script like the following (REGISTRY should be replaced with an actual value):

#!/bin/bash
# Recreate builder for clear local cache
docker buildx rm cachebug || true
docker buildx create --name cachebug --driver docker-container
docker buildx inspect cachebug --bootstrap

# Create some changed file
date > changed.txt

# Run 
REGISTRY=registry.example.net/test-docker/example
docker buildx build \
    --builder cachebug \
    --push \
    --tag $REGISTRY:latest \
    --cache-from type=registry,ref=$REGISTRY:buildcache \
    --cache-to type=registry,ref=$REGISTRY:buildcache,mode=max \
    --platform linux/amd64 \
    --platform linux/arm64 \
    .

What I see: when I run the above script multiple times, the RUN echo "Hello, world!!!" step misses the cache roughly every second time for one of the platforms (I have never seen the cache problem on all platforms at the same time).

For example:

 => CACHED [linux/arm64 stage-1 2/3] RUN echo "Hello, world!!!"                                      0.3s
 => => sha256:e2f4ee50b555089a69b84af6621283565af19e3bcf0596b36ba5feec7b96d1d7 116B / 116B           0.2s
 => => sha256:38cc3b49dbab817c9404b9a301d1f673d4b0c2e3497dbcfbea2be77516679682 820.69kB / 820.69kB   0.6s
 => => extracting sha256:38cc3b49dbab817c9404b9a301d1f673d4b0c2e3497dbcfbea2be77516679682            0.1s
 => => extracting sha256:e2f4ee50b555089a69b84af6621283565af19e3bcf0596b36ba5feec7b96d1d7            0.1s
 => [linux/amd64 stage-1 2/3] RUN echo "Hello, world!!!"                                             0.3s
 => [linux/amd64 stage-1 3/3] COPY changed.txt /opt/changed.txt                                      0.2s
 => [linux/arm64 stage-1 3/3] COPY changed.txt /opt/changed.txt                                      0.2s
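
To put a number on the failure rate, here is a minimal harness sketch (hypothetical: it assumes the build script above is saved as ./build.sh with --progress plain added to the buildx invocation so the log is greppable):

#!/bin/bash
# Hypothetical harness: run the reproduction script repeatedly and count runs
# where fewer than two platforms report a CACHED RUN step.
# (The very first run has no cache yet and will always count as FAIL.)
runs=20
failures=0
for i in $(seq 1 "$runs"); do
  hits=$(./build.sh 2>&1 | grep -c 'CACHED \[linux/.* RUN echo')
  if [ "$hits" -lt 2 ]; then
    echo "[$i] FAIL (cached RUN steps: $hits)"
    failures=$((failures + 1))
  else
    echo "[$i] OK"
  fi
done
echo "FAILED $failures OF $runs"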

Update (2021-08-18)

Repository to reproduce the issue: https://github.com/bozaro/buildkit-2279 (simply check out and run ./test.sh).


bozaro commented Jul 28, 2021

I did some more investigation: it looks like the cache is corrupted before the upload stage.
I compared the bad and good cache data (https://drive.google.com/drive/folders/1hzMWF_qBANvFmf3BeuQKe7KToGUOJp72?usp=sharing) and found a small difference.

Found difference

Good cache

cache_good contains two layers for the RUN step:

  • sha256:48ffe3fe97d4a7f3ad3e09f613161e6f1a4f6b836751f0f0c93c8fd5ea92064a (linux/arm64)
  • sha256:52af553f4ee5a60ea433453c95375e457f139988034d49244afcc64b08e3331e (linux/amd64)

Bad cache

cache_bad contains only one layer for the RUN step:

  • sha256:52af553f4ee5a60ea433453c95375e457f139988034d49244afcc64b08e3331e (linux/amd64)
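
For anyone who wants to make the same comparison, a sketch of how to dump the cache manifest (assuming docker buildx imagetools and jq are available; the ref is the example registry from the scripts above):

REGISTRY=registry.example.net/test-docker/example
# Dump the raw cache manifest that --cache-to pushed, for offline diffing
docker buildx imagetools inspect --raw "$REGISTRY:buildcache" > cache.json
# List the cached layer digests (the RUN layer should appear once per platform)
jq -r '.layers[].digest' cache.json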

Layer graph

graph
good --> g_38cc3[38cc3: FROM arm64]
good --> g_b71f9[b71f9: FROM amd64]
g_38cc3 --> g_48ffe[48ffe: RUN]
g_b71f9 --> g_52af5[52af5: RUN]
g_b71f9 --> g_7c10d[7c10d: COPY]
g_52af5 --> g_7c10d
g_38cc3 --> g_7c10d
g_48ffe --> g_7c10d

bad --> b_38cc3[38cc3: FROM arm64]
bad --> b_b71f9[b71f9: FROM amd64]
b_b71f9 --> b_2b352[2b352: COPY]
b_38cc3 --> b_2b352
b_b71f9 --> b_52af5[52af5: RUN]
b_52af5 --> b_ca41e[ca41e: COPY]



bozaro commented Jul 28, 2021

I modified the script to check clean cache generation (without --cache-from):

#!/bin/bash
REGISTRY=registry.example.net/test-docker/example

# Recreate builder for clear local cache
docker buildx rm cachebug || true
docker buildx create --name cachebug --driver docker-container
docker buildx inspect cachebug --bootstrap

# Create some changed file
date > changed.txt

# Run with cache-to only
docker buildx build \
    --builder cachebug \
    --cache-to type=registry,ref=$REGISTRY:buildcache,mode=max \
    --platform linux/amd64 \
    --platform linux/arm64 \
    .

# Recreate builder for clear local cache
docker buildx rm cachebug || true
docker buildx create --name cachebug --driver docker-container
docker buildx inspect cachebug --bootstrap

# Create some changed file
date > changed.txt

# Run with cache-from only
docker buildx build \
    --builder cachebug \
    --cache-from type=registry,ref=$REGISTRY:buildcache \
    --platform linux/amd64 \
    --platform linux/arm64 \
    .

And the issue still reproduces randomly.


bozaro commented Jul 28, 2021

The problem seems to occur when the RUN step produces the same blob for different platforms (the builds start, and /proc is created, within the same second) and the COPY step produces different blobs (the builds start, and /opt is modified, in different seconds).

So I modified the scripts to make the problem reproduce much more often.

Dockerfile:

FROM busybox AS stage-0

FROM stage-0 AS stage-1
RUN echo "Hello, world!!!"; if [ "$(arch)" = "x86_64" ]; then sleep 2; fi
COPY changed.txt /opt/changed.txt

FROM busybox
COPY --from=stage-1 /opt/changed.txt /opt/copy.txt

Test script:

#!/bin/bash
REGISTRY=registry.example.net/test-docker/example

# Recreate builder for clear local cache
docker buildx rm cachebug || true
docker buildx create --name cachebug --driver docker-container
docker buildx inspect cachebug --bootstrap

# Create some changed file
date > changed.txt

# Build stage-0 to pull busybox image
docker buildx build \
    --builder cachebug \
    --platform linux/amd64 \
    --platform linux/arm64 \
    --target stage-0 \
    .

# Run with cache-to only
docker buildx build \
    --builder cachebug \
    --cache-to type=registry,ref=$REGISTRY:buildcache,mode=max \
    --platform linux/amd64 \
    --platform linux/arm64 \
    .

# Recreate builder for clear local cache
docker buildx rm cachebug || true
docker buildx create --name cachebug --driver docker-container
docker buildx inspect cachebug --bootstrap

# Create some changed file
date > changed.txt

# Run with cache-from only
docker buildx build \
    --builder cachebug \
    --cache-from type=registry,ref=$REGISTRY:buildcache \
    --platform linux/amd64 \
    --platform linux/arm64 \
    .


bozaro commented Jul 28, 2021

I tried saving the stage-1, latest and buildcache images and recovering the lost cache paths (https://drive.google.com/drive/u/0/folders/1pgE0BIQo6PlldnWSDUBtP99xcSvtQAlN).

graph
style copy_stage_amd64 stroke-dasharray: 5 5

root --> amd64[b71f9: FROM amd64]
root --> arm64[38cc3: FROM arm64]
amd64 --> copy_latest_amd64[62dd4: COPY latest, amd64]
amd64 -.-> run[1bb1a: RUN]
arm64 --> run
arm64 --> copy_latest_arm64[f6510: COPY latest, arm64]
run -.-> copy_stage_amd64[01cda: COPY stage-1, amd64]
run --> copy_stage_arm64[f253d: COPY stage-1, arm64]


In this case I got a single layer for the RUN command and different layers for the COPY command.
The dashed links and layers were lost from the cache.


sipsma commented Jul 28, 2021

I independently encountered some bugs around the parallel creation of blobs for different platforms and have a fix out for PR here: #2273

I have not tried the reproducer here to see whether it fixes the problem, but it may be worth a try if anyone has time.


bozaro commented Jul 29, 2021

I refactored the Dockerfile to produce a stable layer at the RUN step:

FROM alpine AS base
# The RUN command always produces a layer with digest `sha256:bcca0bb7...`
RUN echo "Hello, world!!!"; if [ "$(arch)" = "x86_64" ]; then sleep 2; fi; touch -d 2021-01-01 -t 00:00:00 /foo.txt
# The COPY command produces a different layer per platform because /opt has a different change time
COPY changed.txt /opt/changed.txt

FROM alpine
COPY --from=base /opt/changed.txt /opt/copy.txt

And I removed the stage-0 build from the script.
But docker buildx build still randomly produces a cache with an incomplete layer graph (I got 53 failures in 100 runs).
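
To confirm which blobs are actually shared between the platforms, here is a rough sketch (assumptions: the cachebug builder from the scripts above exists, and the stage is named base as in this Dockerfile); it exports each platform build as an OCI tar and diffs the blob digests:

for p in linux/amd64 linux/arm64; do
  # Export each platform's build as an OCI layout archive
  docker buildx build --builder cachebug --platform "$p" \
      --output "type=oci,dest=out-${p//\//-}.tar" --target base .
done
# Blob paths in an OCI layout are blobs/sha256/<digest>
tar tf out-linux-amd64.tar | grep '^blobs/sha256/' | sort > amd64.txt
tar tf out-linux-arm64.tar | grep '^blobs/sha256/' | sort > arm64.txt
# Digests present in both archives are the shared layers (the stable RUN layer
# should show up here; the per-platform COPY layers should not)
comm -12 amd64.txt arm64.txt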


bozaro commented Jul 29, 2021

I independently encountered some bugs around the parallel creation of blobs for different platforms and have a fix out for PR here: #2273

I have not tried the reproducer here to see whether it fixes the problem, but it may be worth a try if anyone has time.

I built #2273 locally, and it looks like it does not fix this issue.


sipsma commented Jul 29, 2021

I built #2273 locally, and it looks like it does not fix this issue.

Ah okay, good to know, thanks for giving it a try

bozaro changed the title from "Docker BuildKit caching w/ --cache-from fails roughly every second time, even when using docker-container" to "Docker BuildKit caching w/ --cache-from fails (roughly 50% rate), even when using docker-container" on Aug 8, 2021
bozaro added a commit to bozaro/buildkit-2279 that referenced this issue Aug 18, 2021

bozaro commented Aug 18, 2021

I created a script for one-click reproduction: https://github.com/bozaro/buildkit-2279


bozaro commented Oct 26, 2021

I rechecked this issue using the https://github.com/bozaro/buildkit-2279 script, and it is no longer flaky.
It now fails consistently:

e134e8a1d9522d24917077a359c77e7428586adb7be11d16d4c886902c2ecb3b
Wait for service on endpoint: http://localhost:5001
bd194edf668ff7df00eedecb5f9b346af23405cc4e7bf0ce12f91b6aefc59da9
Wait for service on endpoint: http://localhost:5002
Check is build for multiple platform supported...
Run tests...
[1] FAIL
...
[100] FAIL
Cleanup docker images...
cachebug_registry
cachebug_proxy
Show statistics...
FAILED 100 OF 100


n1ngu commented Jul 21, 2022

Given the resolution of #1981, could you try upgrading your buildx plugin to version 0.8.2?
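
(For reference, a sketch of a manual plugin upgrade, assuming a Linux amd64 host; the release URL follows docker/buildx's standard asset naming:)

# Check the currently installed version
docker buildx version
# Install buildx v0.8.2 as a CLI plugin for the current user
mkdir -p ~/.docker/cli-plugins
curl -sSL -o ~/.docker/cli-plugins/docker-buildx \
    https://github.com/docker/buildx/releases/download/v0.8.2/buildx-v0.8.2.linux-amd64
chmod +x ~/.docker/cli-plugins/docker-buildx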

@benlangfeld

Has this issue proceeded any further or is there still no known fix? How are people working around this?

@ghost-of-montevideo

My fix was simply to not use multi-stage builds, but that of course is not a useful solution for everybody.

@ghost-of-montevideo

@n1ngu I tried upgrading and the issue persists; multi-stage is broken.


n1ngu commented Sep 2, 2022

OK @ghost-of-montevideo, just check that you are not actually hitting the confusingly similar #2274.

I was under the impression that the present issue is fixed and only affects the outdated buildkit vendored out of the box within docker, as I was unable to reproduce it when using an updated buildkit plugin and the docker-container build driver. Unlike the present issue, the reproduction of #2274 is consistent for me.


benlangfeld commented Oct 11, 2022

@n1ngu The reproduction in https://github.com/bozaro/buildkit-2279 shows that with buildx version 0.9.1 and moby/buildkit:0.10.4 the failure rate is 100% on this issue. Can you show how you were able to validate that this is working, as stated in your last comment?


n1ngu commented Oct 13, 2022

@n1ngu The reproduction in https://github.com/bozaro/buildkit-2279 shows that with buildx version 0.9.1 and moby/buildkit:0.10.4 the failure rate is 100% on this issue. Can you show how you were able to validate that this is working, as stated in your last comment?

@benlangfeld I can't access the build environment I used at the time, but I now think the culprit is multi-stage builds, as @ghost-of-montevideo pointed out. All my builds were single-stage, so they worked fine. I just feel this isn't highlighted enough in the issue description.

@slonopotamus

I just feel this isn't highlighted enough in the issue description.

The initial issue message has a "Steps to reproduce" section. What additional highlighting do you expect?


n1ngu commented Oct 14, 2022

The initial issue message has a "Steps to reproduce" section. What additional highlighting do you expect?

Given the context of #1981 and #2274, I'd expect the "multi-stage builds trigger the problem, single-stage builds work fine" bit of information (not literally) to be both in the title and the description, not only in the reproduction source code.

@kkmoslehpour

Is there a solution to this yet? I am also using multi-stage builds and encountering this issue.


neilsh commented Jul 12, 2023

Is anybody working on this issue, and is there any way others can help? Either with this or the related #2274.

@ghost-of-montevideo

I'm also happy to help a bit; if there is a way we can contribute, let us know.

@slonopotamus

if there is a way we can contribute, let us know

You can debug the code, find the root of the issue, fix it, and submit a PR.


alexwilson1 commented Dec 18, 2023

I spent a while on this issue and could not find a great resolution using buildkit, and there are lots of unsolved issues about this across the web. After seeing #1981 (comment) and another report on SO, I switched our GitLab Cloud CI/CD pipelines to use buildah instead of buildkit, and caching is working well now.

It's (almost) a drop-in replacement, and using overlay2 as the storage driver, the build performance (with caching disabled) seems to be the same as with buildkit.

Here's a sample of what that could look like in a .gitlab-ci.yml file:

.build_template: &build_template
  image: quay.io/buildah/stable
  before_script:
    - buildah login [Login Params] [Registry URL]
  script:
    - buildah build
      --cache-from registry.example.com/myproject/myapp
      --tag registry.example.com/myproject/myapp:${CI_COMMIT_SHA}
      --cache-to registry.example.com/myproject/myapp
      --layers
      --storage-driver overlay2
      -f $DOCKERFILE_PATH $BUILD_CONTEXT_PATH
    - buildah push registry.example.com/myproject/myapp:${CI_COMMIT_SHA}
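
For context on the flags (to the best of my understanding of buildah): --layers enables buildah's per-instruction layer caching, which is off by default unless BUILDAH_LAYERS is set, and --cache-from/--cache-to name a repository that intermediate layers are pulled from and pushed to, so the cache survives ephemeral CI runners.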

@fermixbook

I rechecked this issue using the https://github.com/bozaro/buildkit-2279 script, and it is no longer flaky. It now fails consistently:

[1] FAIL
...
[100] FAIL
FAILED 100 OF 100

I tried your script, and I only got 100 failures out of 100 once I had exhausted the Docker Hub request limit. Were you logged in when you tried?

@cucxabong

Hi guys, any update on this?


moltar commented Oct 10, 2024

Experiencing the same.

Tried S3 and registry cache backends.

Two builds run in subsequent CodeBuild projects, which use exactly the same environment deployed via IaC. The only differences are some build flags (e.g. the stage target) and some ARGs.

The first build re-uses all of the cache and finishes very fast.

The second build re-uses the layer cache for the first few layers, and then ignores the cache for the last command.

I inspected the layers thoroughly using dive, and the layer preceding the one that is not getting cached has exactly the same digest across runs.

buildx: 0.17.1
Docker version 26.1.4, build 5650f9b
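
A sketch of that digest comparison, in case it helps others (hypothetical image tags run1/run2 on a made-up registry; assumes crane and jq are installed, and that for a multi-arch image you resolve the platform-specific manifest via --platform):

# Compare layer digests between two runs of the "same" build
crane manifest --platform linux/amd64 registry.example.com/myproject/myapp:run1 \
    | jq -r '.layers[].digest' > run1.txt
crane manifest --platform linux/amd64 registry.example.com/myproject/myapp:run2 \
    | jq -r '.layers[].digest' > run2.txt
diff run1.txt run2.txt && echo "layer digests identical"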


ruffsl commented Oct 10, 2024

Hey @moltar, your experience sounds much like mine. I've been using docker bake with multi-stage Dockerfiles, and it's been frustrating that cache restoration is quite flaky, even when simply using either the local file or the remote S3 cache backend. Even when all the build inputs and contexts are held deterministically constant, caching seems to be hit or miss, and often only partial.

While sifting through related tickets, I believe the main culprit is the issue discussed here:

@thompson-shaun

Pulled https://github.com/bozaro/buildkit-2279 and ran test.sh (thanks @bozaro).

...
Cleanup docker images...
cachebug_registry
cachebug_proxy
Show statistics...
FAILED 54 OF 100


ruffsl commented Nov 15, 2024

Hmm, I can't seem to replicate that failure rate on my workstation (although I commented out the arm64 bits):

$ time ./test.sh
4b2cb5a376919c6b26fe510568c83806ea4b8f89bce4c917f1d49390e439bc9c
Wait for service on endpoint: http://localhost:5001
229f5e7e5e20934dd27f448afdeb70e16397b12d6807c1841e71468b8df8a7a3
Wait for service on endpoint: http://localhost:5002
Check is build for multiple platform supported...
Run tests...
[1] OK
...
[100] OK
Cleanup docker images...
cachebug_registry
cachebug_proxy
Show statistics...
ALL SUCCESSES

real    12m36.002s
user    1m3.257s
sys     0m48.798s

Although for more complex multi-stage builds, like the ones I use internally, I certainly can. I'll have to build a similar reproducer.
