
concurrent requests may skip loading some cache paths #4674

Open
tonistiigi opened this issue Feb 21, 2024 · 4 comments

Comments

@tonistiigi
Member

tonistiigi commented Feb 21, 2024

This describes the fundamental issue behind docker/buildx#2144 docker/buildx#2265

In BuildKit, all steps from concurrent requests are loaded into the same solve graph. If two requests have overlapping areas, they are deduplicated: both requests point to the same vertex in the graph, with an aggregated reference count.

During a solve, an edge in the graph advances through multiple states. It monitors the parent states (and sets the desired state for them), looks up its cache keys, checks whether there are loadable cache records, loads snapshots from the cache, or executes the step. It continues until it has reached the desired state for the current edge (or gets canceled).

The desired state can be initial, cache-fast (all definition-based cache keys loaded), cache-slow (all content-based cache keys loaded), or complete. The algorithm is designed to find the biggest possible cache match and to avoid solving edges to the complete state unless absolutely needed (e.g. we may not need to load a parent step's cache and take it to the complete state if we can already determine we have loadable cache for the child; this is how, with Dockerfile + inline cache, you don't need cache for intermediate stages to get cache for the last stage).

In the case described in docker/buildx#2144 there are two builds, where the first is a subset of the second.

What happens is that the first build request is made and starts to be evaluated. As this build has cache sources, it finds that cache already exists, and it is loaded (there seems to be a progress bar issue where the second build overrides the visual "cached" status, but I have verified that these steps do load from cache without issues). The result edge of this build goes to the completed state.

Now the second build comes, and it has new cache sources. The shared parts from the previous build are connected directly to the vertices already in memory. The result edge of the new build starts with desiredstate=complete state=initial. The solver starts looking up cache keys for this step, and to do that it first needs the cache keys of the parent step. Normally it would set the parent's desired state to cache-fast and then, if needed, to cache-slow. But in this case the parent step reports that it is already in the completed state and returns its cache keys. The issue is that these are only the cache keys found during the first build; they do not contain keys from the new cache source of the second build. The state machine works in only one direction: once an edge is in the completed state, it never goes back to the cache-fast or cache-slow state where it would look up more cache keys.

In docker/buildx#2265 the workaround is to change the timing so that the result edge of the first build does not go to the completed state before the child step of the second build has been loaded into the graph. But this does not work in all cases, e.g. two independent parallel builds with shared steps happening under specific timing and cache conditions.

@flozzone

flozzone commented Aug 9, 2024

Hi @tonistiigi, I am wondering if there has been any progress on this issue. Is it planned to be fixed? Is there anything I can do?

@flozzone

Would using the local cache backend circumvent this problem, because the cache keys would already be present and wouldn't need to be fetched?

@tonistiigi
Member Author

@flozzone If you have a reproducer that hits this, you can post it for better analysis. Your previous repro docker/buildx#2144 should not hit this anymore.

Reading this again, I think one possible solution is that builds with different cache sources should have different LLB digests, and should only be merged together in the "merging edges" phase https://github.com/moby/buildkit/blob/master/docs/dev/solver.md#merging-edges after computing cache keys and potentially finding that they generated the same keys. This is a more complicated and less efficient code path (it still deduplicates the actual build containers), but parallel builds with the same build steps and different cache sources should be quite a rare case.

@brnpimentel

What about pulling the cache-from image before the build?
