
concurrent requests may skip loading some cache paths #4674

Open
tonistiigi opened this issue Feb 21, 2024 · 4 comments

Comments

@tonistiigi
Member

tonistiigi commented Feb 21, 2024

This describes the fundamental issue behind docker/buildx#2144 docker/buildx#2265

In BuildKit, all steps from concurrent requests are loaded into the same solve graph. If two requests have overlapping areas, they are deduplicated: both requests point to the same vertex in the graph, with an aggregated reference count.

During a solve, an edge in the graph advances through multiple states. It monitors the parent states (and sets the desired state for them), looks up its cache keys, checks whether there are loadable cache records, loads snapshots from the cache, or executes the step. It continues until it has reached the desired state for the current edge (or gets canceled).

The desired state can be initial, cache-fast (all definition-based cache keys loaded), cache-slow (all content-based cache keys loaded), or complete. The algorithm is designed to find the biggest possible cache match and to avoid solving edges to the complete state unless absolutely needed (e.g. we may not need to load a parent step's cache and take it to the complete state if we can already determine we have loadable cache for the child; this is how, with Dockerfile + inline cache, you don't need cache for intermediate stages to get cache for the last stage).

In the case described in docker/buildx#2144 there are two builds, where the first is a subset of the second.

What happens is that the first build request is made and starts to be evaluated. As this build has cache sources, it finds that cache already exists, and it is loaded (there seems to be a progress bar issue where the second build overrides the visual "cached" status, but I have verified that these steps do load from cache without issues). The result edge of this build goes to the completed state.

Now the second build comes, and it has new cache sources. The shared parts from the previous build are connected directly to the vertices already in memory. The result edge of the new build starts with desiredstate=complete state=initial. The solver starts looking up cache keys for this step, and to do that it first needs the cache keys of the parent step. Normally it would set the parent's desired state to cache-fast and then, if needed, to cache-slow. But in this case the parent step reports that it is already in the completed state and returns its cache keys. The issue is that these are only the cache keys found during the first build; they do not contain keys from the new cache source of the second build. The state machine works in only one direction: once an edge is in the completed state, it never goes back to the cache-fast or cache-slow state where it would look up more cache keys.

In docker/buildx#2265 the workaround is to change the timing so that the result edge of the first build does not go to the completed state before the child step of the second build has been loaded into the graph. But this does not work in all cases, e.g. two independent parallel builds with shared steps happening under specific timing and cache conditions.

@flozzone

flozzone commented Aug 9, 2024

Hi @tonistiigi, I am wondering if there has been any progress on this issue. Is it planned to be fixed? Is there anything I can do?

@flozzone

Would using the local cache backend circumvent this problem, because the cache keys would already be present and wouldn't need to be fetched?

@tonistiigi
Member Author

@flozzone If you have a reproducer that hits this, you can post it for better analysis. Your previous repro docker/buildx#2144 should not hit this anymore.

Reading this again, I think one possible solution is that builds with different cache sources should have different LLB digests, and should only be merged together in the "merging edges" phase https://github.com/moby/buildkit/blob/master/docs/dev/solver.md#merging-edges after computing cache keys and potentially finding that they generated the same keys. This is a more complicated and less efficient code path (it still deduplicates the actual build containers), but parallel builds with the same build steps and different cache sources should be quite a rare case.

@brnpimentel

What about pulling the cache-from image before the build?
