RFC: Distributed BuildKit (Swarm/Kubernetes/Mesos..) #62
How about (and this is a very Swarm-centric example):
The master buildkit binary could start the build as soon as a single worker is connected, and decide to split the work out to other workers as they are added to the queue. This is just an idea that came to mind after reading your proposal. I am not at all familiar with the BuildKit design though, so it may not make sense at all, in which case just ignore it ;-)
@mlaventure
@AkihiroSuda not sure for CNI/CNM, you would have to ask the network people :) My comment was about a third mode compared to the ones you proposed: mainly, to have a fixed set of workers assigned instead of requiring
My 2 cents: there is very little value in splitting a single build across multiple nodes, which, if I'm understanding the proposal correctly, is what it seems to suggest. Most builds will not be any faster, and many builds will actually be slower because they have to wait for the cache to be distributed to a different node. Most builds are slow because of data transfer, so moving individual steps to different nodes is not actually going to speed anything up. We also shouldn't be encouraging anyone to build images that are so large they require more than a single node to build.

I think the more valuable "distributed build" feature is a "worker cluster" where a single master can be sent multiple builds, and each build might end up on a different node, but the entire build happens on a single node, and any artifacts are streamed back to the client. I think that makes Topic 2 (cache placement) much easier, because you only need to distribute the cache before/after a build instead of after each step.

Question about Topic 3: why would the artifacts be stored by BuildKit? Wouldn't they be streamed back to the client that requested the build? An option to send them to a registry also sounds pretty good.
Topic 1: I had something similar in mind as @mlaventure. We shouldn't spawn orchestration workers; that's the responsibility of some other tooling. I'm not sure I get the networking comment: the buildkit workers themselves need to be on the same network, but why would this be required for the containers executing user processes? The workers do need to have access to containerd (or equal permissions), but it could be a secure subset of the containerd API as well.

Topic 2: I'm not sure using only a registry would give the performance we are after. Without the registry, at least in theory, the overhead could be minimal in the future with better snapshot drivers. With the registry, we always have the cost of two additional writes and a push/pull sync problem. It would simplify things a lot, though. We should leave master HA out of this for now, as it is a different topic.

Topic 3: I agree with @dnephin that this mostly depends on the exporter; the worker itself should only need to worry about where it gets/puts the cache. For the push to a registry (or any central storage option), BuildKit should already support exporting cache to any registry. So maybe this is just a way to configure the master so that it automatically makes the current cache state HA (removing local worker cache) in the background. As I understand from the Topic 2 example, the metadata of the cache is still kept on the master, and it is used for assigning workers. So copy operations only happen on branch splits. If the user has a single-threaded definition, it should almost always use only one worker. Our tooling should discourage creating these single-threaded definitions, but if the user still chooses to use them, their performance should be at least the same as it is currently.
The current solver already doesn't separate requests from different clients and uses a shared graph. The logic would be the same; it is just based on the dependency chain, not on request identity.
We should encourage bigger definitions, because they give us more information to make better decisions. The goal should be that splitting the graph into more vertexes always improves performance and caching. The resulting artifact size rarely has a correlation with the graph size.
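The shared-graph behavior described above can be sketched as follows. This is an assumption-laden toy (BuildKit's real solver keys vertices by LLB content digests; `SharedSolver` and its methods are invented for illustration): two clients submitting a vertex with the same operation and inputs share one result, so deduplication follows the dependency chain, not request identity.

```python
# Toy content-addressed solver: work is deduplicated by vertex digest,
# regardless of which client requested it. Not BuildKit's actual code.
import hashlib

class SharedSolver:
    def __init__(self):
        self.results = {}   # digest -> cached result
        self.computed = 0   # how many vertices were actually executed

    @staticmethod
    def digest(op: str, inputs: tuple[str, ...]) -> str:
        return hashlib.sha256(repr((op, inputs)).encode()).hexdigest()

    def solve(self, op: str, inputs: tuple[str, ...] = ()) -> str:
        d = self.digest(op, inputs)
        if d not in self.results:      # compute at most once per digest
            self.computed += 1
            self.results[d] = f"result-of-{op}"
        return self.results[d]

solver = SharedSolver()
solver.solve("FROM alpine")   # client A
solver.solve("FROM alpine")   # client B: same digest, cache hit
print(solver.computed)        # 1
```

This is also why bigger definitions help: more vertices means more digests that can be shared across concurrent builds.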
I agree, but I think the future trend in the container world will be encouraging users to build "bundles" of images (i.e. DAB/Compose, or Helm) rather than a single image.
If "client" stands for something like What we need would be a mechanism to tell client "The worker X holds the cache you requested to build, so you can call
Even if we can launch (cc @ijc PTAL if you are interested in - moby/swarmkit#2299 seems slightly related to this topic)
My suggestion is not to use the registry, but to transfer caches across workers directly. Also, could I hear your opinion on another alternative design: using a gossip-like protocol? For the implementation, Serf might be used.
We knew that
Can we move this forward with the master-worker model, or do we need more design discussion?
yes
Maybe the first step would be to just have multiple instances of worker/snapshotter inside the same binary. Then, even when splitting up a worker binary, it could (in the beginning) be an optional binary for the remote cases. This makes sure that we don't get stuck implementing the other features because of gRPC limitations. I'd like the gRPC API to influence the rest of the design as little as possible.
Sorry for my recent inactivity. Please let me know whether the design SGTY before I open a PR.
@AkihiroSuda what's the use case for specifying a controller on the client side? My impression was that there would be a single controller, solver, and instruction cache, and multiple snapshot/worker/contenthash implementations. Constraints for finding a worker would be defined per vertex, not per build job. Is there anything I'm missing that makes this impossible?
My first idea was to incrementally add
I'll update my WIP branch.
opened #114 for initial per-vertex metadata
opened #160 for a more detailed roadmap
Child issues:
A poor man's distributed mode is here: #956 (consistent hashing for each build invocation)
I am using GitLab CI with the Kubernetes executor to create build pods in a Kubernetes cluster. I configure the builds to execute docker build on the Kubernetes node (by mounting the /var/run/docker.sock hostPath) so that the builds can re-use the local Docker cache from previous builds on the same node. Kubernetes takes care of load balancing the build pods. It does not consider the actual resource usage, but at least it will distribute the pods evenly across nodes.

The trouble with this approach is that the very same build needs to run on each of the nodes until the build is cached on all of them. Note that I have many rebuilds without changes, because we are using a monorepo/microservice structure.

The consistent-hash approach is a possible solution. But I have some Dockerfiles which take much longer to build than others. The build for a given Dockerfile will always run on the same buildkitd pod, and therefore the load will not be balanced across nodes. Furthermore, there is a chance that some nodes will not be used at all.

Instead, I would like to synchronize caches via the registry, as discussed in #723. I could run buildkitd as a DaemonSet and configure build pods to connect to the buildkitd which is running on the same node. What do you think?
Registry cache is already implemented: https://github.com/moby/buildkit#tofrom-registry
Yes, registry cache is implemented. I am just pointing out that it is not clear how to do the load balancing on Kubernetes. Is it best to run one buildkitd instance per node (i.e., as a DaemonSet)? Or does it make sense to have multiple instances per node?

Instead of the DaemonSet, each build pod could have a buildkitd and a buildctl container. The buildctl container would have to wait for the buildkitd container to become ready. I would also have to give up on local caches, but the bigger issue is that it seems like an abuse of the BuildKit architecture. This would be much easier, for example, with img, since I can run the build directly in a pod as a standalone container. But img does not support caching the build layers in a registry.
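The rebuild problem motivating the registry cache can be shown with a toy simulation. This is not real BuildKit behavior, just an illustration of the argument above: with node-local caches only, every node rebuilds an unchanged image once; with a shared registry cache, only the first node builds and the rest import.

```python
# Toy model: count how many times an unchanged build actually executes
# across N nodes, with and without a shared registry cache.
def run_builds(nodes, use_registry_cache: bool) -> int:
    registry = set()                     # layer digests exported so far
    local = {n: set() for n in nodes}    # per-node local caches
    rebuilds = 0
    for node in nodes:                   # the same build lands on each node
        cache = local[node] | (registry if use_registry_cache else set())
        if "layer-abc" not in cache:
            rebuilds += 1                # cache miss: build from scratch
            local[node].add("layer-abc")
            registry.add("layer-abc")    # export cache after the build
    return rebuilds

nodes = ["node-a", "node-b", "node-c"]
print(run_builds(nodes, use_registry_cache=False))  # 3
print(run_builds(nodes, use_registry_cache=True))   # 1
```

The registry cache trades the per-node rebuilds for pull/push traffic, which is usually the better deal for monorepo setups with many no-change rebuilds.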
Oh, and it would be awesome if BuildKit could be used with the Horizontal Pod Autoscaler. Again, I think this only works if the build is executed in the context of the build pod. Actually, I want to use buildkitd/buildctl just like I would use dockerd/docker run: the daemon runs on each node in the Kubernetes cluster, and the resources used by each build are associated with the build, not with the daemon process.
https://github.com/docker/buildx already has support for deploying BuildKit on Kubernetes.
docs/misc/design-distributed-mode.md: [RFC] docs: add design, roadmap, bof notes #160
PTAL: https://docs.google.com/presentation/d/18ZJRm_0h25GP0uvDDEugAeeOkB6x8nOWLVUwBroV0X4/edit?usp=sharing
Agenda:
Highly related issues: