
RFC: Distributed BuildKit (Swarm/Kubernetes/Mesos..) #62

Closed
AkihiroSuda opened this issue Jul 7, 2017 · 21 comments

@AkihiroSuda
Member

AkihiroSuda commented Jul 7, 2017


PTAL
https://docs.google.com/presentation/d/18ZJRm_0h25GP0uvDDEugAeeOkB6x8nOWLVUwBroV0X4/edit?usp=sharing

Agenda:

  • BuildKit cluster orchestration
  • Cache placement & Scheduling
  • Artifact output

Highly related issues:

@mlaventure

How about this (a very Swarm-centric example):

  • create a docker network (e.g. buildkit)
  • create 2 services:
    1. One with only a single replica that would be the master
    2. Create n workers that would register with the master above via its container name

The master buildkit binary could start the build as soon as a single worker is connected, and decide to farm out work to other workers as they join.
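For illustration, a rough sketch of that layout with the Swarm CLI (the image tags and the worker's --master argument are hypothetical, not real BuildKit options):

```
# Overlay network shared by the master and the workers
docker network create --driver overlay buildkit

# 1. A single-replica master
docker service create --name buildkit-master --network buildkit \
  --replicas 1 moby/buildkit:master

# 2. n workers that register with the master via its service/container name
docker service create --name buildkit-worker --network buildkit \
  --replicas 4 moby/buildkit:worker --master buildkit-master:1234
```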

This is just an idea that came to mind after reading your proposal. I am not at all familiar with the BuildKit design though, so it may not make sense at all, in which case just ignore it ;-)

@AkihiroSuda
Member Author

@mlaventure
Thank you for the comment.
BuildKit worker containers (moby/buildkit:worker) need to create real workload containers (e.g. golang:1.8) via a bind-mounted containerd.sock (or "CinC"), and also need to connect them to an arbitrary CNI/CNM network.
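For illustration, a hypothetical sketch of launching such a worker (the socket path and flags are assumptions, not a documented setup):

```
# Give the worker access to the host's containerd so it can create the real
# workload containers (e.g. golang:1.8) itself
docker run -d --name buildkit-worker \
  -v /run/containerd/containerd.sock:/run/containerd/containerd.sock \
  moby/buildkit:worker
# The missing piece is attaching those workload containers to an arbitrary
# CNI/CNM network, which is the open question below.
```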
I haven't looked into CNI/CNM deeply, but is that easily possible?

@mlaventure

@AkihiroSuda I'm not sure about CNI/CNM, you would have to ask the network people :)

My comment was about a 3rd mode compared to the ones you proposed. Mainly, to have a fixed set of workers assigned, instead of requiring buildkit to have access to the containerd socket so it can spawn them.

@dnephin
Member

dnephin commented Jul 7, 2017

My 2 cents: there is very little value in splitting a single build across multiple nodes. If I'm understanding the proposal correctly, it seems to suggest that.

Most builds will not be any faster, and many builds will actually be slower because they have to wait for the cache to be distributed to a different node. Most builds are slow because of data transfer, so moving individual steps to different nodes is not actually going to speed anything up. We also shouldn't be encouraging anyone to build images that are so large they require more than a single node to build.

I think the more valuable "distributed build" feature is a "worker cluster" where a single master can be sent multiple builds, and each build might end up on a different node, but the entire build happens on a single node, and any artifacts are streamed back to the client.

I think that makes Topic 2: Cache placement much easier, because you only need to distribute the cache before/after a build, instead of after each step.

Question about Topic 3: why would the artifacts be stored by buildkit? Wouldn't they be streamed back to the client that requested the build? An option to send to a registry also sounds pretty good.

@tonistiigi
Member

topic1: I had something similar in mind to @mlaventure's suggestion. We shouldn't spawn orchestration workers; that's the responsibility of some other tooling. I'm not sure I get the networking comment. The buildkit-workers themselves need to be on the same network, but why would this be required for the containers executing user processes? The workers do need access to containerd (or equivalent permissions), but it could be a secure subset of the containerd API as well.

topic2: I'm not sure if using the registry only would give the performance we are after. Without the registry, at least in theory, the overhead could be minimal in the future with better snapshot drivers. With the registry, we always pay the cost of 2 additional writes and a push/pull sync problem. It would simplify things a lot though. We should leave master HA out of this for now, as it is a different topic.

topic3: I agree with @dnephin that this mostly depends on the exporter; the worker itself should only need to worry about where it gets/puts the cache. For the push to a registry (or any central storage option), buildkit should already support exporting cache to any registry. So maybe this is just a way to configure the master so that it automatically makes the current cache state HA (and removes local worker cache) in the background.

@dnephin

As I understand from the topic2 example, the metadata of the cache is still kept on the master, and it is used for assigning workers. So possible copy operations only happen on branch splits. If the user has a single-threaded definition, it should almost always use only one worker. Our tooling should discourage creating these single-threaded definitions, but if the user still chooses to use them, their performance should be at least the same as the current docker build.

I think the more valuable "distributed build" feature is a "worker cluster" where a single master can be sent multiple builds, and each build might end up on a different node, but the entire build happens on a single node,

The current solver already doesn't separate requests from different clients and uses a shared graph. The logic would be the same, except it is based on the dependency chain, not on request identity.

We also shouldn't be encouraging anyone to build images that are so large they require more than a single node to build.

We should encourage bigger definitions, because this gives us more information to make better decisions. The goal should be that splitting the graph into more vertices always improves performance and cache. Resulting artifact size rarely has a correlation with graph size.

@AkihiroSuda
Member Author

@dnephin

We also shouldn't be encouraging anyone to build images that are so large they require more than a single node to build.

I agree, but I think the future trend in the container world will be to encourage users to build "bundles" of images (i.e. DAB/Compose, or Helm) rather than a single image.
Such bundles would have a larger LLB DAG, and users would benefit from a multi-node builder.

Question about Topic 3, why would the artifacts be stored by buildkit? Wouldn't they be streamed back to the client that requested the build? An option to send to a registry also sounds pretty good.

If "client" stands for something like /usr/bin/docker, streaming back to the "client" would cause extra network traffic.

What we need would be a mechanism to tell the client "The worker X holds the cache you requested to build, so you can call Export() against the worker X and push the cache to a registry as an image (or perform another kind of export operation)."

@tonistiigi

The buildkit-workers themselves need to be on the same network, but why would this be required for the containers executing user processes? The workers do need access to containerd (or equivalent permissions), but it could be a secure subset of the containerd API as well.

docker build --network would require connecting workload containers (e.g. golang:1.8) to an arbitrary network, which could not be achieved just by bind-mounting containerd.sock into moby/buildkit:worker containers.

Even if we can launch moby/buildkit:worker containers with --privileged (which is not supported in the current Docker Swarm mode), we might still need some orchestrator-specific driver?

(cc @ijc PTAL if you are interested in - moby/swarmkit#2299 seems slightly related to this topic)

I'm not sure if using the registry only would give the performance we are after.

My suggestion is not to use a registry, but to transfer caches across workers directly.
I only mentioned using a registry as a possible alternative design.

Also, can I hear your opinion about another alternative design: using a gossip-like protocol?
A gossip-like protocol would remove the dependency on etcd / a shared volume, and it might simplify deployment & operation (though we need to take care of the bootstrap node unless we can rely on mDNS).

For implementation, Serf might be used.
According to https://groups.google.com/forum/#!topic/serfdom/GlTqAshmK_w , Serf has good scalability w.r.t. the number of nodes: "for clusters of even thousands of nodes, delivery is still sub-second."
However, due to a UDP limitation, its event size cannot exceed 512 bytes: https://github.com/hashicorp/serf/blob/60b345945e1a3453f219114823537ee2ceeb3500/serf/serf.go#L226
So I am not sure it is suitable for the cachemap.
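For a rough idea, a hypothetical sketch with the Serf CLI (node names, addresses, and the event schema are illustrative):

```
# Bootstrap node; no etcd or shared volume required
serf agent -node=worker0 -bind=10.0.0.10:7946 &

# Additional workers join the gossip pool via the bootstrap node
serf agent -node=worker1 -bind=10.0.0.11:7946 -join=10.0.0.10:7946 &

# Announce "worker1 now holds the cache for this digest"; the user event
# (name + payload) must fit within the 512-byte UDP limit, so only small
# keys like digests fit, not a full cachemap
serf event cache-add "worker1 sha256:abcd1234"
```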

@tonistiigi
Member

docker build --network would require connecting workload containers (e.g. golang:1.8) to an arbitrary network, which could not be achieved just by bind-mounting containerd.sock into moby/buildkit:worker containers.

We knew that --network would not be portable, and that's why we resisted it as long as we could. I wouldn't make any design decision based on that obscure feature. If the user wants to build with these specific settings via docker build, docker can prepare a custom worker instance for it that is connected to some local namespace and that can never be used for distributed workflows.

@AkihiroSuda
Member Author

Can we move this forward with the master-worker model?
The first task would be to split the master daemon (solver, session manager) from the worker daemon.
Then we can incrementally add support for multi-worker and multi-master.

Or do we need more design discussion?

@tonistiigi
Member

Can we move this forward with the master-worker model?

yes

The first task would be to split the master daemon (solver, session manager) from the worker daemon.
Then we can incrementally add support for multi-worker and multi-master.

Maybe the first step would be to just have multiple instances of worker/snapshotter inside the same binary. Then, even when splitting out a worker binary, it could (in the beginning) be an optional binary for the remote cases. This makes sure that we don't get stuck on implementing the other features because of gRPC limitations. I'd like the gRPC API to influence the rest of the design as little as possible.

@AkihiroSuda
Member Author

@tonistiigi

Maybe the first step would be to just have multiple instances of worker/snapshotter inside the same binary.

Sorry for my recent inactivity.
Here is my WIP branch that puts multiple controller instances into a single buildd binary: https://github.com/moby/buildkit/compare/master...AkihiroSuda:plugin.20170807.1?expand=1

Please let me know whether the design SGTY before I open a PR.

@tonistiigi
Member

@AkihiroSuda What's the use case for specifying a controller on the client side? My impression was that there would be a single controller, solver, and instruction cache, and multiple snapshot/worker/contenthash implementations. Constraints for finding a worker would be defined per vertex, not per build job. Is there anything I'm missing that makes it impossible?

@AkihiroSuda
Member Author

My first idea was to incrementally add ExecuteVertex() to api/services/control and let the master daemon call ExecuteVertex() with gRPC metadata against worker daemons.
But on second thought I found this design is not straightforward; we should just set metadata per vertex, as you pointed out, and we also shouldn't put daemon-to-daemon RPCs in api/services/control.

I'll update my WIP branch.

@AkihiroSuda
Member Author

opened #114 for initial per-vertex metadata

@AkihiroSuda
Member Author

opened #160 for more detailed roadmap

@AkihiroSuda
Member Author

Child issues:

@AkihiroSuda
Member Author

AkihiroSuda commented Apr 23, 2019

Poor man's distributed mode is here: #956 (consistent hashing for each build invocation)
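A simplified, hypothetical stand-in for that idea (a plain modulo hash over made-up worker addresses, rather than a real consistent-hash ring):

```
# Pick a buildkitd deterministically from the build's key, so the same build
# always lands on the same worker and can reuse its local cache
WORKERS=(tcp://buildkitd-0:1234 tcp://buildkitd-1:1234 tcp://buildkitd-2:1234)
KEY=$(printf '%s' "$IMAGE_NAME" | cksum | cut -d' ' -f1)
export BUILDKIT_HOST=${WORKERS[$((KEY % ${#WORKERS[@]}))]}
buildctl build --frontend dockerfile.v0 --local context=. --local dockerfile=.
```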

@drizzd

drizzd commented May 1, 2019

I am using GitLab CI with the Kubernetes executor to create build pods in a Kubernetes cluster. I configure the builds to execute docker build on the Kubernetes node (by mounting /var/run/docker.sock as a hostPath) so that the builds can re-use the local Docker cache from previous builds on the same node. Kubernetes takes care of load balancing the build pods. It does not consider the actual resource usage, but at least it will distribute the pods evenly across nodes.

The trouble with this approach is that the very same build needs to run on each of the nodes until the build is cached on all of them. Note that I have many rebuilds without changes because we are using a monorepo/microservice structure.

The consistent hash approach is a possible solution. But I have some Dockerfiles which take much longer to build than others. The build for a given Dockerfile will always run on the same buildkitd pod, and therefore the load will not be balanced across nodes. Furthermore, there is a chance that some nodes will not be used at all.

Instead, I would like to synchronize caches via the registry, as discussed in #723. I could run buildkitd as a DaemonSet and configure build pods to connect to the buildkitd instance running on the same node.
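A hypothetical sketch of what each build pod would then run (the hostPort, the NODE_IP injected via the Kubernetes downward API, and the cache ref are assumptions):

```
# Talk to the buildkitd DaemonSet pod on the same node
export BUILDKIT_HOST=tcp://${NODE_IP}:1234

# Synchronize the layer cache through the registry, as discussed in #723
buildctl build --frontend dockerfile.v0 --local context=. --local dockerfile=. \
  --import-cache type=registry,ref=registry.example.com/app:buildcache \
  --export-cache type=registry,ref=registry.example.com/app:buildcache
```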

What do you think?

@AkihiroSuda
Member Author

Registry cache is already implemented: https://github.com/moby/buildkit#tofrom-registry

@drizzd

drizzd commented May 5, 2019

Yes, registry cache is implemented. I am just pointing out that it is not clear how to do the load balancing on Kubernetes. Is it best to run one buildkitd instance per node (i.e., a DaemonSet)? Or does it make sense to have multiple instances per node?

Instead of the daemonset, each build pod could have a buildkitd and a buildctl container. The buildctl container would have to wait for the buildkitd container to become ready. I would also have to give up on local caches, but the bigger issue is that it seems like an abuse of the buildkit architecture.

This would be much easier, for example, with img, since I can run the build directly in a pod as a standalone container. But img does not support caching the build layers in a registry.

@drizzd

drizzd commented May 5, 2019

Oh, and it would be awesome if buildkit could be used with the Horizontal Pod Autoscaler. Again, I think this only works if the build is executed in the context of the build pod.

Actually, I want to use buildkitd/buildctl just like I would use dockerd/docker-run. The daemon runs on each node in the Kubernetes cluster. And the resources used by each build are associated with the build, and not with the daemon process.

@AkihiroSuda
Member Author

https://github.com/docker/buildx already has support for deploying BuildKit on Kubernetes.
Probably we can close this issue.

More info: https://www.slideshare.net/AkihiroSuda/kubeconus2019-docker-inc-booth-distributed-builds-on-kubernetes-with-buildkit-and-docker-buildx-196169581
