
[v0.13] OTEL socket mount requires new volumes when containerd is run in separate container #4764

Closed
georgethebeatle opened this issue Mar 15, 2024 · 12 comments

@georgethebeatle

georgethebeatle commented Mar 15, 2024

Steps to reproduce

  1. Create and target a kind cluster:
     kind create cluster
  2. Clone the gist:
     gh gist clone https://gist.github.com/georgethebeatle/487a6e99ab5f9dfc6be493729e0f426c
  3. From the root dir of the gist, run:
     kbld -f kbld.yaml -f manifest.yaml

What is expected to happen

The kbld command succeeds

What actually happens

The command fails with the following output:

repro | starting build (using kubectl buildkit): . -> kbld:rand-1710510549453549629-17512815236153-repro
repro | #1 [internal] booting buildkit
repro | #1 waiting for 1 pods to be ready for buildkit
repro | #1 0.537 Normal         buildkit-6dd8f4bc7d     SuccessfulCreate        Created pod: buildkit-6dd8f4bc7d-z8swt
repro | #1 0.542 Normal         buildkit-6dd8f4bc7d-z8swt       Scheduled       Successfully assigned default/buildkit-6dd8f4bc7d-z8swt to kind-control-plane
repro | #1 0.542 Warning        buildkit-6dd8f4bc7d-z8swt       FailedMount     MountVolume.SetUp failed for volume "docker-sock" : hostPath type check failed: /var/run/docker.sock is not a socket file
repro | #1 0.542 Warning        initial attempt to deploy configured for the docker runtime failed, retrying with containerd
repro | #1 1.730 Normal         buildkit-6f5667d48      SuccessfulCreate        Created pod: buildkit-6f5667d48-nwhvn
repro | #1 1.735 Normal         buildkit-6f5667d48-nwhvn        Scheduled       Successfully assigned default/buildkit-6f5667d48-nwhvn to kind-control-plane
repro | #1 1.735 Normal         buildkit-6f5667d48-nwhvn        Pulled  Container image "docker.io/moby/buildkit:buildx-stable-1" already present on machine
repro | #1 1.735 Normal         buildkit-6f5667d48-nwhvn        Created         Created container buildkitd
repro | #1 1.735 Normal         buildkit-6f5667d48-nwhvn        Started         Started container buildkitd
repro | #1 waiting for 1 pods to be ready for buildkit 11.0s done
repro | #1 11.04 All 1 replicas for buildkit online
repro | #1 DONE 11.0s
repro |
repro | #2 [internal] load build definition from Dockerfile
repro | #2 transferring dockerfile: 470B done
repro | #2 DONE 0.0s
repro |
repro | #3 resolve image config for docker-image://docker.io/docker/dockerfile:experimental
repro | #3 DONE 0.7s
repro |
repro | #4 docker-image://docker.io/docker/dockerfile:experimental@sha256:600e5c62eedff338b3f7a0850beb7c05866e0ef27b2d2e8c02aa468e78496ff5
repro | #4 resolve docker.io/docker/dockerfile:experimental@sha256:600e5c62eedff338b3f7a0850beb7c05866e0ef27b2d2e8c02aa468e78496ff5 0.0s done
repro | #4 CACHED
repro | Error: failed to solve: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/run/buildkit/otel-grpc.sock" to rootfs at "/dev/otel-grpc.sock": stat /run/buildkit/otel-grpc.sock: no such file or directory: unknown
repro | error: exit status 1
repro | finished build (using kubectl buildkit)
kbld: Error:
- Resolving image 'repro': exit status 1

Debugging

We were able to bisect the failure down to this PR

@crazy-max
Member

It should be fixed with #4619

Can you try with moby/buildkit:latest? (v0.13.0)

@georgethebeatle
Author

georgethebeatle commented Mar 15, 2024

Hey @crazy-max

moby/buildkit:latest (or moby/buildkit:0.13.0) does not seem to be a public docker image. Therefore we just checked out v0.13.0, built the images ourselves and side-loaded them into kind. Unfortunately the build failed in the same manner.

PS: We get the same failure with master as well.

@crazy-max
Member

moby/buildkit:latest (or moby/buildkit:0.13.0) does not seem to be a public docker image.

Hmm, these tags do exist:

@tonistiigi tonistiigi added this to the v0.13.1 milestone Mar 16, 2024
@tonistiigi
Member

@AkihiroSuda

@brettmorien

brettmorien commented Mar 19, 2024

We are using buildkit through the Tiltfile extension kubectl_build, which hides a lot of the possible settings. It uses the buildx-stable-1 image and by default passes --oci-worker=false and --containerd-worker=true on the CLI.

I've verified in this setup that the current buildx-stable-1, whose SHA matches v0.13.1, doesn't fix this issue when the OCI worker is not enabled, but the problem doesn't repro if I forcibly enable that worker in the container spec.
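For reference, a minimal sketch of what forcing the OCI worker on looks like in such a container spec, assuming the two defaults are simply flipped (an illustration only, not the exact change made here):

      containers:
      - image: docker.io/moby/buildkit:buildx-stable-1
        args:
        # build with BuildKit's built-in OCI worker instead of delegating to containerd
        - --oci-worker=true
        - --containerd-worker=false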

@tonistiigi
Member

@brettmorien Are you using some kind of configuration where containerd is running in a different container/rootfs than buildkit?

@brettmorien

We are using this extension for Tilt for local development: https://github.com/tilt-dev/tilt-extensions/tree/master/kubectl_build

Translated through all the layers in between, that ends up as a container spec that looks like:

  template:
    metadata:
      labels:
        app: buildkit
        rootless: "false"
        runtime: containerd
        worker: containerd
    spec:
      containers: 
      - args:
        - --oci-worker=false
        - --containerd-worker=true
        - --root
        - /var/lib/buildkit/buildkit
        image: docker.io/moby/buildkit:buildx-stable-1

@tonistiigi
Member

@brettmorien Looks like that setup is indeed running containerd and buildkit in separate containers and setting up some volume mounts between them. https://github.com/vmware-archive/buildkit-cli-for-kubectl/blob/1db649b1f50268d857d0cfd36335800c72d2cf50/pkg/driver/kubernetes/manifest/manifest.go#L178-L209 Since the OTEL socket lives in /run/buildkit, that path would need to be exposed to the other container as well if the buildkit container is not the one launching containers.
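A rough sketch of the extra wiring that would mean for the buildkit pod, assuming the socket directory is shared with the node via a hostPath (the volume name is made up here; this is not what the CLI currently generates):

      volumes:
      - name: run-buildkit          # hypothetical name
        hostPath:
          path: /run/buildkit       # buildkitd creates otel-grpc.sock here
      containers:
      - name: buildkitd
        volumeMounts:
        - name: run-buildkit
          mountPath: /run/buildkit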

@georgethebeatle is your setup somewhat similar to @brettmorien's?

@georgethebeatle
Author

@tonistiigi we are using kbld's "buildkit CLI for kubectl" configuration, which seems to boil down to https://github.com/vmware-archive/buildkit-cli-for-kubectl - the same thing that @brettmorien ends up using, so I guess the setup is effectively the same. Are you suggesting that this is a bug in buildkit-cli-for-kubectl?

@georgethebeatle
Author

georgethebeatle commented Mar 21, 2024

FWIW I tried the same build with a patched version of buildkit-cli-for-kubectl, to no avail. I added the following volume and volume mount to the code highlighted above:

corev1.Volume{
    Name: "run-buildkit",
    VolumeSource: corev1.VolumeSource{
        HostPath: &corev1.HostPathVolumeSource{
            Path: "/var/run/buildkit",
            Type: &hostPathDirectory,
        },
    },
},
//...
corev1.VolumeMount{
    Name:             "run-buildkit",
    MountPath:        "/run/buildkit",
    MountPropagation: &mountPropagationBidirectional,
},

Then I started getting this error:

cloudfoundry/korifi-controllers | starting build (using kubectl buildkit): . -> trinity.common.repositories.cloud.sap/trinity/korifi-controllers:rand-1711010529163389687-228249185235254-cloudfoundry-korifi-controllers
cloudfoundry/korifi-api | #1 [internal] booting buildkit
cloudfoundry/korifi-controllers | #1 [internal] booting buildkit
cloudfoundry/korifi-api | #1 waiting for 1 pods to be ready for buildkit
cloudfoundry/korifi-controllers | #1 waiting for 1 pods to be ready for buildkit
cloudfoundry/korifi-controllers | #1 0.005 Warning      failed to create configmap configmaps "buildkit" already exists - retrying...
cloudfoundry/korifi-api | #1 0.540 Normal       buildkit-6dd8f4bc7d     SuccessfulCreate        Created pod: buildkit-6dd8f4bc7d-czbrk
cloudfoundry/korifi-api | #1 0.544 Normal       buildkit-6dd8f4bc7d-czbrk       Scheduled       Successfully assigned default/buildkit-6dd8f4bc7d-czbrk to trinity-control-plane
cloudfoundry/korifi-api | #1 0.544 Warning      buildkit-6dd8f4bc7d-czbrk       FailedMount     MountVolume.SetUp failed for volume "docker-sock" : hostPath type check failed: /var/run/docker.sock is not a socket file
cloudfoundry/korifi-api | #1 0.544 Warning      initial attempt to deploy configured for the docker runtime failed, retrying with containerd
cloudfoundry/korifi-controllers | #1 0.901 Normal       buildkit-59dcd4b999     SuccessfulCreate        Created pod: buildkit-59dcd4b999-m6wq6
cloudfoundry/korifi-controllers | #1 0.905 Normal       buildkit-59dcd4b999-m6wq6       Scheduled       Successfully assigned default/buildkit-59dcd4b999-m6wq6 to trinity-control-plane
cloudfoundry/korifi-controllers | #1 0.905 Warning      buildkit-59dcd4b999-m6wq6       FailedMount     MountVolume.SetUp failed for volume "run-buildkit" : hostPath type check failed: /var/run/buildkit is not a directory
cloudfoundry/korifi-api | #1 1.407 Normal       buildkit-59dcd4b999     SuccessfulCreate        Created pod: buildkit-59dcd4b999-m6wq6
cloudfoundry/korifi-api |
cloudfoundry/korifi-api | #1 1.412 Normal       buildkit-59dcd4b999-m6wq6       Scheduled       Successfully assigned default/buildkit-59dcd4b999-m6wq6 to trinity-control-plane
cloudfoundry/korifi-api | #1 1.412 Warning      buildkit-59dcd4b999-m6wq6       FailedMount     MountVolume.SetUp failed for volume "run-buildkit" : hostPath type check failed: /var/run/buildkit is not a directory

When I shell into the kind cluster's docker container I cannot find an OTEL socket anywhere (indeed, /var/run/buildkit does not exist).

Another interesting finding is that when I run the same build against a remote Kubernetes cluster (with the original, unpatched version of buildkit-cli-for-kubectl) the build succeeds. When I shell into the buildkit pod I can see that /run/buildkit/otel-grpc.sock exists. So it looks like there is some difference between kind and non-kind clusters with regard to the OTEL socket.
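For completeness, the "is not a directory" failures above come from the hostPath type check: the patch requests an existing directory (hostPathDirectory) at /var/run/buildkit, and that directory simply does not exist on the kind node. A hostPath type that lets the kubelet create it would look roughly like this (a guess on my part; whether containerd then actually sees the socket is a separate question):

      hostPath:
        path: /run/buildkit
        type: DirectoryOrCreate     # kubelet creates the directory on the node if it is missing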

@tonistiigi tonistiigi changed the title Buildkit build failing with latest moby/buildkit:buildx-stable-1 image tag [v0.13] OTEL socket mount requires new volumes when containerd is run in separate container Mar 22, 2024
@georgethebeatle
Author

@tonistiigi are you sure the cause is running containerd in a separate container? If so, wouldn't it also reproduce on remote clusters? For us it only happens on kind.

@thompson-shaun
Collaborator

thompson-shaun commented Jun 20, 2024

Closing, since this doesn't appear to be a buildkit error directly; it seems related to how the outdated buildkit-cli-for-kubectl tool configures volumes when deploying buildkit.
