This repository has been archived by the owner on Jun 28, 2023. It is now read-only.

unable to create bootstrap cluster: failed to create kind cluster tkg-kind- #2138

Closed
syangsao opened this issue Oct 4, 2021 · 14 comments · Fixed by #2220
Labels: kind/docs (A change in documentation), owner/docs (Work executed by VMware documentation team)

Comments

syangsao commented Oct 4, 2021

Bug Report

Installation fails with the following error:

tanzu management-cluster create --ui

Validating the pre-requisites...
Serving kickstart UI at http://127.0.0.1:8080
Identity Provider not configured. Some authentication features won't work.
Validating configuration...
web socket connection established
sending pending 2 logs to UI
Using infrastructure provider docker:v0.3.23
Generating cluster configuration...
Setting up bootstrapper...
unable to set up management cluster, : unable to create bootstrap cluster: failed to create kind cluster tkg-kind-c5dmk170futgf1a95cv0: failed to init node with kubeadm: command "docker exec --privileged tkg-kind-c5dmk170futgf1a95cv0-control-plane kubeadm init --skip-phases=preflight --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: exit status 1

Expected Behavior

The kind bootstrap cluster should be created and the installation should run through.

Steps to Reproduce the Bug

  1. Install Fedora 34 workstation with base rpms
  2. Install docker CE following https://docs.docker.com/engine/install/fedora/
  3. sudo usermod -a -G docker <username>
  4. Reboot workstation
  5. Download and extract the tce-linux-amd64-v0.9.1.tar.gz
  6. Run the following command tanzu management-cluster create --ui

Screenshots or additional information and context


Environment Details

  • Build version (tanzu version): v0.2.1
  • Deployment (Managed/Standalone cluster): Standalone
  • Infrastructure Provider (Docker/AWS/Azure/vSphere): Docker
  • Operating System (client): Fedora 34 (5.14.9-200.fc34.x86_64)

Diagnostics and log bundle

The tanzu diagnostics collect fails to capture anything.

$ tanzu diagnostics collect
2021/10/04 16:02:09 Collecting bootstrap cluster diagnostics
2021/10/04 16:02:09 Error: kind program binary not found
2021/10/04 16:02:09 Error: One or more required program(s) missing
2021/10/04 16:02:09 Warn: skipping management cluster diagnostics: management cluster: name not set
2021/10/04 16:02:09 Warn: skipping workload cluster diagnostics: workload cluster: name not set
@syangsao syangsao added kind/bug A bug in an existing capability triage/needs-triage Needs triage by TCE maintainers labels Oct 4, 2021
github-actions bot commented Oct 4, 2021

Hey @syangsao! Thanks for opening your first issue. We appreciate your contribution and welcome you to our community! We are glad to have you here and to have your input on Tanzu Community Edition.

figo commented Oct 4, 2021

Hi @syangsao, could you do the following to help us understand the issue better?

  1. In your setup, install the kind CLI from https://github.com/kubernetes-sigs/kind.
  2. Run kind create cluster.

If that fails, it means your setup is not ready to create a kind cluster.
If the kind cluster can be created successfully with the kind CLI in your setup, we need more logs: please run tanzu management-cluster create -v 9.
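As a side note, the "kind program binary not found" error from tanzu diagnostics collect above is the same precondition as step 1. A minimal sketch of that PATH check (the helper name is mine, not part of any tool):

```python
import shutil


def kind_available() -> bool:
    """Return True if a `kind` executable is on the PATH (hypothetical helper)."""
    return shutil.which("kind") is not None


if __name__ == "__main__":
    if kind_available():
        print("kind found; try `kind create cluster` next")
    else:
        print("kind missing; install it first, then retry")
```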

syangsao commented Oct 4, 2021

Installed kind and ran the kind cluster create command without any failures.

The tanzu management-cluster create -v 9 shows the following output. I have not cleaned up the last attempted installation yet and it is still running in another window with the exit status 1 error. Should I restart the installation?

Screenshot from 2021-10-04 18-37-40

figo commented Oct 5, 2021

@syangsao sorry, please try to run kind create cluster to actually create the kind cluster.

tvanderka commented:
Just a guess: this is caused by the old containerd 1.3.x runtime in the VMware kind/node image, which does not support cgroup v2. Kind with containerd 1.5 works fine on Fedora 34. Similar issue from minikube: kubernetes/minikube#11310.
Exec into the kind container while the install is running and look at journalctl -fu containerd.
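This hypothesis can be checked from the host before installing anything: on a unified (v2) hierarchy, the file cgroup.controllers exists at the cgroup mount root, while on v1 it does not. A minimal sketch, assuming the standard /sys/fs/cgroup mount point (the function itself is mine):

```python
from pathlib import Path


def cgroup_version(root: str = "/sys/fs/cgroup") -> int:
    """Return 2 for a unified cgroup v2 hierarchy, 1 otherwise.

    cgroup v2 exposes `cgroup.controllers` at the mount root; cgroup v1
    instead populates the root with per-controller directories.
    """
    return 2 if (Path(root) / "cgroup.controllers").is_file() else 1
```

On a default Fedora 34 install this reports 2, which is exactly the combination the containerd 1.3.x shim in the node image cannot handle.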

mstefany commented Oct 7, 2021

It seems it is really related to cgroup v2 (running Fedora 34):

Oct 07 18:12:53 tkg-kind-c5fjg6v01b0qpem9ta20-control-plane containerd[100]: time="2021-10-07T18:12:53.036761050Z" level=info msg="RunPodsandbox for &PodSandboxMetadata{Name:kube-controller-manager-tkg-kind-c5fjg6v01b0qpem9ta20-control-plane,Uid:c20aaad9a0dba937a81e7bfdf96beb26,Namespace:kube-system,Attempt:0,}"
Oct 07 18:12:53 tkg-kind-c5fjg6v01b0qpem9ta20-control-plane containerd[100]: time="2021-10-07T18:12:53.058588353Z" level=info msg="starting signal loop" namespace=k8s.io path=/run/containerd/io.containerd.runtime.v2.task/k8s.io/e1940f974520601a6365a52411c223c39c23729be4874b2d6eb3bbf2f0006d12 pid=2767
Oct 07 18:12:53 tkg-kind-c5fjg6v01b0qpem9ta20-control-plane containerd[100]: time="2021-10-07T18:12:53.127894158Z" level=error msg="loading cgroup for 2791" error="cgroups: cgroup mountpoint does not exist"
Oct 07 18:12:53 tkg-kind-c5fjg6v01b0qpem9ta20-control-plane containerd[100]: time="2021-10-07T18:12:53.374520447Z" level=error msg="loading cgroup for 2791" error="cgroups: cgroup mountpoint does not exist"
Oct 07 18:12:53 tkg-kind-c5fjg6v01b0qpem9ta20-control-plane containerd[100]: time="2021-10-07T18:12:53.377250053Z" level=info msg="shim disconnected" id=e1940f974520601a6365a52411c223c39c23729be4874b2d6eb3bbf2f0006d12
Oct 07 18:12:53 tkg-kind-c5fjg6v01b0qpem9ta20-control-plane containerd[100]: time="2021-10-07T18:12:53.377303980Z" level=warning msg="cleaning up after shim disconnected" id=e1940f974520601a6365a52411c223c39c23729be4874b2d6eb3bbf2f0006d12 namespace=k8s.io
Oct 07 18:12:53 tkg-kind-c5fjg6v01b0qpem9ta20-control-plane containerd[100]: time="2021-10-07T18:12:53.377316494Z" level=info msg="cleaning up dead shim"
Oct 07 18:12:53 tkg-kind-c5fjg6v01b0qpem9ta20-control-plane containerd[100]: time="2021-10-07T18:12:53.377362452Z" level=error msg="Failed to delete sandbox container \"e1940f974520601a6365a52411c223c39c23729be4874b2d6eb3bbf2f0006d12\"" error="ttrpc: closed: unknown"
Oct 07 18:12:53 tkg-kind-c5fjg6v01b0qpem9ta20-control-plane containerd[100]: time="2021-10-07T18:12:53.381549023Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:kube-controller-manager-tkg-kind-c5fjg6v01b0qpem9ta20-control-plane,Uid:c20aaad9a0dba937a81e7bfdf96beb26,Namespace:kube-system,Attempt:0,} failed, error" error="failed to start sandbox container task \"e1940f974520601a6365a52411c223c39c23729be4874b2d6eb3bbf2f0006d12\": ttrpc: closed: unknown"
Oct 07 18:12:53 tkg-kind-c5fjg6v01b0qpem9ta20-control-plane containerd[100]: time="2021-10-07T18:12:53.501095979Z" level=warning msg="cleanup warnings time=\"2021-10-07T18:12:53Z\" level=info msg=\"starting signal loop\" namespace=k8s.io pid=2807\n"

syangsao commented Oct 7, 2021

@syangsao sorry, please try to run kind create cluster to actually create the kind cluster.

My bad, my syntax was incorrect. I re-ran kind create cluster and verified that it runs fine.


The installation still fails at the same error.

tanzu management-cluster create --ui

Downloading TKG compatibility file from 'projects.registry.vmware.com/tkg/framework-zshippable/tkg-compatibility'
Downloading the TKG Bill of Materials (BOM) file from 'projects.registry.vmware.com/tkg/tkg-bom:v1.4.0'
Downloading the TKr Bill of Materials (BOM) file from 'projects.registry.vmware.com/tkg/tkr-bom:v1.21.2_vmware.1-tkg.1'
ERROR 2021/10/07 14:19:35 svType != tvType; key=release, st=map[string]interface {}, tt=<nil>, sv=map[version:], tv=<nil>

Validating the pre-requisites...
Serving kickstart UI at http://127.0.0.1:8080
Identity Provider not configured. Some authentication features won't work.
Validating configuration...
web socket connection established
sending pending 2 logs to UI
Using infrastructure provider docker:v0.3.23
Generating cluster configuration...
Setting up bootstrapper...
unable to set up management cluster, : unable to create bootstrap cluster: failed to create kind cluster tkg-kind-c5fkgmn0futit0abc4ag: failed to init node with kubeadm: command "docker exec --privileged tkg-kind-c5fkgmn0futit0abc4ag-control-plane kubeadm init --skip-phases=preflight --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: exit status 1

journalctl -fu containerd shows the following:

-- Journal begins at Mon 2020-05-18 08:20:31 CDT. --
Oct 07 13:07:10 degobah containerd[1029]: time="2021-10-07T13:07:10.470406346-05:00" level=info msg="loading plugin \"io.containerd.grpc.v1.version\"..." type=io.containerd.grpc.v1
Oct 07 13:07:10 degobah containerd[1029]: time="2021-10-07T13:07:10.470413854-05:00" level=info msg="loading plugin \"io.containerd.grpc.v1.introspection\"..." type=io.containerd.grpc.v1
Oct 07 13:07:10 degobah containerd[1029]: time="2021-10-07T13:07:10.471241795-05:00" level=info msg=serving... address=/run/containerd/containerd.sock.ttrpc
Oct 07 13:07:10 degobah containerd[1029]: time="2021-10-07T13:07:10.471295452-05:00" level=info msg=serving... address=/run/containerd/containerd.sock
Oct 07 13:07:10 degobah containerd[1029]: time="2021-10-07T13:07:10.471761532-05:00" level=info msg="containerd successfully booted in 0.338183s"
Oct 07 13:07:10 degobah systemd[1]: Started containerd container runtime.
Oct 07 14:10:16 degobah containerd[1029]: time="2021-10-07T14:10:16.852480151-05:00" level=info msg="starting signal loop" namespace=moby path=/run/containerd/io.containerd.runtime.v2.task/moby/cddd706f33140f00f29dc637d09d92c112c8efd2d975905d71f9a7d74e131d5c pid=4731
Oct 07 14:14:06 degobah containerd[1029]: time="2021-10-07T14:14:06.129039397-05:00" level=info msg="shim disconnected" id=cddd706f33140f00f29dc637d09d92c112c8efd2d975905d71f9a7d74e131d5c
Oct 07 14:14:06 degobah containerd[1029]: time="2021-10-07T14:14:06.129081387-05:00" level=error msg="copy shim log" error="read /proc/self/fd/11: file already closed"
Oct 07 14:20:19 degobah containerd[1029]: time="2021-10-07T14:20:19.192025058-05:00" level=info msg="starting signal loop" namespace=moby path=/run/containerd/io.containerd.runtime.v2.task/moby/53aa7941d1fae008cb7cb662fcaf28086366b5f873ac4f6d16586688953fff43 pid=9487
Oct 07 14:24:22 degobah containerd[1029]: time="2021-10-07T14:24:22.537754905-05:00" level=info msg="shim disconnected" id=53aa7941d1fae008cb7cb662fcaf28086366b5f873ac4f6d16586688953fff43
Oct 07 14:24:22 degobah containerd[1029]: time="2021-10-07T14:24:22.537835036-05:00" level=error msg="copy shim log" error="read /proc/self/fd/11: file already closed"

anibal-aguila commented Oct 7, 2021

Same issue here:

an error has occurred:
        timed out waiting for the condition

This error is likely caused by:
        - The kubelet is not running
        - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
        - 'systemctl status kubelet'
        - 'journalctl -xeu kubelet'

Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI.
Here is one example how you may list all Kubernetes containers running in cri-o/containerd using crictl:
        - 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
Once you have found the failing container, you can inspect its logs with:
        - 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock logs CONTAINERID'

couldn't initialize a Kubernetes cluster
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/init.runWaitControlPlanePhase
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/init/waitcontrolplane.go:114
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:234
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:421
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:207
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdInit.func1
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/init.go:152
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:850
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:958
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:895
k8s.io/kubernetes/cmd/kubeadm/app.Run
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/kubeadm.go:50
main.main
        _output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/kubeadm.go:25
runtime.main
        /usr/local/go/src/runtime/proc.go:225
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1371
error execution phase wait-control-plane
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:235
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:421
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:207
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdInit.func1
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/init.go:152
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:850
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:958
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:895
k8s.io/kubernetes/cmd/kubeadm/app.Run
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/kubeadm.go:50
main.main
        _output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/kubeadm.go:25
runtime.main
        /usr/local/go/src/runtime/proc.go:225
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1371

figo commented Oct 8, 2021

We are aware of this issue: the tkg/kind/node image does not support cgroup v2 yet. The workaround is to run the kind cluster on a Linux kernel with cgroup v1 (a slightly older Linux).

anibal-aguila commented Oct 8, 2021

Thanks @figo. After regenerating the grub config to use cgroup v1,
I get an error loop from mgmt-control-plane:

Oct 08 15:30:11 tce-mgmt-control-plane-pr26z kubelet[1027]: E1008 15:30:11.234381    1027 kubelet.go:1384] "Failed to start ContainerManager" err="failed to get rootfs info: failed to get device for dir \"/var/lib/kubelet\": could not find device with major: 0, minor: 29 in cached partitions map"
Oct 08 15:30:18 tce-mgmt-control-plane-pr26z kubelet[1063]: E1008 15:30:18.874508    1063 aws_credentials.go:77] while getting AWS credentials NoCredentialProviders: no valid providers in chain. Deprecated.
Oct 08 15:30:18 tce-mgmt-control-plane-pr26z kubelet[1063]: E1008 15:30:18.878613    1063 cri_stats_provider.go:369] "Failed to get the info of the filesystem with mountpoint" err="failed to get device for dir \"/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs\": could not find device with major: 0, minor: 29 in cached partitions map" mountpoint="/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs"
Oct 08 15:30:18 tce-mgmt-control-plane-pr26z kubelet[1063]: E1008 15:30:18.878663    1063 kubelet.go:1306] "Image garbage collection failed once. Stats initialization may not have completed yet" err="invalid capacity 0 on image filesystem"
Oct 08 15:30:18 tce-mgmt-control-plane-pr26z kubelet[1063]: E1008 15:30:18.879435    1063 event.go:273] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"tce-mgmt-control-plane-pr26z.16ac17e0bf268f99", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"tce-mgmt-control-plane-pr26z", UID:"tce-mgmt-control-plane-pr26z", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"Starting", Message:"Starting kubelet.", Source:v1.EventSource{Component:"kubelet", Host:"tce-mgmt-control-plane-pr26z"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xc05036e2b432ab99, ext:6377886713, loc:(*time.Location)(0x74bc600)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xc05036e2b432ab99, ext:6377886713, loc:(*time.Location)(0x74bc600)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Post "https://172.18.0.3:6443/api/v1/namespaces/default/events": EOF'(may retry after sleeping)
Oct 08 15:30:18 tce-mgmt-control-plane-pr26z kubelet[1063]: E1008 15:30:18.884161    1063 kubelet.go:2211] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Oct 08 15:30:18 tce-mgmt-control-plane-pr26z kubelet[1063]: E1008 15:30:18.896533    1063 kubelet_network_linux.go:79] "Failed to ensure that nat chain exists KUBE-MARK-DROP chain" err="error creating chain \"KUBE-MARK-DROP\": exit status 3: modprobe: ERROR: could not insert 'ip6_tables': Exec format error\nip6tables v1.8.4 (legacy): can't initialize ip6tables table `nat': Table does not exist (do you need to insmod?)\nPerhaps ip6tables or your kernel needs to be upgraded.\n"
Oct 08 15:30:18 tce-mgmt-control-plane-pr26z kubelet[1063]: E1008 15:30:18.896631    1063 kubelet.go:1870] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
Oct 08 15:30:18 tce-mgmt-control-plane-pr26z kubelet[1063]: E1008 15:30:18.911302    1063 manager.go:1123] Failed to create existing container: /docker/7b40b87a33e82c4e6d3c8a7c328f631c2c933a031e1defd9b653c802cf8d0418: failed to identify the read-write layer ID for container "7b40b87a33e82c4e6d3c8a7c328f631c2c933a031e1defd9b653c802cf8d0418". - open /var/lib/docker/image/btrfs/layerdb/mounts/7b40b87a33e82c4e6d3c8a7c328f631c2c933a031e1defd9b653c802cf8d0418/mount-id: no such file or directory
Oct 08 15:30:18 tce-mgmt-control-plane-pr26z kubelet[1063]: E1008 15:30:18.911949    1063 manager.go:1123] Failed to create existing container: /docker/7b40b87a33e82c4e6d3c8a7c328f631c2c933a031e1defd9b653c802cf8d0418/docker/7b40b87a33e82c4e6d3c8a7c328f631c2c933a031e1defd9b653c802cf8d0418: failed to identify the read-write layer ID for container "7b40b87a33e82c4e6d3c8a7c328f631c2c933a031e1defd9b653c802cf8d0418". - open /var/lib/docker/image/btrfs/layerdb/mounts/7b40b87a33e82c4e6d3c8a7c328f631c2c933a031e1defd9b653c802cf8d0418/mount-id: no such file or directory
Oct 08 15:30:18 tce-mgmt-control-plane-pr26z kubelet[1063]: E1008 15:30:18.935350    1063 manager.go:1123] Failed to create existing container: /docker/7b40b87a33e82c4e6d3c8a7c328f631c2c933a031e1defd9b653c802cf8d0418/docker/7b40b87a33e82c4e6d3c8a7c328f631c2c933a031e1defd9b653c802cf8d0418: failed to identify the read-write layer ID for container "7b40b87a33e82c4e6d3c8a7c328f631c2c933a031e1defd9b653c802cf8d0418". - open /var/lib/docker/image/btrfs/layerdb/mounts/7b40b87a33e82c4e6d3c8a7c328f631c2c933a031e1defd9b653c802cf8d0418/mount-id: no such file or directory
Oct 08 15:30:18 tce-mgmt-control-plane-pr26z kubelet[1063]: E1008 15:30:18.936557    1063 manager.go:1123] Failed to create existing container: /docker/7b40b87a33e82c4e6d3c8a7c328f631c2c933a031e1defd9b653c802cf8d0418: failed to identify the read-write layer ID for container "7b40b87a33e82c4e6d3c8a7c328f631c2c933a031e1defd9b653c802cf8d0418". - open /var/lib/docker/image/btrfs/layerdb/mounts/7b40b87a33e82c4e6d3c8a7c328f631c2c933a031e1defd9b653c802cf8d0418/mount-id: no such file or directory
Oct 08 15:30:18 tce-mgmt-control-plane-pr26z kubelet[1063]: E1008 15:30:18.951003    1063 kubelet.go:1384] "Failed to start ContainerManager" err="failed to get rootfs info: failed to get device for dir \"/var/lib/kubelet\": could not find device with major: 0, minor: 29 in cached partitions map"


Change to use cgroup v1 (add systemd.unified_cgroup_hierarchy=0, regenerate the grub config, reboot):

sudo vim /etc/default/grub
    GRUB_CMDLINE_LINUX=" ...  systemd.unified_cgroup_hierarchy=0"
sudo grub-mkconfig -o /boot/grub/grub.cfg
reboot

manifest-file.yaml

CLUSTER_CIDR: 100.96.0.0/11
CLUSTER_NAME: tce-mgmt
ENABLE_MHC: "false"
IDENTITY_MANAGEMENT_TYPE: none
INFRASTRUCTURE_PROVIDER: docker
LDAP_BIND_DN: ""
LDAP_BIND_PASSWORD: ""
LDAP_GROUP_SEARCH_BASE_DN: ""
LDAP_GROUP_SEARCH_FILTER: ""
LDAP_GROUP_SEARCH_GROUP_ATTRIBUTE: ""
LDAP_GROUP_SEARCH_NAME_ATTRIBUTE: cn
LDAP_GROUP_SEARCH_USER_ATTRIBUTE: DN
LDAP_HOST: ""
LDAP_ROOT_CA_DATA_B64: ""
LDAP_USER_SEARCH_BASE_DN: ""
LDAP_USER_SEARCH_FILTER: ""
LDAP_USER_SEARCH_NAME_ATTRIBUTE: ""
LDAP_USER_SEARCH_USERNAME: userPrincipalName
OIDC_IDENTITY_PROVIDER_CLIENT_ID: ""
OIDC_IDENTITY_PROVIDER_CLIENT_SECRET: ""
OIDC_IDENTITY_PROVIDER_GROUPS_CLAIM: ""
OIDC_IDENTITY_PROVIDER_ISSUER_URL: ""
OIDC_IDENTITY_PROVIDER_NAME: ""
OIDC_IDENTITY_PROVIDER_SCOPES: ""
OIDC_IDENTITY_PROVIDER_USERNAME_CLAIM: ""
OS_ARCH: ""
OS_NAME: ""
OS_VERSION: ""
SERVICE_CIDR: 100.64.0.0/13
TKG_HTTP_PROXY_ENABLED: "true"
CLUSTER_PLAN: dev
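One sanity check worth running on a config like the one above: CLUSTER_CIDR and SERVICE_CIDR must not overlap. A quick sketch with Python's stdlib ipaddress module (the check is mine, not something the installer exposes):

```python
import ipaddress

# Values copied from the manifest above.
cluster_cidr = ipaddress.ip_network("100.96.0.0/11")   # CLUSTER_CIDR
service_cidr = ipaddress.ip_network("100.64.0.0/13")   # SERVICE_CIDR

# Pod and service ranges must be disjoint.
if cluster_cidr.overlaps(service_cidr):
    raise ValueError("CLUSTER_CIDR and SERVICE_CIDR overlap")
print(f"{cluster_cidr} and {service_cidr} are disjoint")
```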

syangsao commented Oct 8, 2021

We are aware of this issue, the tkg/kind/node does not support cgroup v2 yet, the workaround is to run the kind cluster on Linux kernel with cgroup v1 (slightly older linux)

With Fedora 34, you don't need to run an older kernel release. There is a method to use cgroups v1 [1].

I ran the following command from the link and rebooted.

sudo grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=0"

Make sure you verify this is configured upon reboot.

cat /proc/cmdline

BOOT_IMAGE=(hd0,msdos6)/vmlinuz-5.14.9-200.fc34.x86_64 root=/dev/mapper/fedora_localhost--live-root ro resume=/dev/mapper/fedora_localhost--live-swap rd.lvm.lv=fedora_localhost-live/root rd.lvm.lv=fedora_localhost-live/swap rhgb quiet systemd.unified_cgroup_hierarchy=0
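The same verification can be scripted if you manage several workstations. A small sketch that parses a kernel command line for the flag (the function is mine; the sample string is an abridged copy of the /proc/cmdline output above):

```python
def cgroup_v1_forced(cmdline: str) -> bool:
    """True if the kernel command line pins systemd to the legacy cgroup v1 hierarchy."""
    return "systemd.unified_cgroup_hierarchy=0" in cmdline.split()


# Abridged from the /proc/cmdline output above.
sample = ("BOOT_IMAGE=(hd0,msdos6)/vmlinuz-5.14.9-200.fc34.x86_64 "
          "root=/dev/mapper/fedora_localhost--live-root ro rhgb quiet "
          "systemd.unified_cgroup_hierarchy=0")
assert cgroup_v1_forced(sample)
```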

cat /etc/default/grub

GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="resume=/dev/mapper/fedora_localhost--live-swap rd.lvm.lv=fedora_localhost-live/root rd.lvm.lv=fedora_localhost-live/swap rhgb quiet systemd.unified_cgroup_hierarchy=0"
GRUB_DISABLE_RECOVERY="true"
GRUB_ENABLE_BLSCFG=true

I just verified that the installation finished and is working for me on Fedora 34 with cgroups v1.

[1] https://fedoramagazine.org/docker-and-fedora-32/

martingruening commented:
I have exactly the same issue, but not on Fedora: on Debian 11 (amd64) with a 5.10 kernel / Docker 20.10.9.
It took me quite some time of research to find this issue. It would be great to have this information in the Getting Started section of the documentation (together with the other Docker-specific requirements).

figo commented Oct 11, 2021

cc @joshrosso @dvonthenen

davidvonthenen commented Oct 11, 2021

Yup, we can definitely add that in the getting started guide so people aren't spinning their wheels.
cc: @kcoriordan

@davidvonthenen davidvonthenen added proposal/acccepted Change is accepted owner/docs Work executed by VMware documentation team kind/docs A change in documentation and removed kind/bug A bug in an existing capability triage/needs-triage Needs triage by TCE maintainers labels Oct 11, 2021
@kcoriordan kcoriordan added this to the v0.10.0 milestone Oct 12, 2021
@joshrosso joshrosso removed the proposal/acccepted Change is accepted label Jan 14, 2022