sporadic cluster creation timeout #412

Closed

swachter opened this issue Mar 27, 2019 · 9 comments

Labels
kind/support Categorizes issue or PR as a support question.

Comments

swachter commented Mar 27, 2019

We use kind to create a local Kubernetes cluster during a GitLab CI job. The cluster creation fails sporadically with a timeout.

Attached are:

  • the debug output of kind
  • the output of systemctl status kubelet
  • the output of journalctl -xeu kubelet

The systemctl and journalctl commands were executed inside the kind-control-plane Docker container. The output of docker ps -a inside that container shows that no containers were created.

The journalctl output shows a couple of errors. The most significant one seems to be 'no space left on device' when starting cAdvisor.

I wonder how the device space is constrained and whether there are parameters that might remedy the situation. kind uses the docker:dind service that is configured for the GitLab CI job. Both the build container and the dind container run in the same pod. kind addresses Docker via tcp://localhost:2375. The kind-control-plane container in turn runs "inside" the dind container and brings its own Docker. Finally, that innermost Docker tries to run cAdvisor.

In my understanding there are three nested levels of Docker (a sketch follows the list):

  • The GitLab CI Docker daemon that runs the job pod.
  • The dind service inside the pod.
  • The Docker daemon contained in the kind node image.
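
For illustration, a minimal sketch of how the nesting can be probed from the build container (the DOCKER_HOST value and node name are the ones described above; level 1 is not reachable from inside the job):

# Level 2: the dind daemon the job talks to
export DOCKER_HOST=tcp://localhost:2375
docker info --format '{{.ServerVersion}}'

# kind creates its node container on that daemon
kind create cluster
docker ps --filter name=kind-control-plane

# Level 3: the Docker daemon shipped inside the kind node image
docker exec kind-control-plane docker info --format '{{.ServerVersion}}'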

kind-1.log
systemctl-status-kubelet.txt
journalctl-xeu-kubelet.txt

aojea (Contributor) commented Mar 27, 2019

You have several errors about no space left on the device:

sys/fs/cgroup/blkio/docker/ebd0b4c8f8840ef15d77d256089b3c79bdfe85ab8152559f5abd5ee5b67c4463/system.slice: no space left on device

Is it possible that's the cause?

swachter (Author) commented Mar 27, 2019

The "no space left" messages are definitly worrisome. I wonder, if something is wrong with disk space limits in view of three levels of containerization.

BenTheElder (Member) commented

quick response between meetings, apologies for terseness!

kubelet will see the actual backing host disk. Disk space has no isolation, and kind does not do disk eviction etc., because we cannot guarantee how much space users will reserve on the host.

if you're actually out of space though, there's really not much kind can do.
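
A quick way to see this lack of isolation (a sketch; kind-control-plane is kind's default node name, and /var/lib/docker is assumed to be the host daemon's storage location):

# Capacity/usage of the partition backing the host daemon
df -k /var/lib/docker

# The overlay root inside the kind node reports the same numbers:
# there is no per-container disk quota
docker exec kind-control-plane df -k /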

@BenTheElder BenTheElder added the kind/support Categorizes issue or PR as a support question. label Mar 28, 2019
swachter (Author) commented

I checked the disk usage and it seems that there should be space. Yet I am not sure how to interpret the output of docker system df. Here is the output of docker system df for all three involved Docker daemons:

  1. On the cluster node (GKE)
TYPE                TOTAL               ACTIVE              SIZE                RECLAIMABLE
Images              46                  7                   19.1 GB             18.6 GB (97%)
Containers          11                  11                  8.336 kB            0 B (0%)
Local Volumes       3                   3                   1.416 kB            0 B (0%)
  2. In the pod (i.e. the dind service in the pod)
TYPE                TOTAL               ACTIVE              SIZE                RECLAIMABLE
Images              1                   1                   1.579GB             0B (0%)
Containers          1                   1                   70.3kB              0B (0%)
Local Volumes       1                   1                   814.5MB             0B (0%)
Build Cache         0                   0                   0B                  0B
  3. In the kind-control-plane
TYPE                TOTAL               ACTIVE              SIZE                RECLAIMABLE
Images              9                   0                   813.8MB             813.8MB (100%)
Containers          0                   0                   0B                  0B
Local Volumes       0                   0                   0B                  0B
Build Cache         0                   0                   0B                  0B

In addition, here is the output of df on the cluster node and inside the kind-control-plane:

  1. On the cluster node
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/root        1249792   476548    773244  39% /
devtmpfs         6665920        0   6665920   0% /dev
tmpfs            6667632        0   6667632   0% /dev/shm
tmpfs            6667632     1012   6666620   1% /run
tmpfs            6667632        0   6667632   0% /sys/fs/cgroup
tmpfs            6667632    44600   6623032   1% /tmp
tmpfs                256        0       256   0% /mnt/disks
/dev/sda8          11760       28     11408   1% /usr/share/oem
/dev/sda1       98868448 21431520  77420544  22% /mnt/stateful_partition
tmpfs               1024      136       888  14% /var/lib/cloud
overlayfs           1024      172       852  17% /etc
  2. Inside the kind-control-plane
Filesystem     1K-blocks     Used Available Use% Mounted on
overlay         98868448 24217352  74634712  25% /
tmpfs              65536        0     65536   0% /dev
tmpfs            6667632        0   6667632   0% /sys/fs/cgroup
tmpfs            6667632     8220   6659412   1% /run
tmpfs            6667632        0   6667632   0% /tmp
/dev/sda1       98868448 24217352  74634712  25% /etc/hosts
shm                65536        0     65536   0% /dev/shm
/dev/root        1249792   476548    773244  39% /lib/modules
tmpfs               5120        0      5120   0% /run/lock
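
For completeness, these outputs can be collected against each of the three daemons roughly like this (a sketch; the DOCKER_HOST value and node name match the setup above):

# 1. On the GKE cluster node (host daemon)
docker system df; df -k

# 2. Against the dind service, from the build container
DOCKER_HOST=tcp://localhost:2375 docker system df

# 3. Inside the kind-control-plane container (innermost daemon)
docker exec kind-control-plane docker system df
docker exec kind-control-plane df -k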

swachter (Author) commented Mar 28, 2019

Good (bad) news: The creation failure is permanent now (at least the last couple of times) ;-(

swachter (Author) commented

It seems that our problem is related to leaking cgroups. The error message "no space left on device" is given while accessing /sys/fs/cgroup/... folders, not the disk itself (a quick check below). A ticket that hints in that direction is moby/moby#29638.
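
A quick way to check this on the affected host (a sketch; /proc/cgroups is standard on cgroup-v1 kernels):

# Live cgroup count per subsystem; with a leak these counts grow until
# creating a new cgroup fails with ENOSPC, regardless of free disk space
cat /proc/cgroups

# Or count directories under a suspect hierarchy directly
find /sys/fs/cgroup/memory -type d | wc -l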

I assume that the problem has nothing to do with kind and therefore close the ticket.

swachter (Author) commented

For the record: restarting the node helped.

BenTheElder (Member) commented

FTR: pretty sure this is running out of inotify watches, see google/cadvisor#1581 (comment)
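
For anyone landing here: cAdvisor surfaces exhausted inotify limits as the same ENOSPC / "no space left on device" error, even with plenty of free disk. A sketch for checking and raising the host limits (the values are illustrative, not a recommendation):

# Current limits
sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances

# Raise for the running kernel; persist via /etc/sysctl.d/ if this helps
sudo sysctl -w fs.inotify.max_user_watches=524288
sudo sysctl -w fs.inotify.max_user_instances=512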

aojea (Contributor) commented Apr 1, 2019

@swachter can you upload the cluster logs somewhere with kind export logs?
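
For reference, a typical invocation (the output directory is illustrative):

kind export logs ./kind-412-logs
# attach the resulting directory (kubelet and container logs, docker info, ...)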
