
Failed to deploy kubeflow on a kind cluster #2901

Closed
6 of 7 tasks
Al-Pragliola opened this issue Oct 25, 2024 · 3 comments

Al-Pragliola commented Oct 25, 2024

Validation Checklist

  • Is this a Kubeflow issue?
  • Are you posting in the right repository?
  • Did you follow the Kubeflow installation guideline?
  • Is the issue report properly structured and detailed with version numbers?
  • Is this for Kubeflow development?
  • Would you like to work on this issue?
  • You can join the CNCF Slack and access our meetings at the Kubeflow Community website. Our channel on the CNCF Slack is #kubeflow-platform.

Version

master

Describe your issue

I encountered an error while deploying Kubeflow on a kind cluster. The error logs indicate a failure in creating a containerd task due to an inability to start a new OS thread, suggesting that the maximum number of user processes might need to be increased (ulimit -u). Here is the specific error message:

Error: failed to create containerd task: failed to start shim: start failed: runtime: failed to create new OS thread (have 5 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)

My system meets all the prerequisites, and I am running Fedora 40.

Here's the output of ulimit -a on my host system:

real-time non-blocking time  (microseconds, -R) 200000
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 251414
max locked memory           (kbytes, -l) 8192
max memory size             (kbytes, -m) unlimited
open files                          (-n) 8388608
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 8388608
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

and the limits on the control-plane (of kind):

docker exec -it kubeflow-control-plane bash
root@kubeflow-control-plane:/# ulimit -a
real-time non-blocking time  (microseconds, -R) 200000
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 251414
max locked memory           (kbytes, -l) 8192
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1048576
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 1048576
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited
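
Since ulimit inside the node looks essentially the same as on the host, the per-container pids limit might also be worth checking; a rough sketch (assuming Docker as the kind provider and cgroup v2 — paths and fields may differ on other setups):

# pids limit configured on the kind node container (assuming Docker)
docker inspect kubeflow-control-plane --format '{{.HostConfig.PidsLimit}}'

# pids limit as seen by the cgroup inside the node (path assumes cgroup v2 with the pids controller enabled)
docker exec kubeflow-control-plane cat /sys/fs/cgroup/pids.max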

Workaround:

Add two worker nodes to the kind cluster:

cat <<EOF | kind create cluster --name=kubeflow --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  image: kindest/node:v1.31.0@sha256:53df588e04085fd41ae12de0c3fe4c72f7013bba32a20e7325357a1ac94ba865
  kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    apiServer:
      extraArgs:
        "service-account-issuer": "kubernetes.default.svc"
        "service-account-signing-key-file": "/etc/kubernetes/pki/sa.key"
- role: worker
  image: kindest/node:v1.31.0@sha256:53df588e04085fd41ae12de0c3fe4c72f7013bba32a20e7325357a1ac94ba865

- role: worker
  image: kindest/node:v1.31.0@sha256:53df588e04085fd41ae12de0c3fe4c72f7013bba32a20e7325357a1ac94ba865
EOF
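
A quick way to confirm that all three nodes registered (assuming the cluster name kubeflow from the config above):

kind get nodes --name kubeflow
kubectl get nodes -o wide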

Steps to reproduce the issue

  1. run
sudo sysctl fs.inotify.max_user_instances=2280
sudo sysctl fs.inotify.max_user_watches=1255360
  2. run
cat <<EOF | kind create cluster --name=kubeflow --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  image: kindest/node:v1.31.0@sha256:53df588e04085fd41ae12de0c3fe4c72f7013bba32a20e7325357a1ac94ba865
  kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    apiServer:
      extraArgs:
        "service-account-issuer": "kubernetes.default.svc"
        "service-account-signing-key-file": "/etc/kubernetes/pki/sa.key"
EOF
  3. run
while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 20; done
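
Once the apply loop settles, the failure surfaces in the pod events; a rough way to spot it (pod name and namespace are placeholders, adjust to whichever pod is stuck):

# watch pods across all namespaces until one goes into Error / CrashLoopBackOff
kubectl get pods -A -w

# then inspect the events of a failing pod
kubectl describe pod <pod-name> -n <namespace>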

Put here any screenshots or videos (optional)

(Screenshots attached in the original issue.)

@juliusvonkohout (Member) commented:

@tarekabouzeid @diegolovison are you able to reproduce this?
I also use Fedora for Kubeflow development.

@tarekabouzeid (Member) commented:

Hi @Al-Pragliola,
I installed Fedora 40 in a VM and tried to reproduce this by following the steps mentioned in the issue, but didn't get the same results.
I used Docker Engine, but here is a reported problem with Podman, if that's what you are using.
My environment:

kind version 0.24.0
Docker version 27.3.1, build ce12230
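
If it helps to rule out the Podman path, this is roughly how one can check which provider kind is using (KIND_EXPERIMENTAL_PROVIDER is the variable kind reads for an explicit override; when it is unset, kind picks an engine on its own):

# installed container engines
docker --version
podman --version

# explicit provider override, if any
echo "${KIND_EXPERIMENTAL_PROVIDER:-not set}"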

@juliusvonkohout (Member) commented:

Thank you @tarekabouzeid. Our CI/CD is also green, so please reopen if this is still valid.
