Failed to deploy kubeflow on a kind cluster #2901

Al-Pragliola · 2024-10-25T12:53:26Z

Validation Checklist

Is this a Kubeflow issue?
Are you posting in the right repository ?
Did you follow the Kubeflow installation guideline ?
Is the issue report properly structured and detailed with version numbers?
Is this for Kubeflow development ?
Would you like to work on this issue?
You can join the CNCF Slack and access our meetings at the Kubeflow Community website. Our channel on the CNCF Slack is here #kubeflow-platform.

Version

master

Describe your issue

I encountered an error while deploying Kubeflow on a kind cluster. The logs show that containerd failed to create a task because the runtime could not start a new OS thread, and they suggest that the maximum number of user processes (ulimit -u) may need to be increased. The specific error message is:

Error: failed to create containerd task: failed to start shim: start failed: runtime: failed to create new OS thread (have 5 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)

My system meets all the prerequisites, and I am running Fedora 40.

Here's the output of ulimit -a on my host system:

real-time non-blocking time  (microseconds, -R) 200000
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 251414
max locked memory           (kbytes, -l) 8192
max memory size             (kbytes, -m) unlimited
open files                          (-n) 8388608
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 8388608
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

And here are the limits on the kind control-plane node:

docker exec -it kubeflow-control-plane bash
root@kubeflow-control-plane:/# ulimit -a
real-time non-blocking time  (microseconds, -R) 200000
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 251414
max locked memory           (kbytes, -l) 8192
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1048576
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 1048576
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited
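Since `ulimit -u` is already very high on both the host and the node, it may be worth checking limits that `ulimit -a` does not show. On Linux, errno=11 is EAGAIN, and thread creation can also return it when a system-wide thread or PID cap, or a cgroup PID limit, is reached. The following is a minimal diagnostic sketch, not a confirmed root cause: the helper name `show_limit` is made up for illustration, and the `pids.max` path assumes a cgroup v2 hierarchy mounted at /sys/fs/cgroup.

```shell
#!/usr/bin/env bash
# Sketch: print limits that can cause EAGAIN on thread creation
# but are not shown by `ulimit -a`.

# Print "<path>: <value>", or "<path>: n/a" if the file is not readable.
# (show_limit is a hypothetical helper, not part of kind or Kubeflow.)
show_limit() {
  local path="$1"
  if [ -r "$path" ]; then
    printf '%s: %s\n' "$path" "$(cat "$path")"
  else
    printf '%s: n/a\n' "$path"
  fi
}

# System-wide caps on threads and PIDs.
show_limit /proc/sys/kernel/threads-max
show_limit /proc/sys/kernel/pid_max

# PID cap of the cgroup this shell runs in.
# Assumption: cgroup v2, where /proc/self/cgroup holds a single "0::<path>" line.
cgroup_path="$(sed -n 's/^0:://p' /proc/self/cgroup 2>/dev/null)"
show_limit "/sys/fs/cgroup${cgroup_path}/pids.max"
```

The same reads can be repeated inside the node through `docker exec kubeflow-control-plane`, as with `ulimit -a` above. If a `pids.max` file is present there and holds a number rather than `max`, a container-level PID limit would be a more likely candidate than `ulimit -u`; whether the file exists depends on the cgroup setup of the container runtime.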

Workaround:

Add two worker nodes to the kind cluster:

cat <<EOF | kind create cluster --name=kubeflow --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  image: kindest/node:v1.31.0@sha256:53df588e04085fd41ae12de0c3fe4c72f7013bba32a20e7325357a1ac94ba865
  kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    apiServer:
      extraArgs:
        "service-account-issuer": "kubernetes.default.svc"
        "service-account-signing-key-file": "/etc/kubernetes/pki/sa.key"
- role: worker
  image: kindest/node:v1.31.0@sha256:53df588e04085fd41ae12de0c3fe4c72f7013bba32a20e7325357a1ac94ba865

- role: worker
  image: kindest/node:v1.31.0@sha256:53df588e04085fd41ae12de0c3fe4c72f7013bba32a20e7325357a1ac94ba865
EOF

Steps to reproduce the issue

Raise the inotify limits:

sudo sysctl fs.inotify.max_user_instances=2280
sudo sysctl fs.inotify.max_user_watches=1255360

Create a single-node kind cluster:

cat <<EOF | kind create cluster --name=kubeflow --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  image: kindest/node:v1.31.0@sha256:53df588e04085fd41ae12de0c3fe4c72f7013bba32a20e7325357a1ac94ba865
  kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    apiServer:
      extraArgs:
        "service-account-issuer": "kubernetes.default.svc"
        "service-account-signing-key-file": "/etc/kubernetes/pki/sa.key"
EOF

Apply the Kubeflow manifests, retrying until the apply succeeds:

while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 20; done
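The loop above retries forever, which can hide a persistent failure. As a sketch only, a bounded variant could look like the following; the function names `retry` and `apply_example` and the attempt count are illustrative assumptions, not part of the Kubeflow installation guide.

```shell
#!/usr/bin/env bash
# Sketch: run a command until it succeeds, giving up after a fixed number of attempts.

# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# (retry is a hypothetical helper introduced for this example.)
retry() {
  local max="$1" delay="$2"
  shift 2
  local n=1
  until "$@"; do
    if [ "$n" -ge "$max" ]; then
      echo "Giving up after $n attempts" >&2
      return 1
    fi
    echo "Retrying to apply resources ($n/$max)" >&2
    n=$((n + 1))
    sleep "$delay"
  done
}

# Same pipeline as in the reproduction step, wrapped so it can be retried.
apply_example() {
  kustomize build example | kubectl apply -f -
}

# Example invocation (commented out so the sketch has no side effects):
# retry 30 20 apply_example
```

With these example values the apply would be attempted at most 30 times, 20 seconds apart, and a non-zero exit status would signal that it never succeeded.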



juliusvonkohout · 2024-10-31T17:44:37Z

@tarekabouzeid @diegolovison are you able to reproduce this?
I also use Fedora for Kubeflow development.

tarekabouzeid · 2024-11-03T14:03:39Z

Hi @Al-Pragliola,
I installed Fedora 40 in a VM and tried to reproduce the issue by following the steps above, but I did not get the same result.
I used Docker Engine for this test. There is, however, a reported problem with Podman, if that is what you are using.
My environment:

kind version 0.24.0
Docker version 27.3.1, build ce12230

juliusvonkohout · 2024-11-04T08:59:17Z

Thank you @tarekabouzeid. Our CI/CD is also green, so please reopen this issue if it is still valid.

juliusvonkohout assigned tarekabouzeid Oct 31, 2024

juliusvonkohout closed this as completed Nov 4, 2024
