PyTorchJob worker pods crashloops in non-default namespace #258
It seems that the Istio proxy is injected into the training pod. Are you running the job in the kubeflow namespace?
The job is running in the namespace 'i70994'. This namespace was created when I logged in to the Kubeflow UI for the first time. Thanks.
Can you show me the result of `kubectl describe ns i70994`?
```
root@0939-jdeml-m01:~# kubectl describe ns i70994
No resource quota.
No resource limits.
```
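For what it's worth, `kubectl describe ns` does not show labels. Whether the sidecar gets injected is normally driven by the namespace's `istio-injection` label (an assumption here, since Kubeflow profile namespaces are usually labelled this way). A quick way to check:

```sh
# Show all labels on the profile namespace; istio-injection=enabled
# means new pods in it get the Istio sidecar injected.
kubectl get namespace i70994 --show-labels

# Print only the injection label, if present.
kubectl get namespace i70994 -o jsonpath='{.metadata.labels.istio-injection}'
```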
Hi @jobvarkey, adding this annotation to the pod template metadata disables sidecar injection for the job's pods:

```yaml
metadata:
  annotations:
    sidecar.istio.io/inject: "false"
```

See: https://istio.io/docs/setup/additional-setup/sidecar-injection/

I don't know whether this change affects anything else, but with it I was able to run mnist_gloo.
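For context, here is a minimal sketch of where that annotation would sit in a PyTorchJob manifest. It assumes the v1 PyTorchJob API, and the image name is a placeholder rather than the one from the YAML attached to this issue:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-dist-mnist-gloo
  namespace: i70994
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"   # skip Istio sidecar for this pod
        spec:
          containers:
            - name: pytorch
              image: example.com/pytorch-dist-mnist:latest  # placeholder image
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"   # skip Istio sidecar for this pod
        spec:
          containers:
            - name: pytorch
              image: example.com/pytorch-dist-mnist:latest  # placeholder image
```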
This comment provides details: kubeflow/kubeflow#4935 (comment). Basically, if Istio sidecar injection is disabled, ANY pod within the cluster can reach your PyTorchJob pods by pod name without mTLS. By default, when a PyTorchJob runs in a user namespace profile that has Istio sidecar injection enabled, the worker pods report an error like the one in the logs attached to this issue.
Hello,
I am running Kubernetes v1.15.7 and Kubeflow 0.7.0 on an on-prem cluster with 6 worker nodes; each node has 2 GPUs.
The provided mnist.py works fine when run in the default namespace (kubectl apply -f pytorch_job_mnist_gloo.yaml).
But the worker pod(s) crash-loop when the job is submitted in a non-default namespace (for example: kubectl apply -f pytorch_job_mnist_gloo.yaml -n i70994). The master pod is in the Running state.
```
root@0939-jdeml-m01:/tmp# kubectl get pods -n i70994
NAME                               READY   STATUS             RESTARTS   AGE
jp-nb1-0                           2/2     Running            0          18h
pytorch-dist-mnist-gloo-master-0   2/2     Running            1          33m
pytorch-dist-mnist-gloo-worker-0   1/2     CrashLoopBackOff   11         33m
```
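The attached describe output and container logs below can be reproduced with commands along these lines (a sketch; the Istio sidecar container is usually named istio-proxy, which may differ from the name suggested by the attached filename):

```sh
# Pod events often show why a container keeps restarting.
kubectl describe pod pytorch-dist-mnist-gloo-master-0 -n i70994
kubectl describe pod pytorch-dist-mnist-gloo-worker-0 -n i70994

# Logs from each container in the crash-looping worker pod.
kubectl logs pytorch-dist-mnist-gloo-worker-0 -c pytorch -n i70994
kubectl logs pytorch-dist-mnist-gloo-worker-0 -c istio-proxy -n i70994
```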
kubectl_describe_pod_pytorch-dist-mnist-gloo-master-0.txt
kubectl_describe_pod_pytorch-dist-mnist-gloo-worker-0.txt
kubectl_logs_pytorch-dist-mnist-gloo-worker-0_container_istio-system.txt
kubectl_logs_pytorch-dist-mnist-gloo-worker-0_container_pytorch.txt
pytorch_job_mnist_gloo.yaml.txt
Can anyone please help with this issue?
Thanks,
Job