
PyTorchJob worker pods crashloops in non-default namespace #258

Open
jobvarkey opened this issue Feb 5, 2020 · 7 comments

@jobvarkey

Hello,

I am running Kubernetes v1.15.7 and Kubeflow 0.7.0 on an on-prem cluster with 6 worker nodes. Each node has 2 GPUs.

The provided mnist.py works fine when running under the default namespace (kubectl apply -f pytorch_job_mnist_gloo.yaml).

But the worker pod(s) crash-loop when the job is submitted in a non-default namespace (for example: kubectl apply -f pytorch_job_mnist_gloo.yaml -n i70994). The master pod is in the Running state.

root@0939-jdeml-m01:/tmp# kubectl get pods -n i70994
NAME                               READY   STATUS             RESTARTS   AGE
jp-nb1-0                           2/2     Running            0          18h
pytorch-dist-mnist-gloo-master-0   2/2     Running            1          33m
pytorch-dist-mnist-gloo-worker-0   1/2     CrashLoopBackOff   11         33m

kubectl_describe_pod_pytorch-dist-mnist-gloo-master-0.txt
kubectl_describe_pod_pytorch-dist-mnist-gloo-worker-0.txt
kubectl_logs_pytorch-dist-mnist-gloo-worker-0_container_istio-system.txt
kubectl_logs_pytorch-dist-mnist-gloo-worker-0_container_pytorch.txt

pytorch_job_mnist_gloo.yaml.txt
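
For reference, the job manifest follows the shape of the upstream distributed MNIST gloo example. The sketch below is only an approximation of that layout; the apiVersion, image, and replica counts are assumptions and may differ from the attached pytorch_job_mnist_gloo.yaml.txt:

    apiVersion: kubeflow.org/v1
    kind: PyTorchJob
    metadata:
      name: pytorch-dist-mnist-gloo
    spec:
      pytorchReplicaSpecs:
        Master:
          replicas: 1
          restartPolicy: OnFailure
          template:
            spec:
              containers:
                - name: pytorch
                  # image is assumed; the attached YAML defines the actual one
                  image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
                  args: ["--backend", "gloo"]
        Worker:
          replicas: 1
          restartPolicy: OnFailure
          template:
            spec:
              containers:
                - name: pytorch
                  image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
                  args: ["--backend", "gloo"]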

Can anyone please help with this issue?

Thanks,
Job

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
bug 0.74


issue-label-bot added the bug label Feb 5, 2020
jlewi added kind/bug and removed bug labels Feb 5, 2020
@gaocegege
Member

It seems that the Istio proxy is injected into the training pod. Are you running the job in the kubeflow namespace?

@jobvarkey
Author

The job is running in namespace 'i70994'. This namespace was created when I logged in to the Kubeflow UI for the first time. Thanks

@gaocegege
Member

Can you show me the result of kubectl describe ns i70994?

@jobvarkey
Author

root@0939-jdeml-m01:~# kubectl describe ns i70994
Name:         i70994
Labels:       istio-injection=enabled
              katib-metricscollector-injection=enabled
              serving.kubeflow.org/inferenceservice=enabled
Annotations:  owner: [email protected]
Status:       Active

No resource quota.

No resource limits.

@636
commented Apr 8, 2020

Hi @jobvarkey
I guess the cause is istio-injection being enabled on your namespace.
Could you try appending the code below to the template section in pytorch_job_mnist_gloo.yaml?
It disables istio-injection for your PyTorchJob.

        metadata:
          annotations:
            sidecar.istio.io/inject: "false"

see: https://istio.io/docs/setup/additional-setup/sidecar-injection/
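
To make the placement concrete: a PyTorchJob has one template section per replica type, so the annotation can be added under both the Master and the Worker pod templates. In the sketch below only the metadata/annotations lines are the suggested change; the surrounding fields mirror the example manifest sketched earlier and may differ from the real YAML:

    spec:
      pytorchReplicaSpecs:
        Master:
          replicas: 1
          restartPolicy: OnFailure
          template:
            metadata:
              annotations:
                sidecar.istio.io/inject: "false"   # keep the Istio proxy out of this pod
            spec:
              containers:
                - name: pytorch
                  image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
                  args: ["--backend", "gloo"]
        Worker:
          replicas: 1
          restartPolicy: OnFailure
          template:
            metadata:
              annotations:
                sidecar.istio.io/inject: "false"
            spec:
              containers:
                - name: pytorch
                  image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
                  args: ["--backend", "gloo"]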

But I don't know whether this change affects anything else.
Could anyone explain it?

With this change, I was able to run mnist_gloo.

@shawnzhu
Member

But I don't know whether this change affects anything else.
Could anyone explain it?

This comment provides details: kubeflow/kubeflow#4935 (comment)

Basically, if Istio sidecar injection is disabled, ANY pod within the cluster can access your PyTorchJob pods by pod name without mTLS, e.g., pytorch-dist-mnist-gloo-master-0.<namespace>.

By default, when running a PyTorchJob in a user namespace profile that has Istio sidecar injection enabled, the worker pods fail with an error like RuntimeError: Connection reset by peer.
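
If removing the sidecar is undesirable, one alternative worth noting is to keep the proxy but relax mTLS only for the training pods with an Istio PeerAuthentication policy. This is a sketch only: it has not been verified against this setup, it requires Istio 1.5 or newer (newer than the Istio bundled with Kubeflow 0.7), and the selector label is an assumption about what the operator sets on the pods:

    apiVersion: security.istio.io/v1beta1
    kind: PeerAuthentication
    metadata:
      name: pytorch-dist-mnist-gloo-permissive
      namespace: i70994
    spec:
      selector:
        matchLabels:
          # assumed label key/value; verify with: kubectl get pods -n i70994 --show-labels
          pytorch-job-name: pytorch-dist-mnist-gloo
      mtls:
        mode: PERMISSIVE   # accept both plain-text and mTLS traffic to these pods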
