
PyTorchJob worker pods crashloops in non-default namespace #258

Open
jobvarkey opened this issue Feb 5, 2020 · 7 comments

@jobvarkey

Hello,

I am running Kubernetes v1.15.7 and Kubeflow 0.7.0 on an on-prem cluster with 6 worker nodes. Each node has 2 GPUs.

The provided mnist.py works fine when running under the default namespace (kubectl apply -f pytorch_job_mnist_gloo.yaml).

But the worker pod(s) crash-loop when the job is submitted in a non-default namespace (for example: kubectl apply -f pytorch_job_mnist_gloo.yaml -n i70994). The master pod is in the Running state.

root@0939-jdeml-m01:/tmp# kubectl get pods -n i70994
NAME                               READY   STATUS             RESTARTS   AGE
jp-nb1-0                           2/2     Running            0          18h
pytorch-dist-mnist-gloo-master-0   2/2     Running            1          33m
pytorch-dist-mnist-gloo-worker-0   1/2     CrashLoopBackOff   11         33m

kubectl_describe_pod_pytorch-dist-mnist-gloo-master-0.txt
kubectl_describe_pod_pytorch-dist-mnist-gloo-worker-0.txt
kubectl_logs_pytorch-dist-mnist-gloo-worker-0_container_istio-system.txt
kubectl_logs_pytorch-dist-mnist-gloo-worker-0_container_pytorch.txt

pytorch_job_mnist_gloo.yaml.txt
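
For reference, the job manifest follows the shape of the upstream distributed MNIST gloo example. The sketch below is only an approximation of that layout; the apiVersion, image, and replica counts are assumptions and may differ from the attached pytorch_job_mnist_gloo.yaml.txt:

    apiVersion: kubeflow.org/v1
    kind: PyTorchJob
    metadata:
      name: pytorch-dist-mnist-gloo
    spec:
      pytorchReplicaSpecs:
        Master:
          replicas: 1
          restartPolicy: OnFailure
          template:
            spec:
              containers:
                - name: pytorch
                  # image is assumed; the attached YAML defines the actual one
                  image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
                  args: ["--backend", "gloo"]
        Worker:
          replicas: 1
          restartPolicy: OnFailure
          template:
            spec:
              containers:
                - name: pytorch
                  image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
                  args: ["--backend", "gloo"]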

Can anyone please help with this issue?

Thanks,
Job

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
bug 0.74


issue-label-bot added the bug label Feb 5, 2020
jlewi added kind/bug and removed bug labels Feb 5, 2020
@gaocegege
Member

It seems that the Istio proxy is injected into the training pod. Are you running the job in the kubeflow namespace?

@jobvarkey
Author

The job is running in namespace 'i70994'. This namespace was created when I logged in to the Kubeflow UI for the first time. Thanks

@gaocegege
Member

Can you show me the result of kubectl describe ns i70994?

@jobvarkey
Author

root@0939-jdeml-m01:~# kubectl describe ns i70994
Name:         i70994
Labels:       istio-injection=enabled
              katib-metricscollector-injection=enabled
              serving.kubeflow.org/inferenceservice=enabled
Annotations:  owner: [email protected]
Status:       Active

No resource quota.

No resource limits.

@636
commented Apr 8, 2020

Hi @jobvarkey
I guess the cause is istio-injection being enabled on your namespace.
Could you try appending the code below to the template section in pytorch_job_mnist_gloo.yaml?
It disables istio-injection for your PyTorchJob.

        metadata:
          annotations:
            sidecar.istio.io/inject: "false"

see: https://istio.io/docs/setup/additional-setup/sidecar-injection/
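
To make the placement concrete: a PyTorchJob has one template section per replica type, so the annotation can be added under both the Master and the Worker pod templates. In the sketch below only the metadata/annotations lines are the suggested change; the surrounding fields mirror the example manifest sketched earlier and may differ from the real YAML:

    spec:
      pytorchReplicaSpecs:
        Master:
          replicas: 1
          restartPolicy: OnFailure
          template:
            metadata:
              annotations:
                sidecar.istio.io/inject: "false"   # keep the Istio proxy out of this pod
            spec:
              containers:
                - name: pytorch
                  image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
                  args: ["--backend", "gloo"]
        Worker:
          replicas: 1
          restartPolicy: OnFailure
          template:
            metadata:
              annotations:
                sidecar.istio.io/inject: "false"
            spec:
              containers:
                - name: pytorch
                  image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
                  args: ["--backend", "gloo"]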

But I don't know whether this change affects anything else.
Could anyone explain it?

With this change, I was able to run mnist_gloo.

@shawnzhu
Member

But I don't know whether this change affects anything else.
Could anyone explain it?

This comment provides details: kubeflow/kubeflow#4935 (comment)

Basically, if Istio sidecar injection is disabled, ANY pod within the cluster can access your PyTorchJob pods by pod name without mTLS, e.g., pytorch-dist-mnist-gloo-master-0.<namespace>.

By default, when running a PyTorchJob in a user namespace profile that has Istio sidecar injection enabled, the worker pods fail with an error like RuntimeError: Connection reset by peer.
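
If removing the sidecar is undesirable, one alternative worth noting is to keep the proxy but relax mTLS only for the training pods with an Istio PeerAuthentication policy. This is a sketch only: it has not been verified against this setup, it requires Istio 1.5 or newer (newer than the Istio bundled with Kubeflow 0.7), and the selector label is an assumption about what the operator sets on the pods:

    apiVersion: security.istio.io/v1beta1
    kind: PeerAuthentication
    metadata:
      name: pytorch-dist-mnist-gloo-permissive
      namespace: i70994
    spec:
      selector:
        matchLabels:
          # assumed label key/value; verify with: kubectl get pods -n i70994 --show-labels
          pytorch-job-name: pytorch-dist-mnist-gloo
      mtls:
        mode: PERMISSIVE   # accept both plain-text and mTLS traffic to these pods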
