
🤔 [question] “dtrainNetworkInterface” does not seem to take effect when deployed on k8s #9839

Closed
ShiroZhang opened this issue Aug 20, 2024 · 5 comments

@ShiroZhang

Describe your question

Every server has 2 NICs (enp6s18 [1G] and enp94s0f1np1 [RDMA 100G]):

[screenshot: list of network interfaces on the server]

I want to run my experiment over the enp94s0f1np1 NIC, so I set the helm values like this:

[screenshot: helm values setting dtrainNetworkInterface]
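
A minimal sketch of such a values override (the exact file is only visible in the screenshot; the taskContainerDefaults section is an assumption about where the Determined helm chart exposes this setting):

# Assumed location in the helm values file, not copied from the screenshot.
taskContainerDefaults:
  dtrainNetworkInterface: enp94s0f1np1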

But during the trial, traffic still appears to be going over enp6s18:

[screenshot: NIC traffic during the trial]

I'm not sure whether the reason is that the task communicates over the RDMA protocol and the Prometheus node exporter can't monitor RDMA traffic.

Checklist

  • Did you search the docs for a solution?
  • Did you search github issues to find if somebody asked this question before?
@ioga (Contributor) commented Aug 20, 2024

Try setting the NCCL_SOCKET_IFNAME environment variable in your experiment config:

environment:
  environment_variables:
  - NCCL_DEBUG=INFO
  - NCCL_SOCKET_IFNAME=enp94s0f1np1

@ShiroZhang (Author)

> environment_variables:
>   - NCCL_DEBUG=INFO
>   - NCCL_SOCKET_IFNAME=enp94s0f1np1

Thanks for your reply.

When I start Determined with Docker, the container running the experiment uses host network mode, so I can point the job at the host NIC (enp94s0f1np1). However, when Determined is deployed on k8s, the pod running the job only has an eth0 interface:

root@manage-vm-node05:/mnt/volume/userdata/object_detection/fcos# kubectl get pod -n 4070ti
NAME                                           READY   STATUS    RESTARTS   AGE
det-55cd4460-exp-21-trial-21-attempt-1-j627w   1/1     Running   0          13m
det-55cd4460-exp-21-trial-21-attempt-1-rwm29   1/1     Running   0          13m
root@manage-vm-node05:/mnt/volume/userdata/object_detection/fcos# kubectl exec -it -n 4070ti det-55cd4460-exp-21-trial-21-attempt-1-j627w -c determined-container -- bash
root@det-55cd4460-exp-21-trial-21-attempt-1-j627w:/run/determined/workdir# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1480
        inet 10.244.125.85  netmask 255.255.255.255  broadcast 0.0.0.0
        inet6 fe80::6c23:b3ff:fe37:e26b  prefixlen 64  scopeid 0x20<link>
        ether 6e:23:b3:37:e2:6b  txqueuelen 1000  (Ethernet)
        RX packets 24075402  bytes 74018401828 (74.0 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 13695264  bytes 73835112152 (73.8 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 1508278  bytes 136700480 (136.7 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1508278  bytes 136700480 (136.7 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

root@det-55cd4460-exp-21-trial-21-attempt-1-j627w:/run/determined/workdir# 

So setting the NCCL_SOCKET_IFNAME environment variable causes errors:
[screenshot: NCCL error output]

So my question is: when deploying Determined on k8s, how can I configure things so that experiments use the enp94s0f1np1 interface, just as they do when deployed with Docker?

@ioga (Contributor) commented Aug 20, 2024

If you start a k8s pod outside of Determined, does it have access to that network interface? If not, you need to configure your k8s setup to allow it. You can use the pod_spec feature on the Determined side if this requires any modification to the pod spec.
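
For example, a minimal pod_spec sketch that mirrors the Docker host-network behaviour might look like this (whether hostNetwork is allowed is an assumption about your cluster's policy, not something confirmed in this thread):

environment:
  pod_spec:
    apiVersion: v1
    kind: Pod
    spec:
      hostNetwork: true   # exposes the node's NICs (e.g. enp94s0f1np1) inside the pod

Alternatively, a secondary network attachment (e.g. Multus) or an RDMA device plugin can expose the 100G interface without putting the pod on the host network.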

@ioga (Contributor) commented Aug 21, 2024

I got a suggestion that you should make sure your NVIDIA GPU Operator has RDMA configured: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-rdma.html#installing-the-gpu-operator-and-enabling-gpudirect-rdma
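
Once GPUDirect RDMA is working, a sketch of the experiment-config environment variables for running NCCL over the RDMA device might look like this (mlx5_0 is an assumed HCA name; check what the device is actually called on your nodes):

environment:
  environment_variables:
    - NCCL_DEBUG=INFO
    - NCCL_IB_HCA=mlx5_0   # assumed device name; NCCL_IB_HCA selects the InfiniBand/RoCE adapter

This only applies once the RDMA device is actually visible inside the trial containers.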

@ShiroZhang (Author)

> I got a suggestion that you should make sure your NVIDIA GPU Operator has RDMA configured: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-rdma.html#installing-the-gpu-operator-and-enabling-gpudirect-rdma

Thanks a lot, I'm going to explore that.
