
🤔 [question] “dtrainNetworkInterface” does not seem to take effect when deployed on k8s #9839

Closed
ShiroZhang opened this issue Aug 20, 2024 · 5 comments

@ShiroZhang

Describe your question

Every server has 2 NICs (enp6s18 [1G] and enp94s0f1np1 [RDMA 100G]):

[screenshot: list of network interfaces on the server]

I want to run my experiment over the enp94s0f1np1 NIC, so I set the helm values like this:

[screenshot: helm values setting dtrainNetworkInterface]
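
A minimal sketch of such a values override (the exact file is only visible in the screenshot; the taskContainerDefaults section is an assumption about where the Determined helm chart exposes this setting):

# Assumed location in the helm values file, not copied from the screenshot.
taskContainerDefaults:
  dtrainNetworkInterface: enp94s0f1np1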

But during the trial, traffic still appears to be going over enp6s18:

[screenshot: NIC traffic during the trial]

I'm not sure whether the reason is that the task communicates over the RDMA protocol and the Prometheus node exporter can't monitor RDMA traffic.

Checklist

  • Did you search the docs for a solution?
  • Did you search github issues to find if somebody asked this question before?
@ioga (Contributor) commented Aug 20, 2024

Try setting the NCCL_SOCKET_IFNAME environment variable in your experiment config:

environment:
  environment_variables:
  - NCCL_DEBUG=INFO
  - NCCL_SOCKET_IFNAME=enp94s0f1np1

@ShiroZhang (Author)

> environment_variables:
>   - NCCL_DEBUG=INFO
>   - NCCL_SOCKET_IFNAME=enp94s0f1np1

Thanks for your reply.

When I start Determined with Docker, the container running the experiment uses host network mode, so I can point the job at the host NIC (enp94s0f1np1). However, when Determined is deployed on k8s, the pod running the job only has an eth0 interface:

root@manage-vm-node05:/mnt/volume/userdata/object_detection/fcos# kubectl get pod -n 4070ti
NAME                                           READY   STATUS    RESTARTS   AGE
det-55cd4460-exp-21-trial-21-attempt-1-j627w   1/1     Running   0          13m
det-55cd4460-exp-21-trial-21-attempt-1-rwm29   1/1     Running   0          13m
root@manage-vm-node05:/mnt/volume/userdata/object_detection/fcos# kubectl exec -it -n 4070ti det-55cd4460-exp-21-trial-21-attempt-1-j627w -c determined-container -- bash
root@det-55cd4460-exp-21-trial-21-attempt-1-j627w:/run/determined/workdir# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1480
        inet 10.244.125.85  netmask 255.255.255.255  broadcast 0.0.0.0
        inet6 fe80::6c23:b3ff:fe37:e26b  prefixlen 64  scopeid 0x20<link>
        ether 6e:23:b3:37:e2:6b  txqueuelen 1000  (Ethernet)
        RX packets 24075402  bytes 74018401828 (74.0 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 13695264  bytes 73835112152 (73.8 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 1508278  bytes 136700480 (136.7 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1508278  bytes 136700480 (136.7 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

root@det-55cd4460-exp-21-trial-21-attempt-1-j627w:/run/determined/workdir# 

So setting the NCCL_SOCKET_IFNAME environment variable causes errors:
[screenshot: NCCL error output]

So my question is: when deploying Determined on k8s, how can I configure things so that experiments use the enp94s0f1np1 interface, just as they do when deployed with Docker?

@ioga (Contributor) commented Aug 20, 2024

If you start a k8s pod outside of Determined, does it have access to that network interface? If not, you need to configure your k8s setup to allow it. You can use the pod_spec feature on the Determined side if this requires any modification to the pod spec.
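
For example, a minimal pod_spec sketch that mirrors the Docker host-network behaviour might look like this (whether hostNetwork is allowed is an assumption about your cluster's policy, not something confirmed in this thread):

environment:
  pod_spec:
    apiVersion: v1
    kind: Pod
    spec:
      hostNetwork: true   # exposes the node's NICs (e.g. enp94s0f1np1) inside the pod

Alternatively, a secondary network attachment (e.g. Multus) or an RDMA device plugin can expose the 100G interface without putting the pod on the host network.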

@ioga (Contributor) commented Aug 21, 2024

I got a suggestion that you should make sure your NVIDIA GPU Operator has RDMA configured: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-rdma.html#installing-the-gpu-operator-and-enabling-gpudirect-rdma
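
Once GPUDirect RDMA is working, a sketch of the experiment-config environment variables for running NCCL over the RDMA device might look like this (mlx5_0 is an assumed HCA name; check what the device is actually called on your nodes):

environment:
  environment_variables:
    - NCCL_DEBUG=INFO
    - NCCL_IB_HCA=mlx5_0   # assumed device name; NCCL_IB_HCA selects the InfiniBand/RoCE adapter

This only applies once the RDMA device is actually visible inside the trial containers.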

@ShiroZhang (Author)

> I got a suggestion that you should make sure your NVIDIA GPU Operator has RDMA configured: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-rdma.html#installing-the-gpu-operator-and-enabling-gpudirect-rdma

Thanks a lot, I'm going to explore that.
