🤔[question] “dtrainNetworkInterface” does not seem to take effect when deployed on k8s #9839
Comments
try setting
If you start a k8s pod outside of Determined, does it have access to that network interface? If not, you need to configure your k8s setup to allow that. You can use the pod_spec feature on the Determined side if this requires any modification to the pod specs.
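One way to check this from inside a plain test pod is to list the interfaces the pod can see. The snippet below is only a minimal sketch using the Python standard library; the helper names `visible_interfaces` and `has_interface` are made up for illustration, and the interface name is the one from the question.

```python
import socket


def visible_interfaces():
    """Return the names of network interfaces visible in this network namespace."""
    return [name for _, name in socket.if_nameindex()]


def has_interface(name, interfaces=None):
    """Return True if `name` is among the given (or currently visible) interfaces."""
    if interfaces is None:
        interfaces = visible_interfaces()
    return name in interfaces


if __name__ == "__main__":
    wanted = "enp94s0f1np1"  # NIC name taken from the question
    names = visible_interfaces()
    print("visible interfaces:", names)
    print(wanted, "present:", has_interface(wanted, names))
```

If the host NIC name is missing from the list, the pod is running in its own network namespace, and the interface setting probably has nothing to bind to until the pod spec or cluster networking exposes that NIC.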
I got a suggestion that you should make sure your nvidia operator has RDMA configured: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-rdma.html#installing-the-gpu-operator-and-enabling-gpudirect-rdma
Thanks a lot, I'm going to explore that.
Describe your question
Every server has 2 NICs (enp6s18 [1G], enp94s0f1np1 [RDMA 100G]),
and I want to run my experiment over NIC enp94s0f1np1, so I set the Helm values like this:
but during the trial, traffic still seems to go over NIC enp6s18.
I'm not sure whether the reason is that the task communicates using the RDMA protocol and the Prometheus node exporter can't monitor RDMA traffic.
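To tell these two cases apart, one option is to read the RDMA port counters directly, since RDMA traffic generally bypasses the kernel network stack and so may not show up in the usual per-interface byte counters. The following is a rough sketch, assuming the RDMA driver exposes `port_xmit_data` and `port_rcv_data` under `/sys/class/infiniband/<device>/ports/<port>/counters/` on the node; the function name `read_rdma_counters` is hypothetical.

```python
from pathlib import Path


def read_rdma_counters(root="/sys/class/infiniband"):
    """Collect port_xmit_data / port_rcv_data for every RDMA device port under `root`.

    Returns a dict keyed by (device, port). Missing or unreadable counters are skipped.
    """
    counters = {}
    base = Path(root)
    if not base.is_dir():
        return counters  # no RDMA devices exposed here (assumed sysfs layout)
    for dev in sorted(base.iterdir()):
        ports = dev / "ports"
        if not ports.is_dir():
            continue
        for port in sorted(ports.iterdir()):
            values = {}
            for name in ("port_xmit_data", "port_rcv_data"):
                try:
                    values[name] = int((port / "counters" / name).read_text().strip())
                except (OSError, ValueError):
                    continue
            if values:
                counters[(dev.name, port.name)] = values
    return counters


if __name__ == "__main__":
    for (dev, port), values in read_rdma_counters().items():
        print(dev, port, values)
```

Sampling these counters before and during a trial should show whether they increase. If they stay flat while the enp6s18 counters climb, the training traffic is most likely not using the RDMA NIC.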