🐛[bug] Running Mnist Tutorial distributed causes Runtime Errors and Hanging behavior #8915
Comments
Thanks for the report. Can you share the full stacktrace for this error? To avoid restarting 5 times, you can configure …
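For reference, a minimal sketch of how this could look in the Determined experiment config, assuming the setting being referred to is `max_restarts` (Determined retries a failed trial up to 5 times by default):

```yaml
# Hypothetical excerpt from the experiment config (e.g. distributed.yaml),
# assuming the suggestion above refers to max_restarts.
max_restarts: 0   # fail fast instead of retrying the trial 5 times
```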
experiment_138_trial_139_logs.txt
… images with …
Thanks, I fixed the Docker environment accordingly, but it did not seem to change anything with regard to the DDP rank problem above.
Reading up on NCCL_SOCKET_IFNAME: how do I determine whether that needs to be set for the Docker containers that Determined runs training in? My previous distributed training attempts were via Conda envs with no containers involved.
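A hedged sketch of how this could be checked and, if needed, set through the experiment config's `environment.environment_variables` section; `eth0` here is only a placeholder for the host's real NIC name:

```yaml
# Hypothetical excerpt from an experiment config. NCCL_DEBUG=INFO makes
# NCCL log which transports and network interfaces it selects inside the
# container, which is the easiest way to tell whether NCCL_SOCKET_IFNAME
# needs to be overridden at all.
environment:
  environment_variables:
    - NCCL_DEBUG=INFO
    # Uncomment and adjust only if the logs show NCCL binding to a
    # virtual interface such as docker0; eth0 is a placeholder.
    # - NCCL_SOCKET_IFNAME=eth0
```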
Okay. I assume this was without Docker, so in this thread we are basically troubleshooting "why does NCCL hang in Docker".
A few more ideas to try:
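The specific suggestions are not preserved here; judging by the shared-memory and transport options tested later in the thread, they were presumably NCCL environment variables along these lines, set through the experiment config (a sketch, not the original list):

```yaml
# Hypothetical troubleshooting overrides, assuming the suggestions were
# the NCCL transport toggles discussed later in this thread. Each one
# steers NCCL away from a particular intra-node communication path.
environment:
  environment_variables:
    - NCCL_DEBUG=INFO      # verbose logging of the transports NCCL picks
    - NCCL_SHM_DISABLE=1   # disable the shared-memory transport
    - NCCL_P2P_DISABLE=1   # disable CUDA peer-to-peer over PCIe/NVLink
```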
I don't know if there is any confusion here, but this thread is "does the basic OSS Determined-AI setup need NCCL modifications to prevent hanging when doing distributed training?" I am a bit unclear on why the assumption is that this was without Docker. I will investigate the modifications to the NCCL env variables, but I find it strange that a job runs just fine on 6 GPUs, fails on 7-8 GPUs, and the problem would be related to NCCL comms. It feels like if that were the case, the job would also fail on 6 GPUs.
I meant: when you said …
did you run it in Docker, or not? I assumed you'd done it without Docker, because that's how people usually do it.
A 30-minute NCCL comms timeout is bizarre and unexpected, so I am heavily discounting the params-size mismatch that comes after it. So you are saying that when you set …
My understanding is that you're running this example: https://github.com/determined-ai/determined/blob/main/examples/tutorials/mnist_pytorch/distributed.yaml. It works fine for me on 8 GPUs. This is DDP, so there is no model sharding at all; the model is supposed to be replicated across the GPUs. I don't see how that specific issue could be caused by the training code. That's why NCCL shared-memory settings and NCCL transport options aimed at fixing the 30-minute hang seem like a more promising path forward to me. Can you try running other examples, e.g. https://github.com/determined-ai/determined-examples/blob/main/computer_vision/cifar10_pytorch/distributed.yaml, but with 8 slots instead of 16?
We've seen such symptoms on 8-GPU servers where it ended up being a faulty GPU card that had to be replaced. These are hard to diagnose, so let's try to exhaust the other options first. For this, a possible set of tests could include disabling a couple of known-working cards in det and then running an experiment that uses the remaining faulty-candidate cards.
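As a concrete illustration of the suggestion above, running the CIFAR-10 example on a single 8-GPU agent would presumably just mean lowering `slots_per_trial` in its distributed.yaml:

```yaml
# Hypothetical edit to the CIFAR-10 example's distributed.yaml:
# reduce slots_per_trial from 16 to 8 so the trial fits on one 8-GPU agent.
resources:
  slots_per_trial: 8
```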
Hey, thanks for taking the time to help me out with this. I did some more thorough debugging, and the problem appears to be that two of the GPU cards do not work together. The previous 6-slot experiment worked because those two were not on the same job. I was able to replicate the NCCL hanging behavior on another 6-slot experiment where those two GPUs were scheduled together, and then ran a third ablation test with the problem GPUs separated, which worked, replicating the first experiment.
I tested both the shared-memory and transport-option env variables, and each one individually fixed the problem at 7-8 GPUs. The mnist code trained to completion.
I just attempted something similar by removing all of the GPUs from the agent pool except the two problem ones, was able to confirm the same NCCL hanging behavior, and am currently attempting the inverse with 7 GPUs. Does the fact that the shared-memory and transport changes work imply faulty communication buses or cards? I guess I don't really understand why both of those env variable changes work.
Okay, good to hear you've tracked it down.
I hoped that if we discovered it works without shared memory and P2P, it would confirm that something is wrong with the intra-node NVLink. I am not a hardware troubleshooting expert, so I'd refer you to your hardware vendor (or NVIDIA) for that. As far as I understand, technicians usually …
I closed this since my initial query seems to be solved. But I do have a follow-up: how do I debug the intra-node NVLink? I am also not a hardware troubleshooter, so I'll leave that for last.
Edit: I need to test GPU 1 and GPU 3's interaction, but this implies it may be a PIX issue (2 and 3 were the problem children).
Edit 2: For anyone who finds this at a later date: the issue is that PCIe Access Control Services (ACS) causes the hanging behavior by routing GPU 3's communication with GPUs 1 and 2 through the CPU, slowing it to a crawl and causing the distributed training to time out. NCCL_P2P_DISABLE=1 forces the GPUs not to talk to each other directly over PCIe (peer-to-peer), which can cost some performance. It appears that, if there is a will, one can turn off PCIe ACS (see the NCCL docs); TBD whether that directly fixes my issue.
Describe the bug
I am attempting to test my configuration/setup by running the MNIST tutorial in distributed mode across 1 agent with 8 GPUs. I've set the experiment configuration to 8 slots_per_trial. When I do that, parts of the training loop hang, eventually erroring out with a watchdog error.
On deeper inspection, it appears that a runtime error is occurring:
DDP expects same model across all ranks, but Rank 0 has 8 params, while rank 3 has inconsistent 0 params
It repeats this error for multiple ranks, but it is always "Rank # has 8 params" versus "rank # has 0 params".
Does anyone have any insight into what is going on here? I assume there isn't enough data to shard across the model, or that the model can't correctly be broken up into 8 chunks? Is there a way to prevent this from happening prior to training, or a fallback so it doesn't run this 5 times?
This also occurs on an agent with 8 slots and an experiment configuration of 7 slots_per_trial, but it doesn't occur at 6 slots_per_trial.
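For context, the relevant parts of the setup described above would presumably look something like this in the tutorial's distributed.yaml (a hypothetical excerpt; `max_restarts: 0` is optional and only addresses the question about avoiding the 5 retries):

```yaml
# Hypothetical excerpt of the mnist_pytorch tutorial's distributed.yaml
# as configured for this report: one 8-GPU agent, all slots in one trial.
resources:
  slots_per_trial: 8   # 7 also reproduces the hang; 6 does not
max_restarts: 0        # optional: avoid re-running the failing trial 5 times
```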
Reproduction Steps
Expected Behavior
I expected the model to train on 8 GPUs without issue.
Screenshot
N/A
Environment
Additional Context
No response