Get stuck after one epoch of training (Multi GPU DDP) See this! #203
Comments
I also met this problem and I wonder if you have fixed it yet? |
Nope. Maybe you can try mmdetection3d. |
I also have no clue (my school server only has CUDA 10.0, so I am still using torch 1.1.0 for training, and there is no issue in that version). I feel the problem is related to apex, according to a few other recent issues, and you may want to replace apex syncbn with torch's sync BN. Will check this more after I finish this semester (in two weeks). |
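For reference, a minimal sketch of what swapping apex syncbn for torch's native sync BN could look like (an assumption about how the swap is done, not the repository's exact code):

```python
import torch

# Hypothetical sketch: convert all BatchNorm layers of an already-built model to
# torch's native SyncBatchNorm instead of apex.parallel.convert_syncbn_model.
# `model` here is a toy stand-in for the detector built elsewhere in the script.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.BatchNorm2d(8))
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
```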
I also encountered this problem. How can I fix it? |
I think the numba version may be incompatible with other dependencies. You can try my version plan above or carefully follow the author's instructions for each package. |
But there is no torch==1.4.0 build for CUDA 10.1, i.e. no torch==1.4.0+cu101. |
It's said that "PyTorch 1.4.0 shipped with CUDA 10.1 by default, so there is no separate package with the cu101 suffix; those are only for alternative CUDA versions." |
The problem is related to DDP in recent torch versions. It should be fixed now by e30f768, and you should be able to use the most recent torch versions. Let me know if there are any further problems. |
I am still stuck in the training process after merging the last two commits https://github.com/tianweiy/CenterPoint/commit/e30f768a36427029b1fa055563583aafd9b58db2 My environment is torch 1.7.0+cu101, V100-SXM2 16G. |
Oh, interesting, do you get a timeout error? I also noticed a slightly large delay between epochs, but it does proceed after some time. Could you try a simple example? I just pushed a new cfg to simulate the training process. Could you run it? It will only take a minute or so for one epoch. I want to know if you still get stuck with this cfg. |
Hi there, I use the Waymo dataset and I don't know the differences in your debug setting. But when I test training on Waymo with load_interval=1000, the stuck disappears. I don't know why. |
Got it. Yeah, I only change the interval to subsample the dataset. Hmm, weird then. Maybe just use torch 1.4 if it is fine for your case. I will look into this further. |
When load_interval = 5, it gets stuck; with load_interval = 1000, it works. It confuses me. |
Thanks, another thing you can try is changing CenterPoint/det3d/torchie/apis/train.py, line 268 in 3fd0b87, to |
I am now running a few experiments to see if there are any performance differences due to these two changes and will update soon. (Update: results with spconv 2.x + torch nn syncbn are similar to the original version.) I am able to train the full nuScenes dataset with 8-GPU DDP (Titan V) and the latest torch (1.10.1 + CUDA 11.3). |
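For readers following along, a minimal sketch of a native torch DDP setup with torch syncbn (illustrative only, not the repository's exact code; the tiny model is a stand-in for the detector):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Hypothetical sketch, assuming the script was started with torch.distributed.launch
# or torchrun so that RANK, WORLD_SIZE and LOCAL_RANK are set in the environment.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.BatchNorm2d(8))
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)  # torch syncbn, not apex
model = model.cuda(local_rank)
model = DDP(model, device_ids=[local_rank], find_unused_parameters=True)
```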
I have tried several combinations but am still stuck. The only thing that works for me is to add the env variable NCCL_BLOCKING_WAIT=1 when starting the training process. However, it slows down training and I don't know why. |
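If it helps to see where that variable goes, a hedged alternative to prefixing the launch command is to set it in the training script itself, assuming it runs before the NCCL process group is created:

```python
import os

# Hypothetical placement near the top of tools/train.py, before
# torch.distributed.init_process_group is called. NCCL_BLOCKING_WAIT makes NCCL
# collectives block and raise on timeout instead of hanging silently, which adds
# some overhead and may explain the slowdown observed above.
os.environ["NCCL_BLOCKING_WAIT"] = "1"
```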
Maybe it is related to multiprocessing, like adding a line like this to train.py before initializing DDP. No clue whether this works or not, though, because I just could not reproduce your error. |
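The exact line is not shown above; as a hedged guess at the kind of workaround meant, one line sometimes used for DataLoader-related DDP hangs is:

```python
import torch.multiprocessing as mp

# Hypothetical example of a multiprocessing tweak placed in train.py before DDP
# is initialized: switch tensor sharing from file descriptors to the file system,
# a common workaround for DataLoader worker hangs.
mp.set_sharing_strategy("file_system")
```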
I have the same problem, have you solved it? I guess it is related to num_workers? |
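To check that num_workers guess, a minimal debugging sketch (illustrative, with a toy dataset standing in for the real one) is to run the loader without worker processes:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical debugging step: with num_workers=0 all loading happens in the main
# process. If the post-epoch hang disappears, the problem is in the worker
# processes rather than in DDP itself.
dataset = TensorDataset(torch.randn(64, 3), torch.randint(0, 2, (64,)))  # toy stand-in
loader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=0)
for batch in loader:
    pass
```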
I have the same problem!!!! |
Do you have any ideas? |
Hello, I'm confused about the effect of load_interval. Can you explain what this parameter means? |
load_interval probably is not the root cause. It defines how we subsample the dataset (with 10, we use 1/10 of the dataset). Unfortunately, I am not able to reproduce this issue... Also see #314 |
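As an illustration of that subsampling (a sketch, not the repository's exact dataset code):

```python
# Hypothetical sketch of interval-based subsampling of the annotation infos.
def subsample(infos, load_interval):
    # keep every `load_interval`-th sample, i.e. 1/load_interval of the data
    return infos[::load_interval]

print(len(subsample(list(range(1000)), 10)))  # -> 100, i.e. 1/10 of the dataset
```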
I still get stuck with way 1... TAT |
I guess the problem comes from 'finding looplift candidates'; I always get stuck at this step after training an epoch. Do you know what that means? |
It has nothing to do with finding looplift candidates; that is just a byproduct of starting a new epoch. Unfortunately, I don't know what the root cause is (some people get this issue and some don't...) |
I tried changing load_interval from 1 to 100 just now, and it seems there is no stuck anymore. |
I have tried several ways, including changing load_interval as mentioned above in this issue. I suggest you try the earlier comment: "The only thing that works for me is to add the env variable NCCL_BLOCKING_WAIT=1 when starting the training process. However, it slows down training and I don't know why." |
OK, I'll try, thank you~ |
Hello, I'm back~ I have tried this way for training recently; it seems there is no stuck anymore, and the speed seems normal. |
wow, that's amazing! |
This one may also be relevant for other people: pytorch/pytorch#50820 |
I tried using NCCL_BLOCKING_WAIT=1, but it does not work for me. |
When training with multiple GPUs, the program stops at "INFO - finding looplift candidates" after one epoch of training. This info message might come from numba, but I am not able to locate it exactly. Has anyone met the same problem?
See #203 (comment) for solution