[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1049167, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803980 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1049166, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803981 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1575427 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 1575428) of binary: .../.conda/envs/mvp/bin/python
Traceback (most recent call last):
File ".../.conda/envs/mvp/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')())
File "..../.conda/envs/mvp/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "....conda/envs/mvp/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "....../.conda/envs/mvp/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "......./.conda/envs/mvp/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "......./.conda/envs/mvp/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
./tools/train.py FAILED
--------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-07-10_02:41:36
host : AI-3090
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 1575428)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 1575428
========================================================
I suggest using OpenPCDet: https://github.com/open-mmlab/OpenPCDet. This codebase is not actively maintained, so newer versions of torch / CUDA / apex may have unknown issues.
Hi, I am trying to train from scratch to reproduce the training results of the paper.
However, I found that the training loss becomes NaN after several hours.
I built the environment following INSTALL.md with the nuScenes dataset.
Environment:
GPU Driver:
PyTorch:
ffmpeg 4.3 hf484d3e_0 pytorch
pytorch 2.0.1 py3.9_cuda11.8_cudnn8.7.0_0 pytorch
pytorch-cuda 11.8 h7e8668a_5 pytorch
pytorch-mutex 1.0 cuda pytorch
torchtriton 2.0.0 py39 pytorch
torchvision 0.15.2 py39_cu118 pytorch
CUDA_HOME:
Some Modifications:
train.py
requirement.txt
deform_pool_cuda.cpp
deform_conv_cuda.cpp
(for both deform_pool_cuda.cpp and deform_conv_cuda.cpp, substitute every "AT_CHECK" with "TORCH_CHECK"; see the sketch below)
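For reference, a minimal sketch of that substitution, assuming the usual argument-checking macros in those files (the specific check shown here is illustrative, not copied from the repository):

```cpp
// Illustrative sketch only. In newer PyTorch releases the AT_CHECK macro was
// removed; TORCH_CHECK takes the same arguments, so the fix is a plain rename
// of every occurrence in deform_pool_cuda.cpp and deform_conv_cuda.cpp.
#include <torch/extension.h>

void shape_check_example(const at::Tensor& input) {
  // Before (fails to compile against recent libtorch):
  //   AT_CHECK(input.is_contiguous(), "input tensor has to be contiguous");
  // After:
  TORCH_CHECK(input.is_contiguous(), "input tensor has to be contiguous");
}

// Alternative: keep the original AT_CHECK calls unchanged and add a
// compatibility alias near the top of each file instead.
#ifndef AT_CHECK
#define AT_CHECK TORCH_CHECK
#endif
```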
Log:
-> Command:
(mvp) $ torchrun --nproc_per_node=2 ./tools/train.py ./configs/mvp/nusc_centerpoint_voxelnet_0075voxel_fix_bn_z_scale_virtual.py
-> Log file:
CenterPoint/work_dirs/nusc_centerpoint_voxelnet_0075voxel_fix_bn_z_scale_virtual/20230709_212542.log
Then, the program ends with the error shown above.
Could anyone solve this problem?
@tianweiy