Train resulting as NaN loss #396

Open
OuYaozhong opened this issue Jul 10, 2023 · 2 comments

OuYaozhong commented Jul 10, 2023

Hi, I am trying to train from scratch to reproduce the training results of the paper.

However, the training loss becomes NaN after several hours (in the log below, roughly 4 hours 40 minutes in, first logged at iteration 6840 of epoch 1).

I built the environment following INSTALL.md, using the nuScenes dataset.

Environment:

GPU Driver:
image

PyTorch:
ffmpeg 4.3 hf484d3e_0 pytorch
pytorch 2.0.1 py3.9_cuda11.8_cudnn8.7.0_0 pytorch
pytorch-cuda 11.8 h7e8668a_5 pytorch
pytorch-mutex 1.0 cuda pytorch
torchtriton 2.0.0 py39 pytorch
torchvision 0.15.2 py39_cu118 pytorch

CUDA_HOME:

$ echo $CUDA_HOME
/usr/local/cuda-11.8/
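
For anyone trying to compare their setup against the environment listed above, a small script along these lines could gather the same details in one place. It is only a sketch: the function name `describe_environment` is hypothetical, and it assumes the conda `pytorch` package is visible to Python under the distribution name `torch`, which may not hold for every install.

```python
import os
import sys
from importlib import metadata


def describe_environment(packages=("torch", "torchvision")):
    """Collect interpreter, package, and CUDA_HOME details for a bug report."""
    report = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            # Looks up installed distribution metadata without importing it.
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    # CUDA_HOME is read from the environment, like `echo $CUDA_HOME`.
    report["CUDA_HOME"] = os.environ.get("CUDA_HOME", "not set")
    return report


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running it inside the activated `mvp` environment prints one `name: value` line per entry, which can be pasted into a report as text instead of a screenshot.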

Some modifications:
image

train.py
image

requirement.txt
image

[For both deform_pool_cuda.cpp and deform_conv_cuda.cpp, substitute every "AT_CHECK" with "TORCH_CHECK"]
deform_pool_cuda.cpp
image

deform_conv_cuda.cpp
image
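
Since the actual edits are only visible as screenshots, the AT_CHECK to TORCH_CHECK substitution described above could be reproduced with a sketch like the following. The helper names `replace_at_check` and `patch_file` are hypothetical, and the location of the two `.cpp` files depends on the checkout, so any path passed in is an assumption.

```python
import re
from pathlib import Path

# Match AT_CHECK only as a whole identifier, so names such as
# MY_AT_CHECK or AT_CHECK_FOO are left untouched.
AT_CHECK = re.compile(r"\bAT_CHECK\b")


def replace_at_check(source: str) -> str:
    """Return the source text with every AT_CHECK renamed to TORCH_CHECK."""
    return AT_CHECK.sub("TORCH_CHECK", source)


def patch_file(path: Path) -> int:
    """Rewrite one file in place and return the number of replacements."""
    text = path.read_text()
    patched, count = AT_CHECK.subn("TORCH_CHECK", text)
    if count:
        # Only touch the file when something actually changed.
        path.write_text(patched)
    return count
```

Calling `patch_file(Path(...))` once for `deform_pool_cuda.cpp` and once for `deform_conv_cuda.cpp` would apply the same rename to both files; the extensions then need to be rebuilt for the change to take effect.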

Log:

-> Command:
(mvp) $ torchrun --nproc_per_node=2 ./tools/train.py ./configs/mvp/nusc_centerpoint_voxelnet_0075voxel_fix_bn_z_scale_virtual.py

-> Log file CenterPoint/work_dirs/nusc_centerpoint_voxelnet_0075voxel_fix_bn_z_scale_virtual/20230709_212542.log
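
To locate the first NaN in a log file like this one without scrolling through it, a scan along these lines may help. It is a sketch that assumes the line layout seen in this log (an `Epoch [e/E][i/N]` header line followed by `task : [...]` lines); `first_nan` is a hypothetical helper, not part of CenterPoint.

```python
import re
from datetime import datetime

# Header lines look like:
# "2023-07-10 02:07:07,907 - INFO - Epoch [1/20][6840/15448]<TAB>lr: ..."
HEADER = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ - INFO - "
    r"Epoch \[(\d+)/\d+\]\[(\d+)/\d+\]"
)
# Task lines carry the per-task loss right after the task list:
# "task : ['car'], loss: nan, hm_loss: ..."
TASK_LOSS = re.compile(r"task : \[.*?\], loss: (\S+),")


def first_nan(lines):
    """Return (timestamp, epoch, iteration) for the first logged block whose
    task loss is NaN, or None if no task line reports a NaN loss."""
    current = None
    for line in lines:
        header = HEADER.match(line)
        if header:
            stamp = datetime.strptime(header.group(1), "%Y-%m-%d %H:%M:%S")
            current = (stamp, int(header.group(2)), int(header.group(3)))
            continue
        task = TASK_LOSS.search(line)
        if task and current and task.group(1).lower() == "nan":
            return current
    return None
```

Passing an open file object works, for example `with open(log_path) as f: print(first_nan(f))`, where `log_path` is whatever path the log was written to. Because this log prints every 5 iterations, and the fractional `num_positive` values suggest the printed numbers are averages over that interval, the reported iteration is only an upper bound on where the NaN first appeared.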

2023-07-09 21:25:42,078 - INFO - Start running, host: ..., work_dir: ...CenterPoint/work_dirs/nusc_centerpoint_voxelnet_0075voxel_fix_bn_z_scale_virtual
2023-07-09 21:25:42,078 - INFO - workflow: [('train', 1)], max: 20 epochs
2023-07-09 21:27:00,427 - INFO - Epoch [1/20][5/15448]	lr: 0.00010, eta: 56 days, 0:44:16, time: 15.669, data_time: 6.412, transfer_time: 0.023, forward_time: 0.589, loss_parse_time: 0.000 memory: 7003, 
2023-07-09 21:27:00,427 - INFO - task : ['car'], loss: 16.5108, hm_loss: 5.3816, loc_loss: 44.5170, loc_loss_elem: ['3.9960', '4.4768', '13.8939', '3.5633', '3.9826', '2.5209', '3.9625', '4.9418', '6.3818', '3.9209'], num_positive: 54.0000
2023-07-09 21:27:00,428 - INFO - task : ['truck', 'construction_vehicle'], loss: 33.3630, hm_loss: 21.8048, loc_loss: 46.2328, loc_loss_elem: ['4.5750', '7.1211', '7.1248', '5.1617', '3.3515', '5.1795', '4.3006', '8.2093', '5.3650', '5.8522'], num_positive: 27.6000
2023-07-09 21:27:00,428 - INFO - task : ['bus', 'trailer'], loss: 46.8869, hm_loss: 39.0528, loc_loss: 31.3363, loc_loss_elem: ['3.1959', '4.1191', '4.2974', '2.6856', '3.9514', '3.1382', '4.2657', '7.0581', '4.3812', '3.3026'], num_positive: 22.2000
2023-07-09 21:27:00,428 - INFO - task : ['barrier'], loss: 38.3416, hm_loss: 19.3048, loc_loss: 76.1475, loc_loss_elem: ['17.8052', '7.7582', '11.3469', '7.6502', '6.7226', '6.7809', '6.5041', '6.0544', '10.2082', '5.3636'], num_positive: 38.0000
2023-07-09 21:27:00,428 - INFO - task : ['motorcycle', 'bicycle'], loss: 32.1542, hm_loss: 19.0562, loc_loss: 52.3918, loc_loss_elem: ['6.7211', '6.6712', '15.9350', '3.7224', '3.3566', '4.6889', '4.7669', '5.6455', '3.9710', '5.2431'], num_positive: 41.2000
2023-07-09 21:27:00,428 - INFO - task : ['pedestrian', 'traffic_cone'], loss: 42.4973, hm_loss: 25.1576, loc_loss: 69.3589, loc_loss_elem: ['6.9013', '6.1598', '14.6911', '4.6068', '5.6175', '5.6351', '9.9897', '10.2892', '8.2892', '13.4024'], num_positive: 28.4000

2023-07-09 21:27:07,797 - INFO - Epoch [1/20][10/15448]	lr: 0.00010, eta: 30 days, 15:36:38, time: 1.474, data_time: 0.578, transfer_time: 0.021, forward_time: 0.369, loss_parse_time: 0.000 memory: 7527, 
2023-07-09 21:27:07,797 - INFO - task : ['car'], loss: 10.8206, hm_loss: 4.2498, loc_loss: 26.2833, loc_loss_elem: ['2.5260', '2.8105', '7.5488', '2.3111', '2.5731', '1.5947', '2.7838', '3.1014', '3.0734', '2.6687'], num_positive: 63.4000
2023-07-09 21:27:07,797 - INFO - task : ['truck', 'construction_vehicle'], loss: 19.5815, hm_loss: 13.6572, loc_loss: 23.6973, loc_loss_elem: ['2.4139', '2.9423', '3.6419', '2.8161', '1.9625', '3.2340', '2.6705', '4.2424', '2.6038', '2.7002'], num_positive: 29.0000
2023-07-09 21:27:07,797 - INFO - task : ['bus', 'trailer'], loss: 24.2755, hm_loss: 17.1798, loc_loss: 28.3831, loc_loss_elem: ['3.7832', '3.0236', '5.5242', '2.1372', '3.5866', '2.3797', '2.9376', '3.9165', '3.2531', '3.3248'], num_positive: 23.8000
2023-07-09 21:27:07,797 - INFO - task : ['barrier'], loss: 23.1140, hm_loss: 12.7906, loc_loss: 41.2938, loc_loss_elem: ['7.4719', '3.3597', '8.9418', '3.7879', '4.3072', '3.5921', '3.7401', '4.2947', '4.8814', '3.3449'], num_positive: 28.6000
2023-07-09 21:27:07,797 - INFO - task : ['motorcycle', 'bicycle'], loss: 18.0098, hm_loss: 9.4185, loc_loss: 34.3654, loc_loss_elem: ['3.6293', '4.8634', '9.6247', '2.7023', '2.6011', '3.8894', '3.2744', '3.7622', '2.2092', '3.4387'], num_positive: 40.0000
2023-07-09 21:27:07,797 - INFO - task : ['pedestrian', 'traffic_cone'], loss: 22.2361, hm_loss: 11.5873, loc_loss: 42.5953, loc_loss_elem: ['3.4575', '4.7063', '8.8348', '3.5326', '3.2648', '3.1636', '6.7733', '4.9222', '6.1251', '7.1714'], num_positive: 37.8000

2023-07-09 21:27:20,877 - INFO - Epoch [1/20][15/15448]	lr: 0.00010, eta: 23 days, 13:14:31, time: 2.616, data_time: 1.968, transfer_time: 0.020, forward_time: 0.243, loss_parse_time: 0.000 memory: 7766, 
2023-07-09 21:27:20,878 - INFO - task : ['car'], loss: 13.5274, hm_loss: 5.7041, loc_loss: 31.2934, loc_loss_elem: ['3.8299', '3.2160', '7.3882', '3.1934', '2.9355', '2.7282', '3.0090', '3.5887', '3.2064', '3.4763'], num_positive: 42.0000
2023-07-09 21:27:20,878 - INFO - task : ['truck', 'construction_vehicle'], loss: 14.0053, hm_loss: 9.6582, loc_loss: 17.3887, loc_loss_elem: ['1.9723', '2.1698', '2.6261', '1.7503', '1.6135', '2.1626', '2.1163', '2.3670', '1.9544', '2.2428'], num_positive: 36.6000
2023-07-09 21:27:20,878 - INFO - task : ['bus', 'trailer'], loss: 21.6122, hm_loss: 15.7104, loc_loss: 23.6072, loc_loss_elem: ['2.6558', '2.9196', '3.7883', '2.3317', '2.9741', '2.3649', '3.3527', '2.7984', '2.3547', '2.9879'], num_positive: 22.4000
2023-07-09 21:27:20,878 - INFO - task : ['barrier'], loss: 11.5021, hm_loss: 6.7298, loc_loss: 19.0889, loc_loss_elem: ['2.9508', '2.0841', '3.5892', '1.8986', '1.7477', '1.4164', '2.2023', '1.8822', '2.8069', '1.7783'], num_positive: 34.2000
2023-07-09 21:27:20,878 - INFO - task : ['motorcycle', 'bicycle'], loss: 13.0546, hm_loss: 8.3679, loc_loss: 18.7469, loc_loss_elem: ['2.0107', '2.9339', '4.1940', '1.6526', '1.2773', '1.7768', '2.2384', '2.7843', '1.8571', '2.0400'], num_positive: 41.0000
2023-07-09 21:27:20,878 - INFO - task : ['pedestrian', 'traffic_cone'], loss: 18.9593, hm_loss: 12.1488, loc_loss: 27.2421, loc_loss_elem: ['2.8225', '3.0878', '4.6823', '2.6078', '3.1110', '2.6306', '3.2760', '3.4880', '2.9377', '4.0097'], num_positive: 29.8000

2023-07-09 21:27:34,247 - INFO - Epoch [1/20][20/15448]	lr: 0.00010, eta: 20 days, 1:15:53, time: 2.673, data_time: 1.715, transfer_time: 0.021, forward_time: 0.437, loss_parse_time: 0.000 memory: 7766, 
2023-07-09 21:27:34,247 - INFO - task : ['car'], loss: 7.3018, hm_loss: 4.2332, loc_loss: 12.2747, loc_loss_elem: ['1.0438', '1.2418', '3.3763', '0.7369', '1.7604', '0.6061', '1.6393', '2.1004', '1.5989', '1.1626'], num_positive: 49.4000
2023-07-09 21:27:34,247 - INFO - task : ['truck', 'construction_vehicle'], loss: 14.1740, hm_loss: 9.7614, loc_loss: 17.6503, loc_loss_elem: ['1.6838', '2.6925', '2.3009', '1.5748', '1.7685', '2.0444', '1.6115', '2.5455', '1.7810', '2.9730'], num_positive: 30.6000
2023-07-09 21:27:34,247 - INFO - task : ['bus', 'trailer'], loss: 22.9734, hm_loss: 15.6939, loc_loss: 29.1177, loc_loss_elem: ['1.9653', '2.6295', '4.8794', '2.6741', '4.8540', '2.0434', '4.9209', '3.8473', '4.8242', '3.4941'], num_positive: 19.0000
2023-07-09 21:27:34,247 - INFO - task : ['barrier'], loss: 11.5642, hm_loss: 5.5589, loc_loss: 24.0213, loc_loss_elem: ['4.3741', '2.1053', '4.1135', '2.3074', '2.2710', '1.8462', '2.1599', '2.2775', '3.9803', '2.1360'], num_positive: 32.8000
2023-07-09 21:27:34,247 - INFO - task : ['motorcycle', 'bicycle'], loss: 11.9095, hm_loss: 7.4250, loc_loss: 17.9382, loc_loss_elem: ['1.7692', '2.8661', '3.0344', '1.6079', '1.0668', '1.8657', '2.4863', '3.1576', '2.1147', '2.4845'], num_positive: 42.0000
2023-07-09 21:27:34,247 - INFO - task : ['pedestrian', 'traffic_cone'], loss: 13.4539, hm_loss: 8.4542, loc_loss: 19.9990, loc_loss_elem: ['2.2295', '2.9165', '4.0986', '1.7874', '2.1307', '1.4518', '2.2649', '1.8076', '2.1415', '2.4287'], num_positive: 40.6000

.......

2023-07-10 02:06:40,629 - INFO - Epoch [1/20][6830/15448]	lr: 0.00011, eta: 8 days, 15:09:08, time: 2.241, data_time: 0.637, transfer_time: 0.018, forward_time: 1.306, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:06:40,629 - INFO - task : ['car'], loss: 1.5883, hm_loss: 1.0870, loc_loss: 2.0051, loc_loss_elem: ['0.1995', '0.2067', '0.2430', '0.0918', '0.0753', '0.0880', '0.7774', '0.9751', '0.3889', '0.3613'], num_positive: 30.4000
2023-07-10 02:06:40,629 - INFO - task : ['truck', 'construction_vehicle'], loss: 2.0589, hm_loss: 1.4553, loc_loss: 2.4147, loc_loss_elem: ['0.2154', '0.2262', '0.3614', '0.1535', '0.1547', '0.1764', '0.1666', '0.3449', '0.5382', '0.4866'], num_positive: 35.4000
2023-07-10 02:06:40,629 - INFO - task : ['bus', 'trailer'], loss: 1.8101, hm_loss: 1.1679, loc_loss: 2.5687, loc_loss_elem: ['0.2064', '0.2135', '0.3522', '0.1034', '0.0990', '0.1240', '0.6667', '1.1368', '0.5485', '0.5609'], num_positive: 22.4000
2023-07-10 02:06:40,629 - INFO - task : ['barrier'], loss: 1.8850, hm_loss: 1.2924, loc_loss: 2.3704, loc_loss_elem: ['0.1753', '0.1766', '0.2282', '0.1886', '0.2550', '0.1378', '0.0398', '0.0670', '0.7153', '0.4721'], num_positive: 10.0000
2023-07-10 02:06:40,629 - INFO - task : ['motorcycle', 'bicycle'], loss: 1.3836, hm_loss: 0.8034, loc_loss: 2.3206, loc_loss_elem: ['0.1548', '0.1649', '0.1787', '0.1698', '0.1068', '0.1270', '0.6427', '0.9525', '0.5001', '0.5994'], num_positive: 39.6000
2023-07-10 02:06:40,629 - INFO - task : ['pedestrian', 'traffic_cone'], loss: 1.4710, hm_loss: 0.8603, loc_loss: 2.4427, loc_loss_elem: ['0.1435', '0.1566', '0.2066', '0.2206', '0.2492', '0.1684', '0.2490', '0.2981', '0.5378', '0.6507'], num_positive: 33.4000

2023-07-10 02:06:51,529 - INFO - Epoch [1/20][6835/15448]	lr: 0.00011, eta: 8 days, 15:07:52, time: 2.180, data_time: 0.267, transfer_time: 0.019, forward_time: 1.627, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:06:51,529 - INFO - task : ['car'], loss: 1.4195, hm_loss: 0.9496, loc_loss: 1.8796, loc_loss_elem: ['0.1860', '0.2031', '0.2257', '0.0755', '0.0716', '0.0957', '0.4902', '0.4735', '0.4142', '0.4150'], num_positive: 57.4000
2023-07-10 02:06:51,529 - INFO - task : ['truck', 'construction_vehicle'], loss: 2.0764, hm_loss: 1.4540, loc_loss: 2.4893, loc_loss_elem: ['0.2232', '0.2075', '0.3721', '0.1433', '0.1466', '0.1621', '0.4089', '0.4865', '0.5134', '0.5420'], num_positive: 30.6000
2023-07-10 02:06:51,529 - INFO - task : ['bus', 'trailer'], loss: 2.2752, hm_loss: 1.5786, loc_loss: 2.7862, loc_loss_elem: ['0.2167', '0.2029', '0.4624', '0.0892', '0.1081', '0.1145', '0.9012', '1.2101', '0.5217', '0.6485'], num_positive: 20.4000
2023-07-10 02:06:51,529 - INFO - task : ['barrier'], loss: 1.6134, hm_loss: 1.0253, loc_loss: 2.3525, loc_loss_elem: ['0.1645', '0.1437', '0.1958', '0.1344', '0.2058', '0.1068', '0.0269', '0.0427', '0.7815', '0.6062'], num_positive: 12.4000
2023-07-10 02:06:51,529 - INFO - task : ['motorcycle', 'bicycle'], loss: 1.2025, hm_loss: 0.6419, loc_loss: 2.2425, loc_loss_elem: ['0.1503', '0.1640', '0.1258', '0.1651', '0.0968', '0.1134', '0.6563', '0.8597', '0.5131', '0.6108'], num_positive: 42.8000
2023-07-10 02:06:51,529 - INFO - task : ['pedestrian', 'traffic_cone'], loss: 1.2541, hm_loss: 0.6807, loc_loss: 2.2936, loc_loss_elem: ['0.1378', '0.1408', '0.1899', '0.1861', '0.2169', '0.1329', '0.3152', '0.3045', '0.6090', '0.5562'], num_positive: 40.0000

2023-07-10 02:07:07,907 - INFO - Epoch [1/20][6840/15448]	lr: 0.00011, eta: 8 days, 15:10:38, time: 3.276, data_time: 0.328, transfer_time: 0.019, forward_time: 2.667, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:07:07,907 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 46.6000
2023-07-10 02:07:07,907 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 31.4000
2023-07-10 02:07:07,907 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 22.6000
2023-07-10 02:07:07,907 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 37.4000
2023-07-10 02:07:07,907 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 39.0000
2023-07-10 02:07:07,907 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 44.8000

2023-07-10 02:07:18,348 - INFO - Epoch [1/20][6845/15448]	lr: 0.00011, eta: 8 days, 15:09:01, time: 2.088, data_time: 0.341, transfer_time: 0.020, forward_time: 1.452, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:07:18,349 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 46.6000
2023-07-10 02:07:18,349 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 32.8000
2023-07-10 02:07:18,349 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 23.4000
2023-07-10 02:07:18,349 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 11.6000
2023-07-10 02:07:18,349 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 40.2000
2023-07-10 02:07:18,349 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 37.0000

2023-07-10 02:07:28,850 - INFO - Epoch [1/20][6850/15448]	lr: 0.00011, eta: 8 days, 15:07:28, time: 2.100, data_time: 0.304, transfer_time: 0.020, forward_time: 1.506, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:07:28,850 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 40.4000
2023-07-10 02:07:28,850 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 36.2000
2023-07-10 02:07:28,851 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 25.0000
2023-07-10 02:07:28,851 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 20.0000
2023-07-10 02:07:28,851 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 43.0000
2023-07-10 02:07:28,851 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 51.6000

2023-07-10 02:07:38,815 - INFO - Epoch [1/20][6855/15448]	lr: 0.00011, eta: 8 days, 15:05:31, time: 1.993, data_time: 0.209, transfer_time: 0.019, forward_time: 1.508, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:07:38,815 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 38.6000
2023-07-10 02:07:38,815 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 30.0000
2023-07-10 02:07:38,815 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 22.8000
2023-07-10 02:07:38,815 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 19.2000
2023-07-10 02:07:38,815 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 40.8000
2023-07-10 02:07:38,815 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 36.0000

2023-07-10 02:07:56,825 - INFO - Epoch [1/20][6860/15448]	lr: 0.00011, eta: 8 days, 15:09:28, time: 3.602, data_time: 0.233, transfer_time: 0.019, forward_time: 3.080, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:07:56,825 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 41.2000
2023-07-10 02:07:56,825 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 32.0000
2023-07-10 02:07:56,825 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 21.2000
2023-07-10 02:07:56,825 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 21.8000
2023-07-10 02:07:56,826 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 41.0000
2023-07-10 02:07:56,826 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 30.8000

2023-07-10 02:08:06,961 - INFO - Epoch [1/20][6865/15448]	lr: 0.00011, eta: 8 days, 15:07:39, time: 2.027, data_time: 0.507, transfer_time: 0.020, forward_time: 1.235, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:08:06,961 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 46.4000
2023-07-10 02:08:06,962 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 32.0000
2023-07-10 02:08:06,962 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 21.8000
2023-07-10 02:08:06,962 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 35.6000
2023-07-10 02:08:06,962 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 40.8000
2023-07-10 02:08:06,962 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 44.0000

2023-07-10 02:08:16,884 - INFO - Epoch [1/20][6870/15448]	lr: 0.00011, eta: 8 days, 15:05:40, time: 1.984, data_time: 0.392, transfer_time: 0.019, forward_time: 1.312, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:08:16,884 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 50.4000
2023-07-10 02:08:16,884 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 33.8000
2023-07-10 02:08:16,884 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 20.8000
2023-07-10 02:08:16,884 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 25.8000
2023-07-10 02:08:16,884 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 38.0000
2023-07-10 02:08:16,884 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 27.8000

2023-07-10 02:08:27,032 - INFO - Epoch [1/20][6875/15448]	lr: 0.00011, eta: 8 days, 15:03:51, time: 2.030, data_time: 0.397, transfer_time: 0.019, forward_time: 1.355, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:08:27,032 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 46.4000
2023-07-10 02:08:27,032 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 31.4000
2023-07-10 02:08:27,032 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 22.6000
2023-07-10 02:08:27,032 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 19.6000
2023-07-10 02:08:27,032 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 40.4000
2023-07-10 02:08:27,032 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 30.0000

2023-07-10 02:08:45,481 - INFO - Epoch [1/20][6880/15448]	lr: 0.00011, eta: 8 days, 15:08:07, time: 3.690, data_time: 0.764, transfer_time: 0.019, forward_time: 2.643, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:08:45,481 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 50.2000
2023-07-10 02:08:45,481 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 35.0000
2023-07-10 02:08:45,481 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 21.0000
2023-07-10 02:08:45,481 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 23.8000
2023-07-10 02:08:45,481 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 40.4000
2023-07-10 02:08:45,481 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 33.0000

2023-07-10 02:08:56,070 - INFO - Epoch [1/20][6885/15448]	lr: 0.00011, eta: 8 days, 15:06:38, time: 2.118, data_time: 0.302, transfer_time: 0.020, forward_time: 1.524, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:08:56,071 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 50.6000
2023-07-10 02:08:56,071 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 31.6000
2023-07-10 02:08:56,071 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 23.0000
2023-07-10 02:08:56,071 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 43.4000
2023-07-10 02:08:56,071 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 41.0000
2023-07-10 02:08:56,071 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 49.8000

2023-07-10 02:09:05,727 - INFO - Epoch [1/20][6890/15448]	lr: 0.00011, eta: 8 days, 15:04:28, time: 1.931, data_time: 0.227, transfer_time: 0.020, forward_time: 1.411, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:09:05,728 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 45.8000
2023-07-10 02:09:05,728 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 30.2000
2023-07-10 02:09:05,728 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 22.0000
2023-07-10 02:09:05,728 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 30.0000
2023-07-10 02:09:05,728 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 40.4000
2023-07-10 02:09:05,728 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 39.6000

2023-07-10 02:09:15,867 - INFO - Epoch [1/20][6895/15448]	lr: 0.00011, eta: 8 days, 15:02:39, time: 2.028, data_time: 0.142, transfer_time: 0.020, forward_time: 1.608, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:09:15,867 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 44.2000
2023-07-10 02:09:15,867 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 31.2000
2023-07-10 02:09:15,867 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 23.2000
2023-07-10 02:09:15,867 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 27.4000
2023-07-10 02:09:15,867 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 40.4000
2023-07-10 02:09:15,867 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 28.8000

2023-07-10 02:09:33,447 - INFO - Epoch [1/20][6900/15448]	lr: 0.00011, eta: 8 days, 15:06:17, time: 3.516, data_time: 0.200, transfer_time: 0.019, forward_time: 3.026, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:09:33,448 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 62.0000
2023-07-10 02:09:33,448 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 32.8000
2023-07-10 02:09:33,448 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 19.0000
2023-07-10 02:09:33,448 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 12.6000
2023-07-10 02:09:33,448 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 40.2000
2023-07-10 02:09:33,448 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 46.4000

2023-07-10 02:09:43,988 - INFO - Epoch [1/20][6905/15448]	lr: 0.00011, eta: 8 days, 15:04:45, time: 2.108, data_time: 0.194, transfer_time: 0.019, forward_time: 1.637, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:09:43,989 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 46.8000
2023-07-10 02:09:43,989 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 30.0000
2023-07-10 02:09:43,989 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 20.4000
2023-07-10 02:09:43,989 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 41.2000
2023-07-10 02:09:43,989 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 39.0000
2023-07-10 02:09:43,989 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 50.8000

2023-07-10 02:09:54,640 - INFO - Epoch [1/20][6910/15448]	lr: 0.00011, eta: 8 days, 15:03:19, time: 2.130, data_time: 0.172, transfer_time: 0.019, forward_time: 1.671, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:09:54,641 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 47.6000
2023-07-10 02:09:54,641 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 32.6000
2023-07-10 02:09:54,641 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 21.8000
2023-07-10 02:09:54,641 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 34.8000
2023-07-10 02:09:54,641 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 42.6000
2023-07-10 02:09:54,641 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 47.0000

2023-07-10 02:10:04,009 - INFO - Epoch [1/20][6915/15448]	lr: 0.00011, eta: 8 days, 15:00:57, time: 1.874, data_time: 0.090, transfer_time: 0.019, forward_time: 1.504, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:10:04,009 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 46.8000
2023-07-10 02:10:04,009 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 33.0000
2023-07-10 02:10:04,009 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 20.4000
2023-07-10 02:10:04,009 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 20.8000
2023-07-10 02:10:04,009 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 39.6000
2023-07-10 02:10:04,009 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 42.8000

2023-07-10 02:10:22,903 - INFO - Epoch [1/20][6920/15448]	lr: 0.00011, eta: 8 days, 15:05:31, time: 3.779, data_time: 0.239, transfer_time: 0.020, forward_time: 3.257, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:10:22,903 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 34.0000
2023-07-10 02:10:22,903 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 34.4000
2023-07-10 02:10:22,903 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 23.0000
2023-07-10 02:10:22,903 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 41.2000
2023-07-10 02:10:22,903 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 40.8000
2023-07-10 02:10:22,903 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 34.8000

2023-07-10 02:10:33,636 - INFO - Epoch [1/20][6925/15448]	lr: 0.00011, eta: 8 days, 15:04:09, time: 2.147, data_time: 0.043, transfer_time: 0.020, forward_time: 1.819, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:10:33,637 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 47.4000
2023-07-10 02:10:33,637 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 33.8000
2023-07-10 02:10:33,637 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 24.0000
2023-07-10 02:10:33,637 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 25.6000
2023-07-10 02:10:33,637 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 42.2000
2023-07-10 02:10:33,637 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 38.0000

2023-07-10 02:10:42,867 - INFO - Epoch [1/20][6930/15448]	lr: 0.00011, eta: 8 days, 15:01:41, time: 1.846, data_time: 0.041, transfer_time: 0.019, forward_time: 1.519, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:10:42,867 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 36.4000
2023-07-10 02:10:42,868 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 34.2000
2023-07-10 02:10:42,868 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 20.4000
2023-07-10 02:10:42,868 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 30.4000
2023-07-10 02:10:42,868 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 41.2000
2023-07-10 02:10:42,868 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 45.8000

2023-07-10 02:10:55,155 - INFO - Epoch [1/20][6935/15448]	lr: 0.00011, eta: 8 days, 15:01:26, time: 2.458, data_time: 0.087, transfer_time: 0.020, forward_time: 2.091, loss_parse_time: 0.000 memory: 7964, 
2023-07-10 02:10:55,156 - INFO - task : ['car'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 48.8000
2023-07-10 02:10:55,156 - INFO - task : ['truck', 'construction_vehicle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 30.0000
2023-07-10 02:10:55,156 - INFO - task : ['bus', 'trailer'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 20.8000
2023-07-10 02:10:55,156 - INFO - task : ['barrier'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 23.0000
2023-07-10 02:10:55,156 - INFO - task : ['motorcycle', 'bicycle'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 40.6000
2023-07-10 02:10:55,156 - INFO - task : ['pedestrian', 'traffic_cone'], loss: nan, hm_loss: nan, loc_loss: nan, loc_loss_elem: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], num_positive: 40.6000

Then the program ends with this error:

[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1049167, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803980 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1049166, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803981 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1575427 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 1575428) of binary: .../.conda/envs/mvp/bin/python
Traceback (most recent call last):
  File ".../.conda/envs/mvp/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')())
  File "..../.conda/envs/mvp/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File .....conda/envs/mvp/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "....../.conda/envs/mvp/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "......./.conda/envs/mvp/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "......./.conda/envs/mvp/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
========================================================
./tools/train.py FAILED
--------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-10_02:41:36
  host      : AI-3090
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 1575428)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 1575428
========================================================
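
To narrow down where the loss first turns into NaN, the log file can be scanned for the first iteration header followed by a task line with `loss: nan`. The following is only a rough sketch that assumes the log format shown above; the function name `first_nan_iteration` and the two regular expressions are illustrative and not part of the CenterPoint codebase. Because the log shown prints every 5 iterations, the result is only accurate to that logging interval.

```python
import re

# Matches the iteration header, e.g. "Epoch [1/20][6900/15448]".
HEADER = re.compile(r"Epoch \[(\d+)/\d+\]\[(\d+)/\d+\]")
# Matches a per-task line whose total loss is NaN, e.g. "task : ['car'], loss: nan,".
NAN_TASK = re.compile(r"task : (\[.*?\]), loss: nan\b")


def first_nan_iteration(lines):
    """Return (epoch, iteration, task) for the first NaN loss, or None if there is none.

    Hypothetical helper: it assumes each iteration header is followed by
    its per-task loss lines, as in the log excerpt above.
    """
    current = None
    for line in lines:
        header = HEADER.search(line)
        if header:
            # Remember which epoch/iteration the following task lines belong to.
            current = (int(header.group(1)), int(header.group(2)))
            continue
        task = NAN_TASK.search(line)
        if task and current is not None:
            return current + (task.group(1),)
    return None
```

It could be used as `first_nan_iteration(open(path_to_log))`, where `path_to_log` is the `.log` file under `work_dirs`. Checking the last finite loss values just before that iteration may show whether the loss diverged gradually or became NaN abruptly.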

Could anyone help me solve this problem?
@tianweiy

@tianweiy
Owner

I suggest using OpenPCDet (https://github.com/open-mmlab/OpenPCDet). This codebase is not actively maintained, so newer versions of torch / cuda / apex may have unknown issues.

@friedang

friedang commented Nov 6, 2024

I had the same issue but resolved it by pinning the versions of some crucial packages in requirement.txt. I hope this helps @tianweiy

numba==0.53.1
matplotlib
fire
protobuf==3.17.0
opencv-python==4.5.2.54
opencv-contrib-python==4.5.3.56
pybind11
easydict
open3d-python==0.7.0.0
terminaltables
pytest-runner
addict
pycocotools==2.0.7
imagecorruptions
objgraph
cachetools
descartes
jupyter
matplotlib==3.3.4
motmetrics<=1.1.3
numpy==1.19.5
pandas>=0.24
Pillow<=6.2.1 # Latest Pillow is incompatible with current torchvision, pytorch/vision#1712
pyquaternion>=0.9.5
scikit-learn==0.24.2
Shapely==1.8.5.post1
tqdm
pyyaml==6.0.1
requests
nuscenes-devkit==1.0.5
transforms3d==0.4.1
screeninfo==0.8.1
spconv-cu113==2.1.16
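
Before starting a long training run, it may be worth confirming that the environment actually matches these pins. The following is a minimal sketch using the standard-library `importlib.metadata`; the helper name `find_mismatches` and the `PINS` subset are illustrative assumptions, and only exact `==` pins from the list above are checked.

```python
from importlib.metadata import PackageNotFoundError, version


def find_mismatches(pins, get_version=version):
    """Return {package: (expected, installed_or_None)} for every pin that is not met.

    Hypothetical helper: `get_version` defaults to importlib.metadata.version
    and can be swapped out for testing.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


# A few of the exact pins from the list above (not the full list).
PINS = {
    "numba": "0.53.1",
    "numpy": "1.19.5",
    "protobuf": "3.17.0",
    "spconv-cu113": "2.1.16",
}

if __name__ == "__main__":
    for name, (expected, installed) in find_mismatches(PINS).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

An empty result from `find_mismatches(PINS)` only means the checked packages match; it does not cover range pins such as `Pillow<=6.2.1`.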
