How to disable SyncBN for single GPU training #255

Closed
yaniv-f opened this issue Jan 4, 2021 · 4 comments
yaniv-f commented Jan 4, 2021

Hi. This is a follow-up to #29. I have a single GPU and I would like to train PointPillars on NuScenes using (for example)
python tools/train.py --no-validate --gpus 1 configs/pointpillars/hv_pointpillars_fpn_sbn-all_4x8_2x_nus-3d.py

As in issue #29, I'm getting the error:
AssertionError: Default process group is not initialized

Could you please provide some details on how to disable SyncBN and use BN instead?

I have an additional question: is there a way to reduce GPU memory usage during training, so that I can train on a GTX 1080 with 8 GB? Will setting samples_per_gpu=1, workers_per_gpu=1 be sufficient?

Thanks

Yaniv


Tai-Wang commented Jan 5, 2021

You can change the configs like this and replace each SyncBN with the corresponding plain BN.

To reduce GPU memory usage, I would suggest trying the fp16 configs or another dataset (like KITTI, which needs smaller models). I cannot guarantee that setting samples_per_gpu=1 will be enough in your case, but you can definitely give it a try.
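For reference, the overrides would look roughly like this (only a sketch: the loss_scale value just follows the pattern of our fp16 configs, and I have not verified that this combination fits in 8 GB):

data = dict(
    samples_per_gpu=1,  # this config defaults to 4
    workers_per_gpu=1)
fp16 = dict(loss_scale=512.)  # enables the mixed-precision optimizer hook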


yaniv-f commented Jan 5, 2021

Hi. Thanks for the quick reply! I replaced the file configs/_base_/models/hv_pointpillars_fpn_nus.py with the one in your reply. I believe indeed all the norm operations now use the naive sync variants:
grep -i norm configs/_base_/models/hv_pointpillars_fpn_nus.py
norm_cfg=dict(type='naiveSyncBN1d', eps=1e-3, momentum=0.01)),
norm_cfg=dict(type='naiveSyncBN2d', eps=1e-3, momentum=0.01),
norm_cfg=dict(type='naiveSyncBN2d', eps=1e-3, momentum=0.01),

However, I am still seeing the same issue. The log of the training run is below:
python tools/train.py --gpus 1 configs/pointpillars/hv_pointpillars_fpn_sbn-all_4x8_2x_nus-3d.py &> train_log.txt

Please note I have installed the most recent NVIDIA driver, CUDA and PyTorch. The data preparation phase for NuScenes completed OK (it took a very long time).
nvidia-smi output:
NVIDIA-SMI 460.27.04 Driver Version: 460.27.04 CUDA Version: 11.2

For some reason, I see the following in the log:
gpu_ids = range(0, 1)

2021-01-05 11:50:51,677 - mmdet - INFO - Environment info:

sys.platform: linux
Python: 3.7.9 (default, Aug 31 2020, 12:42:55) [GCC 7.3.0]
CUDA available: True
GPU 0: Tesla V100-PCIE-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.2.r11.2/compiler.29373293_0
GCC: gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0
PyTorch: 1.7.1
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.0
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_37,code=compute_37
  • CuDNN 8.0.5
  • Magma 2.5.2
  • Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.8.2
OpenCV: 4.5.1
MMCV: 1.2.5
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.2
MMDetection: 2.8.0
MMDetection3D: 0.9.0+d006848

2021-01-05 11:50:51,678 - mmdet - INFO - Distributed training: False
2021-01-05 11:50:52,763 - mmdet - INFO - Config:
voxel_size = [0.25, 0.25, 8]
model = dict(
    type='MVXFasterRCNN',
    pts_voxel_layer=dict(
        max_num_points=64,
        point_cloud_range=[-50, -50, -5, 50, 50, 3],
        voxel_size=[0.25, 0.25, 8],
        max_voxels=(30000, 40000)),
    pts_voxel_encoder=dict(
        type='HardVFE',
        in_channels=4,
        feat_channels=[64, 64],
        with_distance=False,
        voxel_size=[0.25, 0.25, 8],
        with_cluster_center=True,
        with_voxel_center=True,
        point_cloud_range=[-50, -50, -5, 50, 50, 3],
        norm_cfg=dict(type='naiveSyncBN1d', eps=0.001, momentum=0.01)),
    pts_middle_encoder=dict(
        type='PointPillarsScatter', in_channels=64, output_shape=[400, 400]),
    pts_backbone=dict(
        type='SECOND',
        in_channels=64,
        norm_cfg=dict(type='naiveSyncBN2d', eps=0.001, momentum=0.01),
        layer_nums=[3, 5, 5],
        layer_strides=[2, 2, 2],
        out_channels=[64, 128, 256]),
    pts_neck=dict(
        type='FPN',
        norm_cfg=dict(type='naiveSyncBN2d', eps=0.001, momentum=0.01),
        act_cfg=dict(type='ReLU'),
        in_channels=[64, 128, 256],
        out_channels=256,
        start_level=0,
        num_outs=3),
    pts_bbox_head=dict(
        type='Anchor3DHead',
        num_classes=10,
        in_channels=256,
        feat_channels=256,
        use_direction_classifier=True,
        anchor_generator=dict(
            type='AlignedAnchor3DRangeGenerator',
            ranges=[[-50, -50, -1.8, 50, 50, -1.8]],
            scales=[1, 2, 4],
            sizes=[[0.866, 2.5981, 1.0], [0.5774, 1.7321, 1.0],
                   [1.0, 1.0, 1.0], [0.4, 0.4, 1]],
            custom_values=[0, 0],
            rotations=[0, 1.57],
            reshape_out=True),
        assigner_per_size=False,
        diff_rad_by_sin=True,
        dir_offset=0.7854,
        dir_limit_offset=0,
        bbox_coder=dict(type='DeltaXYZWLHRBBoxCoder', code_size=9),
        loss_cls=dict(
            type='FocalLoss',
            use_sigmoid=True,
            gamma=2.0,
            alpha=0.25,
            loss_weight=1.0),
        loss_bbox=dict(
            type='SmoothL1Loss', beta=0.1111111111111111, loss_weight=1.0),
        loss_dir=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.2)))
train_cfg = dict(
    pts=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            iou_calculator=dict(type='BboxOverlapsNearest3D'),
            pos_iou_thr=0.6,
            neg_iou_thr=0.3,
            min_pos_iou=0.3,
            ignore_iof_thr=-1),
        allowed_border=0,
        code_weight=[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.2, 0.2],
        pos_weight=-1,
        debug=False))
test_cfg = dict(
    pts=dict(
        use_rotate_nms=True,
        nms_across_levels=False,
        nms_pre=1000,
        nms_thr=0.2,
        score_thr=0.05,
        min_bbox_size=0,
        max_num=500))
point_cloud_range = [-50, -50, -5, 50, 50, 3]
class_names = [
    'car', 'truck', 'trailer', 'bus', 'construction_vehicle', 'bicycle',
    'motorcycle', 'pedestrian', 'traffic_cone', 'barrier'
]
dataset_type = 'NuScenesDataset'
data_root = 'data/nuscenes/'
input_modality = dict(
    use_lidar=True,
    use_camera=False,
    use_radar=False,
    use_map=False,
    use_external=False)
file_client_args = dict(backend='disk')
train_pipeline = [
    dict(
        type='LoadPointsFromFile',
        coord_type='LIDAR',
        load_dim=5,
        use_dim=5,
        file_client_args=dict(backend='disk')),
    dict(
        type='LoadPointsFromMultiSweeps',
        sweeps_num=10,
        file_client_args=dict(backend='disk')),
    dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True),
    dict(
        type='GlobalRotScaleTrans',
        rot_range=[-0.3925, 0.3925],
        scale_ratio_range=[0.95, 1.05],
        translation_std=[0, 0, 0]),
    dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5),
    dict(
        type='PointsRangeFilter', point_cloud_range=[-50, -50, -5, 50, 50, 3]),
    dict(
        type='ObjectRangeFilter', point_cloud_range=[-50, -50, -5, 50, 50, 3]),
    dict(
        type='ObjectNameFilter',
        classes=[
            'car', 'truck', 'trailer', 'bus', 'construction_vehicle',
            'bicycle', 'motorcycle', 'pedestrian', 'traffic_cone', 'barrier'
        ]),
    dict(type='PointShuffle'),
    dict(
        type='DefaultFormatBundle3D',
        class_names=[
            'car', 'truck', 'trailer', 'bus', 'construction_vehicle',
            'bicycle', 'motorcycle', 'pedestrian', 'traffic_cone', 'barrier'
        ]),
    dict(type='Collect3D', keys=['points', 'gt_bboxes_3d', 'gt_labels_3d'])
]
test_pipeline = [
    dict(
        type='LoadPointsFromFile',
        coord_type='LIDAR',
        load_dim=5,
        use_dim=5,
        file_client_args=dict(backend='disk')),
    dict(
        type='LoadPointsFromMultiSweeps',
        sweeps_num=10,
        file_client_args=dict(backend='disk')),
    dict(
        type='MultiScaleFlipAug3D',
        img_scale=(1333, 800),
        pts_scale_ratio=1,
        flip=False,
        transforms=[
            dict(
                type='GlobalRotScaleTrans',
                rot_range=[0, 0],
                scale_ratio_range=[1.0, 1.0],
                translation_std=[0, 0, 0]),
            dict(type='RandomFlip3D'),
            dict(
                type='PointsRangeFilter',
                point_cloud_range=[-50, -50, -5, 50, 50, 3]),
            dict(
                type='DefaultFormatBundle3D',
                class_names=[
                    'car', 'truck', 'trailer', 'bus', 'construction_vehicle',
                    'bicycle', 'motorcycle', 'pedestrian', 'traffic_cone',
                    'barrier'
                ],
                with_label=False),
            dict(type='Collect3D', keys=['points'])
        ])
]
data = dict(
    samples_per_gpu=4,
    workers_per_gpu=4,
    train=dict(
        type='NuScenesDataset',
        data_root='data/nuscenes/',
        ann_file='data/nuscenes/nuscenes_infos_train.pkl',
        pipeline=[
            dict(
                type='LoadPointsFromFile',
                coord_type='LIDAR',
                load_dim=5,
                use_dim=5,
                file_client_args=dict(backend='disk')),
            dict(
                type='LoadPointsFromMultiSweeps',
                sweeps_num=10,
                file_client_args=dict(backend='disk')),
            dict(
                type='LoadAnnotations3D',
                with_bbox_3d=True,
                with_label_3d=True),
            dict(
                type='GlobalRotScaleTrans',
                rot_range=[-0.3925, 0.3925],
                scale_ratio_range=[0.95, 1.05],
                translation_std=[0, 0, 0]),
            dict(type='RandomFlip3D', flip_ratio_bev_horizontal=0.5),
            dict(
                type='PointsRangeFilter',
                point_cloud_range=[-50, -50, -5, 50, 50, 3]),
            dict(
                type='ObjectRangeFilter',
                point_cloud_range=[-50, -50, -5, 50, 50, 3]),
            dict(
                type='ObjectNameFilter',
                classes=[
                    'car', 'truck', 'trailer', 'bus', 'construction_vehicle',
                    'bicycle', 'motorcycle', 'pedestrian', 'traffic_cone',
                    'barrier'
                ]),
            dict(type='PointShuffle'),
            dict(
                type='DefaultFormatBundle3D',
                class_names=[
                    'car', 'truck', 'trailer', 'bus', 'construction_vehicle',
                    'bicycle', 'motorcycle', 'pedestrian', 'traffic_cone',
                    'barrier'
                ]),
            dict(
                type='Collect3D',
                keys=['points', 'gt_bboxes_3d', 'gt_labels_3d'])
        ],
        classes=[
            'car', 'truck', 'trailer', 'bus', 'construction_vehicle',
            'bicycle', 'motorcycle', 'pedestrian', 'traffic_cone', 'barrier'
        ],
        modality=dict(
            use_lidar=True,
            use_camera=False,
            use_radar=False,
            use_map=False,
            use_external=False),
        test_mode=False,
        box_type_3d='LiDAR'),
    val=dict(
        type='NuScenesDataset',
        data_root='data/nuscenes/',
        ann_file='data/nuscenes/nuscenes_infos_val.pkl',
        pipeline=[
            dict(
                type='LoadPointsFromFile',
                coord_type='LIDAR',
                load_dim=5,
                use_dim=5,
                file_client_args=dict(backend='disk')),
            dict(
                type='LoadPointsFromMultiSweeps',
                sweeps_num=10,
                file_client_args=dict(backend='disk')),
            dict(
                type='MultiScaleFlipAug3D',
                img_scale=(1333, 800),
                pts_scale_ratio=1,
                flip=False,
                transforms=[
                    dict(
                        type='GlobalRotScaleTrans',
                        rot_range=[0, 0],
                        scale_ratio_range=[1.0, 1.0],
                        translation_std=[0, 0, 0]),
                    dict(type='RandomFlip3D'),
                    dict(
                        type='PointsRangeFilter',
                        point_cloud_range=[-50, -50, -5, 50, 50, 3]),
                    dict(
                        type='DefaultFormatBundle3D',
                        class_names=[
                            'car', 'truck', 'trailer', 'bus',
                            'construction_vehicle', 'bicycle', 'motorcycle',
                            'pedestrian', 'traffic_cone', 'barrier'
                        ],
                        with_label=False),
                    dict(type='Collect3D', keys=['points'])
                ])
        ],
        classes=[
            'car', 'truck', 'trailer', 'bus', 'construction_vehicle',
            'bicycle', 'motorcycle', 'pedestrian', 'traffic_cone', 'barrier'
        ],
        modality=dict(
            use_lidar=True,
            use_camera=False,
            use_radar=False,
            use_map=False,
            use_external=False),
        test_mode=True,
        box_type_3d='LiDAR'),
    test=dict(
        type='NuScenesDataset',
        data_root='data/nuscenes/',
        ann_file='data/nuscenes/nuscenes_infos_val.pkl',
        pipeline=[
            dict(
                type='LoadPointsFromFile',
                coord_type='LIDAR',
                load_dim=5,
                use_dim=5,
                file_client_args=dict(backend='disk')),
            dict(
                type='LoadPointsFromMultiSweeps',
                sweeps_num=10,
                file_client_args=dict(backend='disk')),
            dict(
                type='MultiScaleFlipAug3D',
                img_scale=(1333, 800),
                pts_scale_ratio=1,
                flip=False,
                transforms=[
                    dict(
                        type='GlobalRotScaleTrans',
                        rot_range=[0, 0],
                        scale_ratio_range=[1.0, 1.0],
                        translation_std=[0, 0, 0]),
                    dict(type='RandomFlip3D'),
                    dict(
                        type='PointsRangeFilter',
                        point_cloud_range=[-50, -50, -5, 50, 50, 3]),
                    dict(
                        type='DefaultFormatBundle3D',
                        class_names=[
                            'car', 'truck', 'trailer', 'bus',
                            'construction_vehicle', 'bicycle', 'motorcycle',
                            'pedestrian', 'traffic_cone', 'barrier'
                        ],
                        with_label=False),
                    dict(type='Collect3D', keys=['points'])
                ])
        ],
        classes=[
            'car', 'truck', 'trailer', 'bus', 'construction_vehicle',
            'bicycle', 'motorcycle', 'pedestrian', 'traffic_cone', 'barrier'
        ],
        modality=dict(
            use_lidar=True,
            use_camera=False,
            use_radar=False,
            use_map=False,
            use_external=False),
        test_mode=True,
        box_type_3d='LiDAR'))
evaluation = dict(interval=24)
optimizer = dict(type='AdamW', lr=0.001, weight_decay=0.01)
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=1000,
    warmup_ratio=0.001,
    step=[20, 23])
momentum_config = None
total_epochs = 24
checkpoint_config = dict(interval=1)
log_config = dict(
    interval=50,
    hooks=[dict(type='TextLoggerHook'),
           dict(type='TensorboardLoggerHook')])
dist_params = dict(backend='nccl')
log_level = 'INFO'
work_dir = './work_dirs/hv_pointpillars_fpn_sbn-all_4x8_2x_nus-3d'
load_from = None
resume_from = None
workflow = [('train', 1)]
gpu_ids = range(0, 1)

2021-01-05 11:50:52,763 - mmdet - INFO - Set random seed to 0, deterministic: False
2021-01-05 11:50:52,828 - mmdet - INFO - Model:
MVXFasterRCNN(
  (pts_voxel_layer): Voxelization(voxel_size=[0.25, 0.25, 8], point_cloud_range=[-50, -50, -5, 50, 50, 3], max_num_points=64, max_voxels=(30000, 40000))
  (pts_voxel_encoder): HardVFE(
    (scatter): DynamicScatter(voxel_size=[0.25, 0.25, 8], point_cloud_range=[-50, -50, -5, 50, 50, 3], average_points=True)
    (vfe_layers): ModuleList(
      (0): VFELayer(
        (norm): NaiveSyncBatchNorm1d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (linear): Linear(in_features=10, out_features=64, bias=False)
      )
      (1): VFELayer(
        (norm): NaiveSyncBatchNorm1d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (linear): Linear(in_features=128, out_features=64, bias=False)
      )
    )
  )
  (pts_middle_encoder): PointPillarsScatter()
  (pts_backbone): SECOND(
    (blocks): ModuleList(
      (0): Sequential(
        (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        (1): NaiveSyncBatchNorm2d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (2): ReLU(inplace=True)
        (3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (4): NaiveSyncBatchNorm2d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (5): ReLU(inplace=True)
        (6): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (7): NaiveSyncBatchNorm2d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (8): ReLU(inplace=True)
        (9): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (10): NaiveSyncBatchNorm2d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (11): ReLU(inplace=True)
      )
      (1): Sequential(
        (0): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        (1): NaiveSyncBatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (2): ReLU(inplace=True)
        (3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (4): NaiveSyncBatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (5): ReLU(inplace=True)
        (6): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (7): NaiveSyncBatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (8): ReLU(inplace=True)
        (9): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (10): NaiveSyncBatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (11): ReLU(inplace=True)
        (12): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (13): NaiveSyncBatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (14): ReLU(inplace=True)
        (15): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (16): NaiveSyncBatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (17): ReLU(inplace=True)
      )
      (2): Sequential(
        (0): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        (1): NaiveSyncBatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (2): ReLU(inplace=True)
        (3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (4): NaiveSyncBatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (5): ReLU(inplace=True)
        (6): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (7): NaiveSyncBatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (8): ReLU(inplace=True)
        (9): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (10): NaiveSyncBatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (11): ReLU(inplace=True)
        (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (13): NaiveSyncBatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (14): ReLU(inplace=True)
        (15): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (16): NaiveSyncBatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (17): ReLU(inplace=True)
      )
    )
  )
  (pts_neck): FPN(
    (lateral_convs): ModuleList(
      (0): ConvModule(
        (conv): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn): NaiveSyncBatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (activate): ReLU()
      )
      (1): ConvModule(
        (conv): Conv2d(128, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn): NaiveSyncBatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (activate): ReLU()
      )
      (2): ConvModule(
        (conv): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn): NaiveSyncBatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (activate): ReLU()
      )
    )
    (fpn_convs): ModuleList(
      (0): ConvModule(
        (conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn): NaiveSyncBatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (activate): ReLU()
      )
      (1): ConvModule(
        (conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn): NaiveSyncBatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (activate): ReLU()
      )
      (2): ConvModule(
        (conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn): NaiveSyncBatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (activate): ReLU()
      )
    )
  )
  (pts_bbox_head): Anchor3DHead(
    (loss_cls): FocalLoss()
    (loss_bbox): SmoothL1Loss()
    (loss_dir): CrossEntropyLoss()
    (conv_cls): Conv2d(256, 80, kernel_size=(1, 1), stride=(1, 1))
    (conv_reg): Conv2d(256, 72, kernel_size=(1, 1), stride=(1, 1))
    (conv_dir_cls): Conv2d(256, 16, kernel_size=(1, 1), stride=(1, 1))
  )
)
2021-01-05 11:51:06,213 - mmdet - INFO - Start running, host: user@host, work_dir: /home/user/mmdetection/mmdetection3d/work_dirs/hv_pointpillars_fpn_sbn-all_4x8_2x_nus-3d
2021-01-05 11:51:06,213 - mmdet - INFO - workflow: [('train', 1)], max: 24 epochs
Traceback (most recent call last):
  File "tools/train.py", line 166, in <module>
    main()
  File "tools/train.py", line 162, in main
    meta=meta)
  File "/home/user/mmdetection/mmdet/apis/train.py", line 150, in train_detector
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/user/conda/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 125, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/user/conda/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True)
  File "/home/user/conda/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
    **kwargs)
  File "/home/user/conda/envs/open-mmlab/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/user/mmdetection/mmdet/models/detectors/base.py", line 246, in train_step
    losses = self(**data)
  File "/home/user/conda/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/conda/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 84, in new_func
    return old_func(*args, **kwargs)
  File "/home/user/mmdetection/mmdetection3d/mmdet3d/models/detectors/base.py", line 59, in forward
    return self.forward_train(**kwargs)
  File "/home/user/mmdetection/mmdetection3d/mmdet3d/models/detectors/mvx_two_stage.py", line 274, in forward_train
    points, img=img, img_metas=img_metas)
  File "/home/user/mmdetection/mmdetection3d/mmdet3d/models/detectors/mvx_two_stage.py", line 208, in extract_feat
    pts_feats = self.extract_pts_feat(points, img_feats, img_metas)
  File "/home/user/mmdetection/mmdetection3d/mmdet3d/models/detectors/mvx_two_stage.py", line 197, in extract_pts_feat
    img_feats, img_metas)
  File "/home/user/conda/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/conda/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 164, in new_func
    return old_func(*args, **kwargs)
  File "/home/user/mmdetection/mmdetection3d/mmdet3d/models/voxel_encoders/voxel_encoder.py", line 444, in forward
    voxel_feats = vfe(voxel_feats)
  File "/home/user/conda/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/conda/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 84, in new_func
    return old_func(*args, **kwargs)
  File "/home/user/mmdetection/mmdetection3d/mmdet3d/models/voxel_encoders/utils.py", line 86, in forward
    x = self.norm(x.permute(0, 2, 1).contiguous()).permute(0, 2,
  File "/home/user/conda/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/conda/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 164, in new_func
    return old_func(*args, **kwargs)
  File "/home/user/mmdetection/mmdetection3d/mmdet3d/ops/norm.py", line 57, in forward
    if dist.get_world_size() == 1 or not self.training:
  File "/home/user/conda/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 638, in get_world_size
    return _get_group_size(group)
  File "/home/user/conda/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 220, in _get_group_size
    _check_default_pg()
  File "/home/user/conda/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 211, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized
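Looking at the last frames, the failing check is in mmdet3d/ops/norm.py, line 57:

if dist.get_world_size() == 1 or not self.training:

torch.distributed.get_world_size() asserts whenever no default process group has been initialized, which is always the case for a plain (non-distributed) tools/train.py run, so the naiveSyncBN layers apparently cannot be used at all without a distributed launcher. If I understand correctly, a guard along these lines would avoid the assertion (just a sketch of the idea with a hypothetical helper name, not a tested patch):

import torch.distributed as dist

def get_world_size_or_one():
    # Stand-in for dist.get_world_size() that reports a world size of 1
    # when torch.distributed has no default process group, i.e. when the
    # script was launched without a distributed launcher.
    if dist.is_available() and dist.is_initialized():
        return dist.get_world_size()
    return 1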

I'm also seeing the same issue on a 2nd machine with dual 2080Ti. I will try to get logs.

Thanks and best regards,

Yaniv


Tai-Wang commented Jan 5, 2021

I think you may have misunderstood my suggestion: you need to replace all the lines that use naiveSyncBN[x]d with BN[x]d. For example, change this line to norm_cfg=dict(type='BN1d').
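Concretely, the three lines from your grep would become the following (keeping the same eps and momentum; BN1d and BN2d are the standard BatchNorm layers registered in mmcv):

norm_cfg=dict(type='BN1d', eps=1e-3, momentum=0.01)),  # pts_voxel_encoder
norm_cfg=dict(type='BN2d', eps=1e-3, momentum=0.01),   # pts_backbone
norm_cfg=dict(type='BN2d', eps=1e-3, momentum=0.01),   # pts_neck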


yaniv-f commented Jan 5, 2021

Hi. Thanks for the clarification! I was able to start training on my single-GPU system, so I'm closing this issue. Thanks again for the quick response and best regards,

Yaniv

yaniv-f closed this as completed Jan 5, 2021