Segmentation fault on CUDA 11.0/torch 1.7.1 #1
Comments
I think CUDA 11 is supported. Try my distributed launch script with the number of GPUs set to 1.
Best,
Xiaoyang
… On August 20, 2021, at 11:51 AM, Yuxuan Liu ***@***.***> wrote:
Thank you for your great contribution.
CUDA 11.0?
I managed to compile everything, including spconv, in a Docker container with CUDA 11.0 and PyTorch 1.7.1 (spconv showed no errors during build or install).
But at the first training step, the run ends with this error:
CUDA_VISIBLE_DEVICES=0 ./scripts/dist_train.sh 1 exp_name configs/stereo/kitti_models/liga.3d-and-bev.yaml
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'tools/train.py', '--local_rank=0', '--launcher', 'pytorch', '--fix_random_seed', '--sync_bn', '--save_to_file', '--cfg_file', 'configs/stereo/kitti_models/liga.3d-and-bev.yaml', '--exp_name', 'exp_name']' died with <Signals.SIGSEGV: 11>.
Then I rewrote your code for single-GPU training without distributed training (the rewritten code is in my fork). Everything else is the same, and the result is a segmentation fault.
python3 tools/train.py --cfg configs/stereo/kitti_models/liga.3d-and-bev.yaml --launcher=none --batch_size 1
Segmentation fault (core dumped)
I have not yet fully investigated where it happens.
CUDA 10
I then tried a lower CUDA version, but the 3090 only supports CUDA 11+, and the current model is too large to fit on a single 1080Ti/2080Ti (similar to DSGN?).
|
In my first try, I used the original launch script and it failed without any additional information.
CUDA_VISIBLE_DEVICES=0 ./scripts/dist_train.sh 1 exp_name configs/stereo/kitti_models/liga.3d-and-bev.yaml
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'tools/train.py', '--local_rank=0', '--launcher', 'pytorch', '--fix_random_seed', '--sync_bn', '--save_to_file', '--cfg_file', 'configs/stereo/kitti_models/liga.3d-and-bev.yaml', '--exp_name', 'exp_name']' died with <Signals.SIGSEGV: 11>.
I then ran without distributed mode to track down the error, and it turned out to be a segmentation fault. |
|
That's weird. Usually it outputs more error messages. By the way, did you pull the latest commit? |
The error happened here:
x = self.conv_input(input_sp_tensor)
However, I did not see any errors while compiling and installing spconv.
>>> torch.__version__
'1.7.1+cu110'
>>> torch.version.cuda
'11.0' |
The possible reasons might be:
Can you double-check these? |
The problem may be that my nvcc version is 11.1 while everything else is 11.0. |
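A mismatch like this can be checked programmatically. The sketch below is not part of LIGA-Stereo or spconv; the helper names `parse_nvcc_release`, `toolkit_matches`, and `report` are hypothetical. It compares `torch.version.cuda` with the release string printed by `nvcc --version`, assuming both are available in the environment. A mismatch does not prove that it causes the segmentation fault, but it is cheap to rule out when extensions are built from source with the local nvcc.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output):
    """Return the 'major.minor' release from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def toolkit_matches(torch_cuda, nvcc_release):
    """True only when both versions are known and agree on major.minor."""
    if torch_cuda is None or nvcc_release is None:
        return False
    return torch_cuda.split(".")[:2] == nvcc_release.split(".")[:2]


def report():
    """Print the CUDA version PyTorch was built with next to the local nvcc release."""
    try:
        import torch  # optional here, so the helpers stay usable without PyTorch
        torch_cuda = torch.version.cuda
    except ImportError:
        torch_cuda = None
    nvcc_release = None
    if shutil.which("nvcc"):
        result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
        nvcc_release = parse_nvcc_release(result.stdout)
    print("torch CUDA:", torch_cuda,
          "| nvcc:", nvcc_release,
          "| match:", toolkit_matches(torch_cuda, nvcc_release))


if __name__ == "__main__":
    report()
```

With the versions reported in this thread (PyTorch built for 11.0, nvcc 11.1), `toolkit_matches("11.0", "11.1")` returns `False`.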
I think you can use the latest PyTorch version. |
@Owen-Liuyuxuan |
Sorry I have not been working on this for a while :( and have not tried that. |
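Docker environment:
torch==1.9.1+cu111 torchvision==0.10.1+cu111 mmcv-full=1.2.0 nvcc==11.1.TC455_06 on a RTX 3090 server.
run command:
CUDA_VISIBLE_DEVICES=0 ./scripts/dist_train.sh 1 exp_name configs/stereo/kitti_models/liga.3d-and-bev.yaml
+ python3 -m torch.distributed.launch --nproc_per_node=1 tools/train.py --launcher pytorch --fix_random_seed --sync_bn --save_to_file --cfg_file configs/stereo/kitti_models/liga.3d-and-bev.yaml --exp_name exp_name
freezes and no output.
ctrl+c: not much useful information comes out.
run command:
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py --launcher none --fix_random_seed --save_to_file --cfg_file configs/stereo/kitti_models/liga.3d-and-bev.yaml --exp_name debug
It starts but still produces a segmentation fault and stops here, similar to the original result |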
Can you try running the code step by step to see which step fails?
Best,
Xiaoyang
… On September 28, 2021, at 3:06 PM, Yuxuan Liu ***@***.***> wrote:
Docker environment:
torch==1.9.1+cu111 torchvision==0.10.1+cu111 mmcv-full=1.2.0 nvcc==11.1.TC455_06 on a RTX 3090 server.
run command:
CUDA_VISIBLE_DEVICES=0 ./scripts/dist_train.sh 1 exp_name configs/stereo/kitti_models/liga.3d-and-bev.yaml
+ python3 -m torch.distributed.launch --nproc_per_node=1 tools/train.py --launcher pytorch --fix_random_seed --sync_bn --save_to_file --cfg_file configs/stereo/kitti_models/liga.3d-and-bev.yaml --exp_name exp_name
freezes and no output.
ctrl+c: not much useful information comes out.
run command:
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py --launcher none --fix_random_seed --save_to_file --cfg_file configs/stereo/kitti_models/liga.3d-and-bev.yaml --exp_name debug
It starts but still produces a segmentation fault and stops here, similar to the original result
|
I have tried that (by synchronizing and printing along the way), and it stops here:
x = self.conv_input(input_sp_tensor)
https://github.com/xy-guo/LIGA-Stereo/blob/master/liga/models/backbones_3d_lidar/spconv_backbone.py#L385 |
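The "sync and print" approach described above could look roughly like the following. This is a minimal sketch, not the code actually used in the thread; `traced_step` and `_log` are hypothetical names. Each message is flushed immediately so that the last line printed survives a crash, and an optional `sync` callable (for example `torch.cuda.synchronize`) runs before the exit message so that asynchronous CUDA work finishes inside the step being traced.

```python
import contextlib
import sys


def _log(message):
    """Write to stderr and flush immediately so the line is not lost on a crash."""
    print(message, file=sys.stderr, flush=True)


@contextlib.contextmanager
def traced_step(name, sync=None, log=_log):
    """Log entry to and exit from a step; if no exit line appears, the step crashed."""
    log(f"[enter] {name}")
    yield
    if sync is not None:
        sync()  # e.g. torch.cuda.synchronize, to wait for queued GPU kernels
    log(f"[exit] {name}")
```

Wrapping the suspect call as `with traced_step("conv_input", sync=torch.cuda.synchronize): x = self.conv_input(input_sp_tensor)` would leave `[enter] conv_input` as the final log line if the process dies inside that call.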
I'm not sure what causes the problem. I've tested my code on a 3070 notebook and everything is fine. Could Docker be causing the problem? |
Another suggestion: do not use --launcher none; the code only works in distributed mode. |
The problem is that if the code is launched in distributed mode, I cannot get any error message (or any other training logs) and the child process just dies... I have to run in local mode to actually debug. |
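One way to get at least a Python traceback from a child process that dies with SIGSEGV is the standard-library `faulthandler` module. The sketch below is an assumption about how it could be wired in, not something the repository does; `enable_crash_tracebacks`, the log directory, and the file naming are hypothetical. It writes the traceback to a per-rank file, so it does not depend on the launcher forwarding the child's stderr.

```python
import faulthandler
import os


def enable_crash_tracebacks(log_dir, rank=0):
    """Dump Python tracebacks for fatal signals (SIGSEGV, SIGABRT, ...) to a per-rank file.

    Returns the log path and the open file object; the file must stay open
    for as long as faulthandler is enabled.
    """
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, f"fault_rank{rank}.log")
    log_file = open(path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return path, log_file
```

Calling this at the top of the training script, with the local rank passed in, should record which Python line was executing when the native code crashed. Setting the `PYTHONFAULTHANDLER=1` environment variable or running with `python3 -X faulthandler` enables the same handler on stderr without code changes.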
|
Have you solved the problem? Maybe you can try using the latest commit of spconv? |
I followed your advice, but the result is the same as before. I am now on CUDA 10.2, with spconv installed via the official 'pip install spconv-cu102'. I will try it with CUDA 11.1. |
Hi, I faced this problem too. My env is: ubuntu=20.0.6, python=3.7, cuda=11.1, pytorch=1.7.1. My GPU is RTX 8000.
Command I ran was:
Pip list is as follows:
addict 2.4.0
The error logs are as follows:
size mismatch for layer3.0.conv1.weight: copying a param with shape torch.Size([256, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 128, 3, 3]).
2022-03-24 22:10:59,122 INFO ********************** Model create finished **********************
Any ideas? Thanks in advance. |
Same fault with CUDA11.1 and pytorch==1.8.0 |
Hi, have you solved this problem? I get the same error messages. |
Same problem with nvcc 10.1, nvidia-smi 10.2, pytorch 1.6.0 + cudatoolkit 10.1, mmcvfull 1.2.1, mmdet 2.6.0 and graphic cards Tesla V100s |