CUDA OOM leads to unhandled thrust::system exception #357
Comments
Thanks for the suggestion. I have added THRUST_CHECKs to the coordinate initialization functions. A failure now throws std::runtime_error, which pybind11 converts to a Python error.
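For illustration, a check of this kind might look roughly like the sketch below. It is not MinkowskiEngine's actual THRUST_CHECK definition: the macro name CHECKED_CALL and the helper functions are hypothetical, and std::system_error stands in for thrust::system::system_error so that the example builds without CUDA.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <system_error>

// Hypothetical helper (not MinkowskiEngine's actual THRUST_CHECK): runs a
// statement and rethrows any std::system_error as std::runtime_error with the
// file and line attached. std::system_error is only a stand-in for
// thrust::system::system_error so the sketch compiles without CUDA.
#define CHECKED_CALL(stmt)                                   \
  do {                                                       \
    try {                                                    \
      stmt;                                                  \
    } catch (const std::system_error &e) {                   \
      std::ostringstream msg;                                \
      msg << __FILE__ << ":" << __LINE__ << " " << e.what(); \
      throw std::runtime_error(msg.str());                   \
    }                                                        \
  } while (0)

// Stand-in for a device allocation that can fail with an out-of-memory error.
inline void allocate_coordinates(bool fail) {
  if (fail) {
    throw std::system_error(std::make_error_code(std::errc::not_enough_memory),
                            "allocation failed");
  }
}

// Returns the translated error message, or an empty string if nothing was thrown.
inline std::string try_allocate(bool fail) {
  try {
    CHECKED_CALL(allocate_coordinates(fail));
  } catch (const std::runtime_error &e) {
    return e.what();
  }
  return "";
}
```

A wrapper like this only helps when the exception actually propagates to the catch block. If the runtime calls std::terminate before that point, which is what the rest of this thread describes, no handler gets a chance to run.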
Thank you, I merged the changes (and the wrapper it to the same functions in the I implemented a custom
This raises Then re-compiled the code in debug mode and the same error occurs:
I also tested intentionally throwing an exception from the same function call, and it was correctly raised and I could catch it in Python. So this
For testing I removed all noexcept clauses from I'll keep investigating the issue, as this kills most of our trainings (we have a very high number of points). My best guess is that some of the 3rdparty
Hmm, I couldn't reproduce the error with my Titan RTX. I tried changing the number of points from 10M to 220M, but no luck. Anyway, I pushed another commit to cover the thrust errors for
I did the same with the decompose calls before, but it didn't solve the issue. I'll follow up with the core dump to see the traceback, because I can't see where it all goes wrong. (Everything is guarded, but I'm still getting a terminate, which should not be possible.)
Ah sorry, I found that I didn't wrap
Thanks for the update, I had a deepdive with
Results in the same error as before:
I tried adding a Traceback:
So there is an exception raised in the block, but it's not caught by the As a workaround, my idea is to check beforehand that we have enough space for the
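A pre-check along these lines could be sketched as follows, assuming the required byte count can be estimated up front. The function names are hypothetical. In a CUDA build the free byte count would typically come from cudaMemGetInfo; here it is a plain parameter so that the example compiles without CUDA.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical pre-check: throws a catchable std::runtime_error when the
// estimated requirement plus a safety margin exceeds the free device memory.
// In a CUDA build, free_bytes would come from cudaMemGetInfo(&free, &total).
inline void require_free_memory(std::size_t required_bytes,
                                std::size_t free_bytes,
                                std::size_t margin_bytes = 0) {
  // Compare by subtraction so that required + margin cannot overflow.
  if (free_bytes < margin_bytes ||
      free_bytes - margin_bytes < required_bytes) {
    throw std::runtime_error(
        "not enough free device memory: need " +
        std::to_string(required_bytes) + " bytes, have " +
        std::to_string(free_bytes) + " bytes free");
  }
}

// Convenience wrapper that reports the result as a bool instead of throwing.
inline bool has_free_memory(std::size_t required_bytes,
                            std::size_t free_bytes,
                            std::size_t margin_bytes = 0) {
  try {
    require_free_memory(required_bytes, free_bytes, margin_bytes);
  } catch (const std::runtime_error &) {
    return false;
  }
  return true;
}
```

Such a check is only a heuristic: free memory can change between the query and the allocation, and fragmentation can make an allocation fail even when the total free size looks sufficient.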
I can also reproduce the issue with an RTX 2070 and a point count of 2500000.
Hi Chris, We've reproduced the issue in pure thrust code, so it's not a problem with MinkowskiEngine. I raised the issue in the thrust repo here: NVIDIA/thrust#1448. The issue is no longer present with CUDA 11.0+, so we are migrating the codebase over where possible. Thanks for looking into it.
Thanks @evelkey for the update. I'll close the issue and put a note in the readme.
Describe the bug
ME raises a C++ thrust::system::system_error exception which cannot be caught from Python and crashes the program. The issue occurs non-deterministically during training (especially in long-running trainings, after a few days) and causes the training pipeline to fail.
As parallel_for is not used directly in the repo, most likely one of the functions in MinkowskiConvolution uses a thrust builtin function which utilizes it. This function call should be wrapped with THRUST_CHECK, like CUDA_CHECK, to create an exception which can be interpreted in Python.
To Reproduce
The problem is GPU dependent. The code below deterministically produces the error on a 16 GB Tesla V100 GPU. To reproduce on other GPUs (mostly dependent on VRAM size), one needs to find the right point_count in the code below.
Expected behavior
A thrust::system::system_error exception should be converted to a Python RuntimeError or MemoryError so that it can be caught with a try .. except block in Python.
Server (running inside Nvidia Docker):
==========System==========
Linux-5.4.0-1047-aws-x86_64-with-glibc2.10
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.5 LTS"
3.8.5 (default, Sep 4 2020, 07:30:14)
[GCC 7.3.0]
==========Pytorch==========
1.7.1
torch.cuda.is_available(): True
==========NVIDIA-SMI==========
/usr/bin/nvidia-smi
Driver Version 460.73.01
CUDA Version 11.2
VBIOS Version 88.00.4F.00.09
Image Version G503.0201.00.03
==========NVCC==========
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
==========CC==========
/usr/bin/c++
c++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
==========MinkowskiEngine==========
0.5.4 (master of 05/26/2021)
MinkowskiEngine compiled with CUDA Support: True
NVCC version MinkowskiEngine is compiled: 10020
CUDART version MinkowskiEngine is compiled: 10020