
CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx` after the first couple of epochs #580

Open
tastyminerals opened this issue Nov 1, 2019 · 29 comments


@tastyminerals

tastyminerals commented Nov 1, 2019

I am training a version of unet with joint classification and semantic segmentation using the O1 opt-level. Training crashes after I explicitly cast box_coord_tensor in the roi_pool function.

rois = roi_pool(
        input=classification_feature_map_tensor, # FLOAT16 
        boxes=box_coord_tensor.half(), # FLOAT32 IF NOT CAST EXPLICITLY
        output_size=roi_size,
        spatial_scale=1,
)

The thing is, classification_feature_map_tensor comes in as float16 since it is handled by amp, while box_coord_tensor comes from the input batch, which is float32. However, roi_pool requires both tensors to have the same precision and throws

RuntimeError: Expected tensor for argument #1 'input' to have the same type as tensor for argument #2 'rois'; but type Variable[CUDAHalfType] does not equal Variable[CUDAFloatType] (while checking arguments for ROIPool_forward_cuda) (checkSameType at /pytorch/aten/src/ATen/TensorUtils.cpp:140)

But if I cast box_coord_tensor to float16, CUDA throws the memory access error below.

  File "/usr/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/usr/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "/usr/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 241, in post_backward_no_master_weights
    post_backward_models_are_masters(scaler, params, stashed_grads)
  File "/usr/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 120, in post_backward_models_are_masters
    scale_override=grads_have_scale/out_scale)
  File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 119, in unscale
    self.unscale_python(model_grads, master_grads, scale)
  File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 89, in unscale_python
    self.dynamic)
  File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 9, in scale_check_overflow_python
    cpu_sum = float(model_grad.float().sum())
RuntimeError: CUDA error: an illegal memory access was encountered

Is there anything I could try? So far all attempts result in the error above.

@mcarilli
Contributor

mcarilli commented Nov 3, 2019

When in doubt, always prefer casting to FP32. In this case (I think) you're calling into a custom torchvision op that may not have an FP16 implementation. Cast both inputs to FP32 instead of FP16 and it should work.
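A minimal sketch of that FP32 workaround, for illustration only (the commented register_float_function line assumes apex's function-registry API and would have to run before amp.initialize):

import torchvision.ops
from apex import amp

def fp32_roi_pool(feature_map, boxes, output_size, spatial_scale=1.0):
    # Run the custom op in FP32 regardless of the dtype amp hands us,
    # then cast the result back so the surrounding FP16 graph is unchanged.
    out = torchvision.ops.roi_pool(
        input=feature_map.float(),
        boxes=boxes.float(),
        output_size=output_size,
        spatial_scale=spatial_scale,
    )
    return out.to(feature_map.dtype)

# Alternative sketch: register the op with amp so that O1 patching casts its
# inputs to FP32 automatically (call this before amp.initialize()).
# amp.register_float_function(torchvision.ops, 'roi_pool')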

@tastyminerals
Author

tastyminerals commented Nov 3, 2019

I cast everything to float32

rois = roi_pool(
    input=classification_feature_map_tensor.float(), 
    boxes=box_coord_tensor.float(),
    output_size=self.roi_size,
    spatial_scale=1,
)

The roi_pool now works and passes, but an exception is thrown in apex here

with amp.scale_loss(loss, self.optimizer) as scaled_loss:
    scaled_loss.backward() # exception is thrown

inside the training loop below

        for epoch in range(1, self.num_epochs + 1):
            logger.info(f"running epoch {epoch}")
            avg_train_loss = 0

            self.model.train()
            for step, sample_batch in enumerate(self.train_data, start=1):
                sample_batch = self._sample_to_device(sample_batch)
                self.optimizer.zero_grad()

                doc_id_batch = sample_batch[DOC_ID]

                logits_dict = self.model(sample_batch)
                loss = self.criterion(logits_dict, sample_batch)
                with amp.scale_loss(loss, self.optimizer) as scaled_loss:
                    scaled_loss.backward()  # exception is thrown

                self.optimizer.step()

                avg_train_loss += loss.item()

            epoch_end_time = timeit.default_timer()
            epoch_time = epoch_end_time - epoch_start_time

@tastyminerals
Author

tastyminerals commented Nov 4, 2019

Below are some training logs with O2, just before the crash. Note that epoch 1 completed, though with a nan loss.

2019-11-04 10:35:43,186 - INFO - __main__ - starting training
2019-11-04 10:35:43,186 - INFO - net.train.trainer - starting training of model, going to train 100 epochs
2019-11-04 10:35:43,190 - INFO - net.train.trainer - running epoch 1
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1024.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 512.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 256.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 128.0
2019-11-04 10:35:53,378 - INFO - net.train.trainer - epoch 1; average train loss nan; processed 10 batches in 10.19 seconds, 1.02 sec per batch on average
2019-11-04 10:35:53,379 - INFO - net.train.trainer - epoch 1; starting validation
2019-11-04 10:35:56,085 - INFO - net.train.trainer - epoch 1: validation loss nan
2019-11-04 10:35:56,085 - INFO - net.train.trainer - epoch 1: validation loss did not decrease, patience left 9
2019-11-04 10:35:56,085 - INFO - net.train.trainer - running epoch 2
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 64.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16.0
(...)
  File "/usr/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "/usr/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 182, in post_backward_with_master_weights
    models_are_masters=False)
  File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 119, in unscale
    self.unscale_python(model_grads, master_grads, scale)
  File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 89, in unscale_python
    self.dynamic)
  File "/usr/lib/python3.7/site-packages/apex/amp/scaler.py", line 9, in scale_check_overflow_python
    cpu_sum = float(model_grad.float().sum())
RuntimeError: CUDA error: an illegal memory access was encountered

Now with O3 we get a little bit further, with a crash while summing the validation loss.

Selected optimization level O3:  Pure FP16 training.
Defaults for this optimization level are:
enabled                : True
opt_level              : O3
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : False
master_weights         : False
loss_scale             : 1.0
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O3
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : False
master_weights         : False
loss_scale             : 1.0
Warning:  multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback.  Original ImportError was: ImportError('/usr/lib/python3.7/site-packages/amp_C.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC1ENS_14SourceLocationERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE')
2019-11-04 13:19:25,347 - INFO - __main__ - starting training
2019-11-04 13:19:25,347 - INFO - net.train.trainer - starting training of model, going to train 100 epochs
2019-11-04 13:19:25,351 - INFO - net.train.trainer - running epoch 1
2019-11-04 13:19:35,604 - INFO - net.train.trainer - epoch 1; average train loss 3.7108697175979612; processed 10 batches in 10.25 seconds, 1.03 sec per batch on average
2019-11-04 13:19:35,605 - INFO - net.train.trainer - epoch 1; starting validation
2019-11-04 13:19:38,362 - INFO - net.train.trainer - epoch 1: validation loss 3.0665794213612876
2019-11-04 13:19:38,362 - INFO - net.train.trainer - epoch 1: better model found, new best validation loss: 3.0665794213612876
2019-11-04 13:19:38,367 - INFO - net.train.trainer - running epoch 2
2019-11-04 13:19:48,451 - INFO - net.train.trainer - epoch 2; average train loss 2.4132291316986083; processed 10 batches in 10.08 seconds, 1.01 sec per batch on average
2019-11-04 13:19:48,451 - INFO - net.train.trainer - epoch 2; starting validation
2019-11-04 13:19:51,411 - INFO - net.train.trainer - epoch 2: validation loss 2.798730452855428
2019-11-04 13:19:51,411 - INFO - net.train.trainer - epoch 2: better model found, new best validation loss: 2.798730452855428
2019-11-04 13:19:51,416 - INFO - net.train.trainer - running epoch 3
...
  File "/home/user/net/train/trainer.py", line 138, in train
    avg_train_loss += loss.item()
RuntimeError: CUDA error: an illegal memory access was encountered

Running the training with CUDA_LAUNCH_BLOCKING=1 gives us:

   trained_model_state, optimizer_state, metrics = trainer.train()
  File "/home/user/net/train/trainer.py", line 131, in train
    scaled_loss.backward()
  File "/home/user/.local/lib/python3.7/site-packages/torch/tensor.py", line 150, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/user/.local/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`

@tastyminerals
Author

tastyminerals commented Nov 4, 2019

Could it be related to this? Does it mean that we are running out of memory? But nvidia-smi shows that we are using only about 50% of the GPU.

2.1.10. GEMM Algorithms Numerical Behavior
Some GEMM algorithms split the computation along the dimension K to increase the GPU occupancy, especially when the dimension K is large compared to dimensions M and N. When this type of algorithm is chosen by the cuBLAS heuristics or explicitly by the user, the results of each split are summed deterministically into the resulting matrix to get the final result.
For the routines cublas<t>gemmEx and cublasGemmEx, when the compute type is greater than the output type, the sum of the split chunks can potentially lead to some intermediate overflows thus producing a final resulting matrix with some overflows. Those overflows might not have occurred if all the dot products had been accumulated in the compute type before being converted at the end to the output type.
This computation side-effect can be easily exposed when the computeType is CUDA_R_32F and Atype, Btype and Ctype are in CUDA_R_16F.
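As a rough illustration of the overflow the docs describe (not the exact split-K path cuBLAS takes): an FP16 output saturates to inf once the true value exceeds 65504, even if the partial sums were accumulated in FP32. A toy sketch, assuming a CUDA device is available:

import torch

# FP16 can only represent values up to 65504; a dot product whose true value
# exceeds that becomes inf when written to an FP16 output, even if the
# accumulation itself happened in FP32.
print(torch.finfo(torch.float16).max)  # 65504.0

if torch.cuda.is_available():
    a = torch.full((1, 4096), 8.0, dtype=torch.float16, device="cuda")
    b = torch.full((4096, 1), 8.0, dtype=torch.float16, device="cuda")
    print((a @ b).item())  # true value 4096 * 64 = 262144 -> inf in FP16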

@mcarilli
Contributor

mcarilli commented Nov 4, 2019

I don't think it's running out of memory. With O1, for the backward pass (#580 (comment)) does it error on the very first backward pass? And what is the exception trace that is thrown?

@tastyminerals
Author

tastyminerals commented Nov 4, 2019

Correct, with O1 it fails on the first backward pass. With O2 it finishes two epochs and with O3 finishes three epochs. With O0 it does not crash.
Below is the run with O1 opt-level.

CUDA_LAUNCH_BLOCKING=1 python train.py --config-file config/config.gin --log-level INFO                                                                                                 
2019-11-04 19:29:08,258 - INFO - __main__ - setting random seed to 42
2019-11-04 19:29:08,258 - INFO - __main__ - setting up train data
2019-11-04 19:29:08,264 - INFO - __main__ - split data with valid fraction 0.2 --> # train data: 40, # valid data: 10
2019-11-04 19:29:08,268 - INFO - net.utils.class_weights - calculating class weights with c=1.04 for box weights and c=1.04 for segmentation weights
2019-11-04 19:29:16,816 - INFO - net.utils.class_weights - calculated box class weights: tensor([ 1.5608, 21.2831, 22.9914, 16.3494, 23.2191, 21.6754, 25.2760, 25.3858,
        23.1732, 25.0054, 19.9499, 10.7810, 19.6184, 20.9051])
2019-11-04 19:29:16,817 - INFO - net.utils.class_weights - calculated segmentation class weights: tensor([0.0821, 0.1714, 0.1662, 0.1396, 0.1677, 0.1864, 0.1912, 0.2489, 0.1080])
2019-11-04 19:29:16,832 - INFO - __main__ - setting up loss function
2019-11-04 19:29:16,832 - INFO - __main__ - combining loss by sum with box loss weight 1.0 and segmentation loss weight 1.0
2019-11-04 19:29:16,832 - INFO - __main__ - setting up model
2019-11-04 19:29:16,891 - INFO - __main__ - setting up trainer instance
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Warning:  multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback.  Original ImportError was: ImportError('/usr/lib/python3.7/site-packages/amp_C.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC1ENS_14SourceLocationERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE')
2019-11-04 19:29:22,263 - INFO - __main__ - starting training
2019-11-04 19:29:22,263 - INFO - net.train.trainer - starting training of model, going to train 100 epochs
2019-11-04 19:29:22,263 - INFO - net.train.trainer - running epoch 1
...
  File "train.py", line 267, in train
    trained_model_state, optimizer_state, metrics = trainer.train()
  File "/home/user/net/train/trainer.py", line 132, in train
    scaled_loss.backward()
  File "/home/user/.local/lib/python3.7/site-packages/torch/tensor.py", line 150, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/user/.local/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`

@tastyminerals
Author

tastyminerals commented Nov 4, 2019

According to these docs, CUBLAS_STATUS_EXECUTION_FAILED means "the function failed to launch on the GPU". I wonder what the possible reasons for that could be, since the function launches on the GPU several times before it crashes.

Batch size does not change the behavior. I also tried running with nightly pytorch builds, same results. I tried running on different machines (GTX 1070 and GTX 1080 Ti), same error. The apex imagenet example runs without errors though, so it is something with our model.

@tastyminerals tastyminerals changed the title "CUDA error: an illegal memory access" with explicit cast to float16 CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx after couple of first epochs Nov 4, 2019
@tastyminerals tastyminerals changed the title CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx after couple of first epochs CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx after couple of first epochs Nov 4, 2019
@ptrblck
Contributor

ptrblck commented Nov 7, 2019

@tastyminerals Are you using variable input sizes, i.e. are some inputs larger than others?
If so, could it be related to this issue?
If you are using CUDA 10.0, could you please update to 10.1 and check whether it works?
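For reference, a quick way to check which CUDA toolkit and cuDNN the installed PyTorch build is using (standard torch attributes; the values in the comments are only examples):

import torch

print(torch.__version__)               # e.g. 1.3.0
print(torch.version.cuda)              # CUDA toolkit the wheel was built against, e.g. '10.1'
print(torch.backends.cudnn.version())  # e.g. 7603
print(torch.cuda.get_device_name(0))   # e.g. 'GeForce GTX 1080 Ti'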

@anjani-dhrangadhariya

anjani-dhrangadhariya commented Nov 7, 2019

I get a similar error with the forward pass. After some batches, it gives the following error(s).

Sometimes it is error 1 and sometimes it is error 2 or error 3.
Sometimes the error is thrown after processing the 1st batch and sometimes at the 2nd, 9th, 13th, 17th, or 21st batch.

Error 1
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

Error 2
RuntimeError: CUDA error: device-side assert triggered

Error 3
RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCBlas.cu:258

Maybe this issue discussion can bring more perspective to it.

@tastyminerals
Author

tastyminerals commented Nov 9, 2019

I managed to train the model without crashing (at least reaching the 10th epoch) with batch_size=1 and the O2 opt-level. Anything else leads to an exception.

batch_size=1, opt-level=O1 --> crashes after a couple of epochs
batch_size=1, opt-level=O2 --> works fine
batch_size=1, opt-level=O3 --> crashes after a couple of epochs

batch_size=2, opt-level=O1 --> crashes after a couple of epochs
batch_size=2, opt-level=O2 --> crashes after a couple of epochs
batch_size=2, opt-level=O3 --> crashes after a couple of epochs

Unfortunately, even though I am able to train with O2, the loss is still nan right after the first epoch :(

Defaults for this optimization level are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Warning:  multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback.  Original ImportError was: ImportError('/usr/lib/python3.7/site-packages/amp_C.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC1ENS_14SourceLocationERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE')
2019-11-09 19:31:13,427 - INFO - __main__ - starting training
2019-11-09 19:31:13,427 - INFO - unet.train.trainer - starting training of model, going to train 100 epochs
2019-11-09 19:31:13,429 - INFO - unet.train.trainer - running epoch 1
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1024.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 512.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 256.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 128.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 64.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.5
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.25
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.0625
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.03125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.015625
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.0078125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.00390625
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.001953125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.0009765625
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.00048828125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.000244140625
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.0001220703125
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 6.103515625e-05
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.0517578125e-05
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.52587890625e-05
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.62939453125e-06
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.814697265625e-06
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.9073486328125e-06
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 9.5367431640625e-07
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.76837158203125e-07
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.384185791015625e-07
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.1920928955078125e-07
2019-11-09 19:31:23,699 - INFO - unet.train.trainer - epoch 1; average train loss nan; processed 40 batches in 10.27 seconds, 0.26 sec per batch on average
2019-11-09 19:31:23,699 - INFO - unet.train.trainer - epoch 1; starting validation
2019-11-09 19:31:26,067 - INFO - unet.train.trainer - epoch 1: validation loss nan
2019-11-09 19:31:26,068 - INFO - unet.train.trainer - epoch 1: validation loss did not decrease, patience left 9
2019-11-09 19:31:26,068 - INFO - unet.train.trainer - running epoch 2
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5.960464477539063e-08
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.9802322387695312e-08
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.4901161193847656e-08
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.450580596923828e-09
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.725290298461914e-09
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.862645149230957e-09
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 9.313225746154785e-10
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.656612873077393e-10
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.3283064365386963e-10
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.1641532182693481e-10
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5.820766091346741e-11
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.9103830456733704e-11
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.4551915228366852e-11
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.275957614183426e-12
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.637978807091713e-12
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.8189894035458565e-12
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 9.094947017729282e-13
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.547473508864641e-13
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.2737367544323206e-13
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.1368683772161603e-13
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5.684341886080802e-14
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.842170943040401e-14
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.4210854715202004e-14
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.105427357601002e-15
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.552713678800501e-15
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.7763568394002505e-15
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8.881784197001252e-16
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.440892098500626e-16
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.220446049250313e-16
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.1102230246251565e-16
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5.551115123125783e-17
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.7755575615628914e-17
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.3877787807814457e-17
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 6.938893903907228e-18
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.469446951953614e-18
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.734723475976807e-18
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8.673617379884035e-19
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.336808689942018e-19
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.168404344971009e-19
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.0842021724855044e-19
2019-11-09 19:31:36,441 - INFO - unet.train.trainer - epoch 2; average train loss nan; processed 40 batches in 10.37 seconds, 0.26 sec per batch on average
2019-11-09 19:31:36,442 - INFO - unet.train.trainer - epoch 2; starting validation
2019-11-09 19:31:38,790 - INFO - unet.train.trainer - epoch 2: validation loss nan
2019-11-09 19:31:38,791 - INFO - unet.train.trainer - epoch 2: validation loss did not decrease, patience left 8
2019-11-09 19:31:38,791 - INFO - unet.train.trainer - running epoch 3
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5.421010862427522e-20
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.710505431213761e-20
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.3552527156068805e-20
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 6.776263578034403e-21
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.3881317890172014e-21
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.6940658945086007e-21
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8.470329472543003e-22
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.235164736271502e-22
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.117582368135751e-22
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.0587911840678754e-22
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5.293955920339377e-23
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.6469779601696886e-23
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.3234889800848443e-23
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 6.617444900424222e-24
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.308722450212111e-24
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.6543612251060553e-24
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8.271806125530277e-25
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.1359030627651384e-25
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.0679515313825692e-25
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.0339757656912846e-25
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5.169878828456423e-26
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.5849394142282115e-26
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.2924697071141057e-26
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 6.462348535570529e-27
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.2311742677852644e-27
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.6155871338926322e-27
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8.077935669463161e-28
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.0389678347315804e-28
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.0194839173657902e-28
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.0097419586828951e-28
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5.048709793414476e-29
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.524354896707238e-29
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.262177448353619e-29
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 6.310887241768095e-30
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.1554436208840472e-30
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.5777218104420236e-30
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.888609052210118e-31
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.944304526105059e-31
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.9721522630525295e-31
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 9.860761315262648e-32
2019-11-09 19:31:49,216 - INFO - unet.train.trainer - epoch 3; average train loss nan; processed 40 batches in 10.43 seconds, 0.26 sec per batch on average
2019-11-09 19:31:49,217 - INFO - unet.train.trainer - epoch 3; starting validation
2019-11-09 19:31:51,595 - INFO - unet.train.trainer - epoch 3: validation loss nan
2019-11-09 19:31:51,595 - INFO - unet.train.trainer - epoch 3: validation loss did not decrease, patience left 7
2019-11-09 19:31:51,595 - INFO - unet.train.trainer - running epoch 4
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.930380657631324e-32
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.465190328815662e-32
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.232595164407831e-32
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 6.162975822039155e-33
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.0814879110195774e-33
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.5407439555097887e-33
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.703719777548943e-34
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.851859888774472e-34
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.925929944387236e-34
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 9.62964972193618e-35
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.81482486096809e-35
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.407412430484045e-35
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.2037062152420224e-35
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 6.018531076210112e-36
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.009265538105056e-36
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.504632769052528e-36
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.52316384526264e-37
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.76158192263132e-37
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.88079096131566e-37
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 9.4039548065783e-38
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.70197740328915e-38
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.350988701644575e-38
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.1754943508222875e-38
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5.877471754111438e-39
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.938735877055719e-39
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.4693679385278594e-39
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.346839692639297e-40
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.6734198463196485e-40
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.8367099231598242e-40
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 9.183549615799121e-41
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4.591774807899561e-41
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.2958874039497803e-41
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.1479437019748901e-41
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 5.739718509874451e-42
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.8698592549372254e-42
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.4349296274686127e-42
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.174648137343064e-43
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 3.587324068671532e-43
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.793662034335766e-43
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8.96831017167883e-44
2019-11-09 19:32:02,018 - INFO - unet.train.trainer - epoch 4; average train loss nan; processed 40 batches in 10.42 seconds, 0.26 sec per batch on average
2019-11-09 19:32:02,018 - INFO - unet.train.trainer - epoch 4; starting validation
2019-11-09 19:32:04,435 - INFO - unet.train.trainer - epoch 4: validation loss nan
2019-11-09 19:32:04,436 - INFO - unet.train.trainer - epoch 4: validation loss did not decrease, patience left 6

I have cuda 10.1.243-2, torchvision 0.4.2-3 and pytorch 1.3.0 installed.

@tastyminerals
Author

@tastyminerals Are you using variable input sizes, i.e. are some inputs larger than others?
If so, could it be related to this issue?
If you are using CUDA10.0, could you update to 10.1, please, and check, if it's working?

I cannot reproduce the bug; the code below works fine on my machine.

torch.zeros((16*2**20 - 512)//2 + 1, 1, dtype=torch.float16, device='cuda:0') @ torch.zeros(1, 2, dtype=torch.float16, device='cuda:0')

@ptrblck
Contributor

ptrblck commented Nov 10, 2019

@tastyminerals @someAdjectiveNoun
Could you try to post a (small) code snippet to reproduce this issue?

@anjani-dhrangadhariya

@tastyminerals @someAdjectiveNoun
Could you try to post a (small) code snippet to reproduce this issue?

The problem is solved now. It was actually caused by the BioBERT model I was using. Using the standard BERT in Pytorch works smoothly, so the problem seems to come from BioBERT.

@tastyminerals
Author

tastyminerals commented Dec 29, 2019

@tastyminerals @someAdjectiveNoun
Could you try to post a (small) code snippet to reproduce this issue?

Unfortunately, that would require a custom dataset which we cannot share. We are using a unet model.

@tastyminerals
Author

tastyminerals commented Dec 29, 2019

I pulled the recent apex master and reran the experiments. The previously working batch_size=1, opt-level=O2 configuration has stopped working and now crashes right after the first epoch.
However, there are now some more useful debug messages.

With O1:

Traceback (most recent call last):

  File "train.py", line 339, in main
    train()
  File "/home/pavel/.local/lib/python3.7/site-packages/gin/config.py", line 1073, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/home/pavel/.local/lib/python3.7/site-packages/gin/utils.py", line 49, in augment_exception_message_and_reraise
    six.raise_from(proxy.with_traceback(exception.__traceback__), None)
  File "<string>", line 3, in raise_from
  File "/home/pavel/.local/lib/python3.7/site-packages/gin/config.py", line 1050, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "train.py", line 269, in train
    trained_model_state, optimizer_state, metrics = trainer.train()
  File "/home/pavel/dev/gini/multi-modal-ner/mumo/train/trainer.py", line 127, in train
    scaled_loss.backward()
  File "/home/pavel/miniconda3/envs/gini_torch/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/home/pavel/.local/lib/python3.7/site-packages/apex/amp/handle.py", line 127, in scale_loss
    should_skip = False if delay_overflow_check else loss_scaler.update_scale()
  File "/home/pavel/.local/lib/python3.7/site-packages/apex/amp/scaler.py", line 200, in update_scale
    self._has_overflow = self._overflow_buf.item()
RuntimeError: CUDA error: an illegal memory access was encountered

With O3:

  File "train.py", line 343, in <module>
    main()
  File "/home/pavel/.local/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/pavel/.local/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/pavel/.local/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/pavel/.local/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "train.py", line 339, in main
    train()
  File "/home/pavel/.local/lib/python3.7/site-packages/gin/config.py", line 1073, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/home/pavel/.local/lib/python3.7/site-packages/gin/utils.py", line 49, in augment_exception_message_and_reraise
    six.raise_from(proxy.with_traceback(exception.__traceback__), None)
  File "<string>", line 3, in raise_from
  File "/home/pavel/.local/lib/python3.7/site-packages/gin/config.py", line 1050, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "train.py", line 269, in train
    trained_model_state, optimizer_state, metrics = trainer.train()
  File "/home/pavel/dev/gini/multi-modal-ner/mumo/train/trainer.py", line 131, in train
    avg_train_loss += loss.item()
RuntimeError: CUDA error: an illegal memory access was encountered
  In call to configurable 'train' (<function train at 0x7f0a8829b840>)

Prepending CUDA_LAUNCH_BLOCKING=1

  File "train.py", line 339, in main
    train()
  File "/home/pavel/.local/lib/python3.7/site-packages/gin/config.py", line 1073, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/home/pavel/.local/lib/python3.7/site-packages/gin/utils.py", line 49, in augment_exception_message_and_reraise
    six.raise_from(proxy.with_traceback(exception.__traceback__), None)
  File "<string>", line 3, in raise_from
  File "/home/pavel/.local/lib/python3.7/site-packages/gin/config.py", line 1050, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "train.py", line 269, in train
    trained_model_state, optimizer_state, metrics = trainer.train()
  File "/home/pavel/dev/gini/multi-modal-ner/mumo/train/trainer.py", line 127, in train
    scaled_loss.backward()
  File "/home/pavel/.local/lib/python3.7/site-packages/torch/tensor.py", line 166, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/pavel/.local/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`
  In call to configurable 'train' (<function train at 0x7fd08eb2e840>)

@tastyminerals
Author

tastyminerals commented Dec 29, 2019

Here is the chunk of training code.

        for epoch in range(1, self.num_epochs + 1):

            logger.info(f"running epoch {epoch}")

            avg_train_loss = 0
            epoch_start_time = timeit.default_timer()

            # set model to training mode; eval mode (used for validation) switches off things like dropout
            self.model.train() 

            for step, sample_batch in enumerate(self.train_data, start=1):
                sample_batch = self._sample_to_device(sample_batch)
                self.optimizer.zero_grad()

                doc_id_batch = sample_batch[DOC_ID]
                logits_dict = self.model(sample_batch)  # unet with 1 encoder and 1 decoder
                loss = self.criterion(logits_dict, sample_batch)  # SGD + momentum

                logger.debug(
                    f"epoch {epoch}: step {step}; loss {loss.item()}; doc ids {doc_id_batch}"
                )

                with amp.scale_loss(loss, self.optimizer) as scaled_loss:
                    scaled_loss.backward()

                self.optimizer.step()

                avg_train_loss += loss.item()

            epoch_end_time = timeit.default_timer()
            epoch_time = epoch_end_time - epoch_start_time

            avg_train_loss /= len(self.train_data)

@jbartolozzi

I pulled the recent apex master and reran the experiments. The previously working batch_size=1, opt-level=O2 configuration has stopped working and now crashes right after the first epoch. However, there are now some more useful debug messages. [full tracebacks quoted above]

I'm getting the same results trying to run pix2pixHD training on Quadro RTX 6000

@tastyminerals
Author

tastyminerals commented Jan 29, 2020

@jbartolozzi Quadro RTX 6000 has like 24GB of GPU memory? ... good lord. Did you try to use different batch sizes? Does it crash with batch_size = 1? Does it crash if you reduce the input image resolution?

@jbartolozzi

With opt-level=O0 there's no crashing.
These results are with a batch size of 1.

@tastyminerals
Author

Yeah, opt-level=O0 doesn't crash because it does not modify the model in any way; it is essentially a dry run. But it looks like this ticket won't be solved in the near future.

@tripzero

As with @jbartolozzi, I have tried pix2pixHD with CUDA 10.2 and am getting the same results.

@mcarilli
Contributor

mcarilli commented Jul 7, 2020

#580 (comment) may be fixed by pytorch/pytorch#37569. The fix has been in master for a while, but did not make 1.5.1.

I still recommend moving to torch.cuda.amp. However, if the above PR is the right diagnosis, the problem is not in apex, but in Pytorch's FP16 gemv implementation, so you'll have to update Pytorch whether you choose apex or torch.cuda.amp.
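For reference, a rough sketch of what the training loop posted earlier in this thread might look like under torch.cuda.amp (model, criterion, optimizer and train_data are placeholders taken from that snippet, not a full repro):

from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()

for sample_batch in train_data:
    optimizer.zero_grad()
    with autocast():  # per-op FP16/FP32 casting, replaces apex's O1 patching
        logits_dict = model(sample_batch)
        loss = criterion(logits_dict, sample_batch)
    scaler.scale(loss).backward()  # replaces amp.scale_loss(...).backward()
    scaler.step(optimizer)         # unscales grads and skips the step on inf/nan
    scaler.update()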

@tripzero

tripzero commented Jul 8, 2020

#580 (comment) may be fixed by pytorch/pytorch#37569. The fix has been in master for a while, but did not make 1.5.1.

I still recommend moving to torch.cuda.amp. However, if the above PR is the right diagnosis, the problem is not in apex, but in Pytorch's FP16 gemv implementation, so you'll have to update Pytorch whether you choose apex or torch.cuda.amp.

I may try the pytorch nightly builds as it looks like 1.6 is just around the corner...

@tripzero

tripzero commented Jul 8, 2020

Just tried the pytorch 1.7.0 nightly. While I didn't get a CUBLAS_STATUS_EXECUTION_FAILED error, I did get the "Gradient overflow" messages and my GAN started producing black images :|

@mcarilli
Contributor

mcarilli commented Jul 8, 2020

Make sure you're following the guidance for multiple models/losses/optimizers. (retain_graph in that snippet is present because the two backward passes share some graph sections; it has nothing to do with amp. You may not need retain_graph for your own multi-model network.)

An example GAN training-loop step with proper torch.cuda.amp control flow can be found here, courtesy of @vfdev-5 (https://twitter.com/pytorch_ignite/status/1262721636844920832).

If that doesn't work, file an issue with a minimal repro on Pytorch github and tag me.
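For what it's worth, that multiple-losses/optimizers guidance boils down to one shared GradScaler, one scaled backward per loss, and a single update() per iteration. A bare sketch with placeholder names (optimizer0/1, compute_loss0/1), not code from this thread:

import torch

scaler = torch.cuda.amp.GradScaler()

for batch in loader:
    optimizer0.zero_grad()
    optimizer1.zero_grad()
    with torch.cuda.amp.autocast():
        loss0 = compute_loss0(batch)
        loss1 = compute_loss1(batch)
    # retain_graph only because the two backward passes share graph sections
    scaler.scale(loss0).backward(retain_graph=True)
    scaler.scale(loss1).backward()
    scaler.step(optimizer0)
    scaler.step(optimizer1)
    scaler.update()  # exactly one update() per iteration, after all step() calls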

@seovchinnikov

seovchinnikov commented Jul 9, 2020

Got the same traceback as above (RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`) on pytorch 1.4.x and 1.5.x.

With the latest nightly build (1.7.0.dev20200709), CUDA V10.1.243, apex master (1ff54b8) and cudnn 7.6.3_0, it seems to work fine (no overflows or segfaults) using the apex API on CycleGAN (https://github.com/seovchinnikov/pytorch-CycleGAN-and-pix2pix)

@tripzero

tripzero commented Jul 10, 2020

@mcarilli I converted my training loop to use torch.cuda.amp instead of apex. It runs... but there doesn't seem to be any indication that it's actually using 16-bit floats. Memory usage is identical to non-fp16, as is the speed. Do you know if there is a way to verify amp is working with fp16 correctly?

Here's my modified code from pix2pixHD:

amp_scaler = GradScaler(enabled=opt.fp16)

with autocast(enabled=opt.fp16):
    ############## Forward Pass ######################
    losses, generated = model(Variable(data['label']), inst_map,
                              Variable(data['image']), Variable(data['feat']), infer=save_fake)

    # sum per device losses
    losses = [torch.mean(x) if not isinstance(x, int)
              else x for x in losses]
    loss_dict = dict(zip(model.module.loss_names, losses))

    # calculate final loss scalar
    loss_D = (loss_dict['D_fake'] + loss_dict['D_real']) * 0.5
    loss_G = loss_dict['G_GAN'] + \
        loss_dict.get('G_GAN_Feat', 0) + loss_dict.get('G_VGG', 0)

############### Backward Pass ####################
# update generator weights
optimizer_G.zero_grad()
amp_scaler.scale(loss_G).backward()
amp_scaler.step(optimizer_G)
# if opt.fp16:
#    with amp.scale_loss(loss_G, optimizer_G) as scaled_loss:
#        scaled_loss.backward()
# else:
#    loss_G.backward()
# optimizer_G.step()

# update discriminator weights
optimizer_D.zero_grad()
amp_scaler.scale(loss_D).backward()
amp_scaler.step(optimizer_D)

amp_scaler.update()

Update: using DataParallel, I need to wrap my module's forward in @autocast. It works now... for a while, and then I start getting nan losses :(
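For context, the DataParallel detail above corresponds to re-enabling autocast inside each replica's worker thread by decorating the module's forward. A small sketch with a made-up toy module:

import torch
import torch.nn as nn
from torch.cuda.amp import autocast

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 2)

    @autocast()  # re-enables autocast in each DataParallel worker thread
    def forward(self, x):
        return self.fc(x)

if torch.cuda.is_available():
    model = nn.DataParallel(Net().cuda())
    out = model(torch.randn(4, 8, device="cuda"))
    print(out.dtype)  # torch.float16 for autocast-eligible ops like nn.Linear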

@Jmennius

We had a similar issue with another cuBLAS API (cublasSgemm()), which @anjani-dhrangadhariya also experienced above.

CUDA Toolkit 11.1 release notes mention an issue fixed in cuBLAS:

Fixed an issue that caused an Address out of bounds error when calling cublasSgemm().

cublasSgemm() failed with CUBLAS_STATUS_EXECUTION_FAILED for us when the project was built with CUDA 10.0 and run on an Ampere GPU (3060 Ti). It ran fine on older GPUs (Pascal, Turing).
It ran successfully on Ampere once we built with CUDA 11.2.

Basically - try building against the newest CUDA Toolkit available and see if it helps.

P.S. this was with another framework/project, but should still be relevant.
P.P.S. related issue in PyTorch pytorch/pytorch#29795.

@ShoufaChen
Contributor

Many thanks to this answer:

There may be a mismatch between the dimension of your input tensor and the dimensions of your nn.Linear.
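A toy illustration of that failure mode, with hypothetical shapes rather than code from this thread: on the GPU a last-dimension mismatch going into nn.Linear can surface as an opaque cuBLAS error, so comparing it against in_features is a cheap sanity check:

import torch
import torch.nn as nn

layer = nn.Linear(in_features=512, out_features=10)
x = torch.randn(4, 256)  # wrong on purpose: the last dim should be 512

if x.shape[-1] != layer.in_features:
    print(f"shape mismatch: input has {x.shape[-1]} features, layer expects {layer.in_features}")
else:
    out = layer(x)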
