CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx` after a couple of first epochs #580
Comments
When in doubt, always prefer casting to FP32. In this case (I think) you're calling into a custom torchvision op that may not have an FP16 implementation. Cast both inputs to FP32 instead of FP16 and it should work.
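For what it's worth, a possible alternative to manual casting (an assumption on my part, not something suggested in this thread): apex exposes `amp.register_float_function`, which tells amp to always run a given op in FP32 and cast its inputs accordingly. A minimal sketch, assuming `roi_pool` is called through `torchvision.ops` (a `from`-import taken before registration would bypass the patch):

```python
import torch
import torchvision
from apex import amp

# Register torchvision's roi_pool as an FP32 op so apex casts its inputs
# automatically; this has to happen before amp.initialize().
amp.register_float_function(torchvision.ops, "roi_pool")

# Dummy model/optimizer just to make the sketch self-contained.
model = torch.nn.Linear(4, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
```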
I cast everything to float32:

```python
rois = roi_pool(
    input=classification_feature_map_tensor.float(),
    boxes=box_coord_tensor.float(),
    output_size=self.roi_size,
    spatial_scale=1,
)
```

The `with amp.scale_loss(loss, self.optimizer) as scaled_loss: scaled_loss.backward()` block is what throws the exception, inside the training loop below:

```python
for epoch in range(1, self.num_epochs + 1):
    logger.info(f"running epoch {epoch}")
    avg_train_loss = 0
    self.model.train()
    for step, sample_batch in enumerate(self.train_data, start=1):
        sample_batch = self._sample_to_device(sample_batch)
        self.optimizer.zero_grad()
        doc_id_batch = sample_batch[DOC_ID]
        logits_dict = self.model(sample_batch)
        loss = self.criterion(logits_dict, sample_batch)
        with amp.scale_loss(loss, self.optimizer) as scaled_loss:
            scaled_loss.backward()  # exception is thrown here
        self.optimizer.step()
        avg_train_loss += loss.item()
    epoch_end_time = timeit.default_timer()
    epoch_time = epoch_end_time - epoch_start_time
```
Below are some training logs with
Now with
Running the training with
Could it be related to this? So does it mean that we are running out of memory? But
I don't think it's running out of memory. With O1, for the backward pass (#580 (comment)), does it error on the very first backward pass? And what is the exception trace that is thrown?
Correct, with
According to these docs
Batch size does not change the behavior. I also tried running with nightly PyTorch builds, same results. I tried running on different machines (GTX 1070 and GTX 1080 Ti), same error. The apex ImageNet example network runs without errors, though, so it is something with our model.
@tastyminerals Are you using variable input sizes, i.e. are some inputs larger than others?
I get a similar error with the forward pass. After some batches, it gives the following error(s); sometimes it is error 1 and sometimes it is error 2 or error 3.
Error 1
Error 2
Error 3
Maybe this issue discussion can bring more perspective to it.
I managed to train the model without crashing (at least reaching the 10th epoch) with
Unfortunately, even though with
I have
I cannot reproduce the bug; the code below works fine on my machine.
@tastyminerals @someAdjectiveNoun
The problem is solved now. How? It was actually caused by the BioBERT model I was using; the standard BERT model in PyTorch works smoothly. The problem seems to come from BioBERT.
Unfortunately, reproducing this would require a custom dataset which we cannot share. We are using a U-Net model.
I pulled the recent apex master and reran the experiments. Now, the previously working
With
With
Prepending
Here is the chunk of training code:

```python
for epoch in range(1, self.num_epochs + 1):
    logger.info(f"running epoch {epoch}")
    avg_train_loss = 0
    epoch_start_time = timeit.default_timer()
    # set model to training mode; validation switches some things like dropout off
    self.model.train()
    for step, sample_batch in enumerate(self.train_data, start=1):
        sample_batch = self._sample_to_device(sample_batch)
        self.optimizer.zero_grad()
        doc_id_batch = sample_batch[DOC_ID]
        logits_dict = self.model(sample_batch)  # unet with 1 encoder and 1 decoder
        loss = self.criterion(logits_dict, sample_batch)  # SGD + momentum
        logger.debug(
            f"epoch {epoch}: step {step}; loss {loss.item()}; doc ids {doc_id_batch}"
        )
        with amp.scale_loss(loss, self.optimizer) as scaled_loss:
            scaled_loss.backward()
        self.optimizer.step()
        avg_train_loss += loss.item()
    epoch_end_time = timeit.default_timer()
    epoch_time = epoch_end_time - epoch_start_time
    avg_train_loss /= len(self.train_data)
```
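For context, the loop above presumably relies on a one-time apex setup that isn't shown in the thread. A minimal sketch of what that setup typically looks like (the attribute names mirror the snippet above and are assumptions):

```python
from apex import amp

# Wrap the model and optimizer once, before the epoch loop; O1 is the opt level
# discussed in this thread. After this call, amp.scale_loss() can be used as above.
self.model, self.optimizer = amp.initialize(
    self.model, self.optimizer, opt_level="O1"
)
```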
I'm getting the same results trying to run pix2pixHD training on a Quadro RTX 6000.
@jbartolozzi The Quadro RTX 6000 has something like 24GB of GPU memory? ... good lord. Did you try different batch sizes? Does it crash with
With opt_level=O0 there's no crashing.
Yeah, the
As with @jbartolozzi, I have tried pix2pixHD with CUDA 10.2 and am getting the same results.
#580 (comment) may be fixed by pytorch/pytorch#37569. The fix has been in master for a while, but did not make 1.5.1. I still recommend moving to torch.cuda.amp. However, if the above PR is the right diagnosis, the problem is not in apex but in PyTorch's FP16 gemv implementation, so you'll have to update PyTorch whether you choose apex or torch.cuda.amp.
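As a rough illustration of that recommendation (not code from this thread), the training loop posted earlier might look like this under torch.cuda.amp, with autocast replacing apex's patching and GradScaler replacing amp.scale_loss:

```python
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for epoch in range(1, self.num_epochs + 1):
    self.model.train()
    for sample_batch in self.train_data:
        sample_batch = self._sample_to_device(sample_batch)
        self.optimizer.zero_grad()
        with autocast():
            logits_dict = self.model(sample_batch)
            loss = self.criterion(logits_dict, sample_batch)
        scaler.scale(loss).backward()   # replaces amp.scale_loss(...).backward()
        scaler.step(self.optimizer)     # skips the step if grads contain inf/NaN
        scaler.update()
```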
I may try the PyTorch nightly builds, as it looks like 1.6 is just around the corner...
Just tried the PyTorch 1.7.0 nightly. While I didn't get a CUBLAS_STATUS_EXECUTION_FAILED error, I did get "Gradient overflow" messages and my GAN started producing black images :|
Make sure you're following the guidance for multiple models/losses/optimizers (an example GAN training-loop step with proper GradScaler usage is shown there). If that doesn't work, file an issue with a minimal repro on the PyTorch GitHub and tag me.
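For reference, a minimal sketch of a GAN step with one GradScaler shared across both optimizers, following the multiple-optimizer guidance (G, D, opt_G, opt_D, loader, latent_dim, and the loss functions are hypothetical placeholders, not the example the comment refers to):

```python
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for real in loader:                      # hypothetical data loader of real images
    real = real.cuda(non_blocking=True)
    noise = torch.randn(real.size(0), latent_dim, device="cuda")

    # --- discriminator update (fake detached so G is untouched) ---
    opt_D.zero_grad()
    with autocast():
        fake = G(noise)
        loss_D = d_loss_fn(D(real), D(fake.detach()))
    scaler.scale(loss_D).backward()
    scaler.step(opt_D)

    # --- generator update ---
    opt_G.zero_grad()
    with autocast():
        loss_G = g_loss_fn(D(fake))
    scaler.scale(loss_G).backward()
    scaler.step(opt_G)

    # one update() per iteration, after all step() calls
    scaler.update()
```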
Got the same stack trace as above (RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling
With the latest nightly build (1.7.0.dev20200709), CUDA V10.1.243, apex master (1ff54b8), and cudnn 7.6.3_0, it seems to be working fine (no overflows or segfaults) using the apex API on CycleGAN (https://github.com/seovchinnikov/pytorch-CycleGAN-and-pix2pix).
@mcarilli I converted my training loop to use torch.cuda.amp instead of apex. It runs... but there doesn't seem to be any indication that it's actually using 16-bit floats. Memory usage is identical to non-fp16, as is the speed. Do you know of a way to verify amp is working with fp16 correctly? Here's my modified code from pix2pixHD:

```python
from torch.cuda.amp import autocast, GradScaler  # added: imports used below

amp_scaler = GradScaler(enabled=opt.fp16)

with autocast(enabled=opt.fp16):
    ############## Forward Pass ######################
    losses, generated = model(Variable(data['label']), inst_map,
                              Variable(data['image']), Variable(data['feat']),
                              infer=save_fake)

    # sum per-device losses
    losses = [torch.mean(x) if not isinstance(x, int) else x for x in losses]
    loss_dict = dict(zip(model.module.loss_names, losses))

    # calculate final loss scalars
    loss_D = (loss_dict['D_fake'] + loss_dict['D_real']) * 0.5
    loss_G = loss_dict['G_GAN'] + \
        loss_dict.get('G_GAN_Feat', 0) + loss_dict.get('G_VGG', 0)

############### Backward Pass ####################
# update generator weights
optimizer_G.zero_grad()
amp_scaler.scale(loss_G).backward()
amp_scaler.step(optimizer_G)

# if opt.fp16:
#     with amp.scale_loss(loss_G, optimizer_G) as scaled_loss:
#         scaled_loss.backward()
# else:
#     loss_G.backward()
#     optimizer_G.step()

# update discriminator weights
optimizer_D.zero_grad()
amp_scaler.scale(loss_D).backward()
amp_scaler.step(optimizer_D)

amp_scaler.update()
```

Update: using DataParallel, I needed to wrap the forward of my module in @autocast. It works now... for a while, and then I start getting nan losses :(
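Two quick checks that are easy to miss here (a standalone sketch, not the pix2pixHD code): inside an enabled autocast region, FP16-eligible ops such as linear/conv/matmul should return float16 tensors, and with DataParallel the autocast state is thread-local, so the wrapped module's forward itself has to opt in:

```python
import torch
from torch.cuda.amp import autocast

# 1) Verify autocast is really producing FP16 math.
lin = torch.nn.Linear(8, 8).cuda()
x = torch.randn(4, 8, device="cuda")
with autocast():
    y = lin(x)
print(y.dtype)  # expected: torch.float16 when autocast is enabled

# 2) With DataParallel, decorate the module's forward so each side thread
#    re-enables autocast (hypothetical module for illustration).
class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(8, 8)

    @autocast()
    def forward(self, inp):
        return self.lin(inp)
```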
We had a similar issue with another cuBLAS API (cublasSgemm()), although @anjani-dhrangadhariya also experienced this one. The CUDA Toolkit 11.1 release notes mention an issue fixed in cuBLAS:
We had
Basically, try building against the newest CUDA Toolkit available and see if it helps. P.S. This was with another framework/project, but it should still be relevant.
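Before rebuilding against a newer toolkit, it may help to confirm which CUDA/cuDNN versions the installed PyTorch build actually uses; a small sketch:

```python
import torch

print(torch.__version__)               # PyTorch build
print(torch.version.cuda)              # CUDA toolkit the build was compiled against
print(torch.backends.cudnn.version())  # cuDNN version
print(torch.cuda.get_device_name(0))   # GPU model
```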
Big thanks to this answer. There may be a mismatch between the dimensions of your input tensor and the dimensions of your
I am training a version of U-Net with joint classification and semantic segmentation using the `O1` level. The training crashes after I explicitly cast `box_coord_tensor` in the `roi_pool` function. Thing is, `classification_feature_map_tensor` comes as float16 since it is handled by amp, while `box_coord_tensor` comes from the input batch, which is float32. However, `roi_pool` requires its tensors to have equal precision and throws
But if I cast `box_coord_tensor` to float16, CUDA throws the memory access error below.
Is there anything I could try to do, because so far any attempt results in the error above?