CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx when using deepspeed transformer kernel #294
Comments
If I disable fp16, I get this error after the same python stack trace:
Hi @tyler-romero, could you please try running one of the unit tests that we have for the kernels, such as "DeepSpeed/tests/unit/test_cuda_forward.py"? This file includes several tests for our kernel with different batch sizes, sequence lengths, and hidden dimensions, for both fp16 and fp32. From the result of this test, we can see whether the problem is on the kernel side, since from your log I am seeing that the error comes from a matmul initiated at line 207 of your modeling file. Is this executed after calling the transformer kernel? I wonder if there might be a size mismatch when calling the cuBLAS library. You can also try printing the shapes of the tensors, such as input and self.weight. Best regards,
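For reference, the unit test mentioned above can typically be run with pytest (e.g. `pytest DeepSpeed/tests/unit/test_cuda_forward.py`), and the shape/dtype check can be as simple as the sketch below. The helper name and tensor sizes here are made up for illustration; only the `torch.matmul` call and the print statements are assumed.

```python
import torch

def logged_matmul(lhs: torch.Tensor, rhs: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: print shapes/dtypes right before a matmul to
    catch the kind of size or precision mismatch suspected in this thread."""
    print("lhs:", tuple(lhs.shape), lhs.dtype, lhs.device)
    print("rhs:", tuple(rhs.shape), rhs.dtype, rhs.device)
    return torch.matmul(lhs, rhs)

if torch.cuda.is_available():
    # Shapes roughly mimicking an fp16 transformer activation and weight.
    x = torch.randn(32, 128, 1024, dtype=torch.float16, device="cuda")
    w = torch.randn(1024, 1024, dtype=torch.float16, device="cuda")
    out = logged_matmul(x, w)
    print("out:", tuple(out.shape), out.dtype)
```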
Encountered the same issue after a few thousand iterations:
Hi @yselivonchyk, that's so odd that this happens after many iterations! Best regards,
@RezaYazdaniAminabadi True, the RuntimeError in the stack trace occurs at the matmul. I'm pretty occupied at work right now, but hopefully in the somewhat near future I can try printing those tensor shapes and getting a repro. The modeling file I'm using is a very slightly modified version of the one in the BingBertSquad example.
Hi Tyler, We have seen the same issue with some other benchmarks too! Normally this happens when there is high pressure on the memory and the cuBLAS GeMM crashes with this error: CUBLAS_STATUS_EXECUTION_FAILED=13. I wonder if you can verify this on your end by printing the error number in the same file you've mentioned. Also, can you tell me what memory consumption you see when running the training? If it is close to the total memory of the GPU, there may be a risk of crashing due to memory-allocation issues. Best regards,
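A quick way to quantify the memory pressure asked about here is to query PyTorch's CUDA allocator statistics around the failing step. This is a minimal sketch (device index 0 assumed; the helper name is made up):

```python
import torch

def report_gpu_memory(device: int = 0) -> None:
    """Print current and peak allocated memory versus total device memory."""
    total = torch.cuda.get_device_properties(device).total_memory
    allocated = torch.cuda.memory_allocated(device)
    peak = torch.cuda.max_memory_allocated(device)
    print(f"allocated {allocated / 1e9:.3f} GB / peak {peak / 1e9:.3f} GB "
          f"/ total {total / 1e9:.3f} GB")

# Call this right before and after the forward pass that crashes to see how
# close the run gets to the 16 GB available on a P100.
```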
I've since updated to DeepSpeed v0.3, so now the printout is a bit better. It does seem that error=13. I was using a 24-transformer-block model with a batch size of 32 on each GPU. I cut the number of layers to 6 and the batch size to 16, but I still see the same error. I don't have the exact percentage of memory usage, but the 24-layer, batch-size-32 configuration worked just fine with the PyTorch implementation of a transformer block that exists in bing_bert. I would think that cutting the number of layers by a factor of four and the batch size in half would take away the memory pressure. Is there any other reason why error=13 might be thrown?
Edit:
I'm running this on P100s, so 16 GB of GPU memory each. It seems like only 0.573 GB is being used here.
Hi @tyler-romero, Thanks for trying this again. I see that the dimensions are all okay, and it is very strange that the cuBLAS GeMM library is giving this error. I wonder if this is the only error you are getting, or whether there is anything else hidden in your log. Regarding error 13 for the cuBLAS GeMM, the only explanation I could find is the one below, which is not very helpful!
I have also made a PR to fix a similar issue in some of our examples for BingBertSquad: microsoft/DeepSpeedExamples#58. Do you pass the local_rank argument when running with the transformer kernel? Also, I have seen that this error is sometimes due to having the parameters in FP16 while the input is passed in as FP32. I think the P100 architecture does not support fp16; could you please check this? If everything is still fine with your test environment, I wonder if you have the option to run this test on different GPU hardware, just to rule out a hardware issue. Thanks,
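One way to rule out the FP16-parameter / FP32-input mismatch mentioned above is to compare dtypes right before the layer is called. The guard below is a hypothetical sketch, not part of the BingBertSquad modeling code:

```python
import torch

def match_input_dtype(module: torch.nn.Module, batch: torch.Tensor) -> torch.Tensor:
    """Cast the input to the parameter dtype so an fp32 batch is never fed
    into fp16 weights (one suspected cause of cuBLAS error 13 in this thread)."""
    param_dtype = next(module.parameters()).dtype
    if batch.dtype != param_dtype:
        print(f"dtype mismatch: input {batch.dtype} vs parameters {param_dtype}")
        batch = batch.to(param_dtype)
    return batch
```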
Hi, thanks for the response Reza. I am passing in local_rank to the transformer kernel (I was following the modeling example from bing_bert). Also, I double-checked the docs and can confirm that the P100 does support fp16. I will double-check that the input matches the parameter precision soon, and I will also double-check my Dockerfile to see if things are installed correctly. I am also doing something a bit unusual when launching training. I am a Microsoft employee, so could we discuss this method of launching DeepSpeed offline so I can share my code?
Thanks in advance for checking the configuration and also the data types. Reza
Issue fixed offline. The problem was with the specific GPU architecture: after more testing we noticed it worked fine on V100s, but not on P100s. This PR contains the fix. Thanks Reza and Jeff!
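Since the root cause turned out to be the GPU architecture, a quick way to tell which hardware a job actually landed on is to check the compute capability: V100 reports 7.0, while P100 reports 6.0. A small sketch:

```python
import torch

if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    print(f"GPU: {name}, compute capability {major}.{minor}")
    # The crash in this thread was only reproduced on sm_60 (P100); sm_70 (V100) was fine.
```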
Hi, I'm running into the following error when attempting to train with the deepspeed transformer kernel.
This error occurs during the forward pass of the first training step.
!!!! kernel execution error.
is printed for 80 lines, followed by this traceback:
If I disable the use of the deepspeed transformer kernel, everything works just fine.
I'm using a slightly modified version of the provided dockerfile:
My deepspeed config looks like this:
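The actual config from the report is not shown above; for context, a minimal DeepSpeed config with fp16 enabled looks roughly like the sketch below. The values are illustrative (only the 32-per-GPU batch size comes from elsewhere in the thread), not the reporter's real settings:

```python
import json

# Illustrative DeepSpeed config; the keys are standard DeepSpeed options,
# but the values are guesses for the purpose of the example.
ds_config = {
    "train_micro_batch_size_per_gpu": 32,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 3e-5}},
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```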
This issue seems to indicate it may be a bug in the versions of CUDA and PyTorch being used:
pytorch/pytorch#24018
And this one indicates that it may have to do with fp16 casting:
NVIDIA/apex#580
Any help would be appreciated! Thanks.