
Include cublas error details when getting cublas handle fails #3695

Merged · 7 commits into deepspeedai:master · Jun 13, 2023

Conversation

jli (Contributor) commented on Jun 6, 2023

I've been getting hard-to-debug errors in some DeepSpeed runs. During initialization, one of the worker processes raises RuntimeError: Fail to create cublas handle. with no further details, which feels pretty mysterious.

This change adds details of the failure status to the error message by using cublasGetStatusName (https://docs.nvidia.com/cuda/cublas/#cublasgetstatusname) and cublasGetStatusString (https://docs.nvidia.com/cuda/cublas/#cublasgetstatusstring).


Original error message (using DeepSpeed 0.9.2): RuntimeError: Fail to create cublas handle.

New error message with this change: RuntimeError: Failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED the library was not initialized

This is still not a great error message, but it yields better search results (most results suggest it's due to running out of GPU memory; bizarrely, some people also report that removing ~/.nv fixes it...).
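
For reference, a minimal sketch (not the exact context.h code) of how the handle-creation check can surface this information, assuming cuBLAS >= 11.4.2 where both helper functions exist:

```cpp
#include <cublas_v2.h>

#include <stdexcept>
#include <string>

// Hypothetical helper, not the actual DeepSpeed code: create a cuBLAS handle
// and, on failure, report both the status name and its human-readable
// description via cublasGetStatusName/cublasGetStatusString.
cublasHandle_t create_cublas_handle()
{
    cublasHandle_t handle;
    cublasStatus_t status = cublasCreate(&handle);
    if (status != CUBLAS_STATUS_SUCCESS) {
        // Produces e.g. "Failed to create cublas handle:
        // CUBLAS_STATUS_NOT_INITIALIZED the library was not initialized"
        throw std::runtime_error(std::string("Failed to create cublas handle: ") +
                                 cublasGetStatusName(status) + " " +
                                 cublasGetStatusString(status));
    }
    return handle;
}
```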

jli (Contributor, Author) commented on Jun 6, 2023

@microsoft-github-policy-service agree

loadams changed the title from "include cublas error details when getting cublas handle fails" to "Include cublas error details when getting cublas handle fails" on Jun 7, 2023
jli (Contributor, Author) commented on Jun 7, 2023

CI checks are failing because the CI environment uses a version of CUDA/cuBLAS that predates these functions. cublasGetStatusName and cublasGetStatusString were added in CUDA 11.4.2 (released in late 2021, I believe).

@loadams It seems like DeepSpeed doesn't specify any minimum version of CUDA, so I'm guessing you'd rather not include this change as-is. If so, maybe we could instead include just the raw enum number, which the user could look up manually?

Details:

Example error:

/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/deepspeed/ops/csrc/includes/context.h(56): error: identifier "cublasGetStatusName" is undefined

According to the ds_report output, the CI checks use these versions:

  • nv-accelerate-v100: torch cuda version 11.7, nvcc 11.1
  • nv-torch19-p40: torch cuda version 11.1, nvcc 11.1
  • nv-torch19-v100: torch cuda version 11.1, nvcc 11.1

I'm not sure why nv-accelerate-v100 still failed with CUDA 11.7, but maybe it's because it's using nvcc 11.1.

(the nv-inference failure seems unrelated? TypeError: can't assign a NoneType to a torch.cuda.HalfTensor)

loadams (Collaborator) commented on Jun 7, 2023

Thanks for the info @jli.

cc @mrwyattii and @jeffra, FYI. At least until we have a minimum CUDA version, I think it would make sense to at least print the raw enum number to add more debug info for users.
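
A minimal sketch of the fallback being discussed, under the assumption that a compile-time version check (here CUDA_VERSION from cuda.h, with a conservative 11.5 threshold since the 11.4.x point release isn't distinguishable via that macro) is an acceptable way to gate the newer cuBLAS helpers; on older toolkits only the raw enum value would be printed:

```cpp
#include <cublas_v2.h>
#include <cuda.h>  // CUDA_VERSION

#include <string>

// Illustrative only, not the merged code: fall back to the raw status code on
// toolkits that predate cublasGetStatusName/cublasGetStatusString. The guard
// below is an assumption and may need adjusting for a real build.
static std::string cublas_status_to_string(cublasStatus_t status)
{
#if defined(CUDA_VERSION) && (CUDA_VERSION >= 11050)  // conservative: CUDA 11.5+
    return std::string(cublasGetStatusName(status)) + " " + cublasGetStatusString(status);
#else
    // Older CUDA: only the raw enum value is available for users to look up.
    return "cublas error code " + std::to_string(static_cast<int>(status));
#endif
}
```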

loadams enabled auto-merge (squash) on Jun 13, 2023, 18:32
loadams merged commit 46bb08c into deepspeedai:master on Jun 13, 2023