Optimize the error messages of paddle CUDA API #23816
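Error messages now all follow a unified format:
ErrorType: key error message.
[Additional hint: XXX] at (file where the error occurred:line number)
[Failing Op (if any)]
So I think the format below could be brought in line with it as well: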
--------------------------------------------
Error Message Summary:
--------------------------------------------
ExternalError: CUDA runtime error(35): CUDA driver version is insufficient for CUDA runtime version.
Recommended Solution: This indicates that the installed NVIDIA CUDA driver is older than the CUDA runtime library. This is not a supported configuration.Users should install an updated NVIDIA display driver to allow the application to run. at (/Paddle/paddle/fluid/pybind/pybind.cc:1243)
change it to:
--------------------------------------------
Error Message Summary:
--------------------------------------------
ExternalError: CUDA runtime error(35): CUDA driver version is insufficient for CUDA runtime version.
[Recommended Solution: This indicates that the installed NVIDIA CUDA driver is older than the CUDA runtime library. This is not a supported configuration.Users should install an updated NVIDIA display driver to allow the application to run.] at (/Paddle/paddle/fluid/pybind/pybind.cc:1243)
Also, the label "Recommended Solution" feels a bit long. Could it simply be "Solution", "Advice", or some other single word?
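The unified layout discussed in this comment (error type and key message on the first line, then a bracketed hint followed by the source location) can be sketched as a small formatting helper. This is only an illustration: `FormatErrorSummary` and its parameters are hypothetical names, not Paddle's actual implementation.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of Paddle): builds a summary of the form
//   <ErrorType>: <key message>
//     [<hint>] at (<file>:<line>)
// The bracketed hint is omitted when `hint` is empty.
std::string FormatErrorSummary(const std::string& type,
                               const std::string& message,
                               const std::string& hint,
                               const std::string& file, int line) {
  std::string out = type + ": " + message + "\n  ";
  if (!hint.empty()) {
    out += "[" + hint + "] ";
  }
  out += "at (" + file + ":" + std::to_string(line) + ")";
  return out;
}
```

Keeping the hint inside brackets makes it easy to tell the optional advice apart from the key message when scanning a log.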
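This actually touches on an interface-boundary question that will have to be resolved sooner or later: the conceptual boundary of the PADDLE_ENFORCE_CUDA_SUCCESS macro has become ambiguous. By design, an error-checking macro should belong unambiguously to one of the following:
However, once the excellent automatic error-filling feature in this PR is merged, PADDLE_ENFORCE_CUDA_SUCCESS guarantees a correct error type and message for cuda errors, while for cudnn, cublas, and curand the developer still has to make sure the error type and message are correct. That is not a design with a clear boundary. The direction of improvement is settled: the boundary of the checking macro must be made clear. There are two directions:
I personally lean toward direction 1, defining a new macro, for the following reasons:
With a new macro, the implementation would look roughly like this, for example (the name could be improved):
To bring cudnn and cublas to the same level as the cuda errors in this PR, i.e. guaranteeing that the error type and message are correct, is it enough to unify the function wrappers and the message wording, without necessarily needing the detailed information from the NVIDIA website? If so, could we go straight to direction 2 and solve this in one pass?
Do you mean first wrapping the other libraries in a simple way, and changing PADDLE_ENFORCE_CUDA_SUCCESS in one go so that it no longer takes an error type and message? That probably cannot make 2.0, because every PADDLE_ENFORCE_CUDA_SUCCESS in Paddle would have to be changed and the CI rules would need updating too. If it ships in 2.1 instead, I think that is OK.
Outcome of the discussion: go with direction 2 and try to make 2.0.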
…blas/NCCL,test=develop
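Actually, build_ex_string does not have to be deleted for now; a new build_nvidia_error_msg function can be created instead. If build_ex_string is deleted, places in Paddle that currently use PADDLE_ENFORCE to check cuda errors may break, and it is not yet known how many such legacy checks remain.
See this PR: #21994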
paddle/fluid/platform/errors.h
@@ -33,6 +33,9 @@ class ErrorSummary {
// Note(chenweihang): Final deprecated constructor
// This constructor is only used to be compatible with
// current existing no error message PADDLE_ENFORCE_*
// Note(zhouwei): PADDLE_ENFORCE_CUDA_SUCCESS error message
// can be get automatically, error message from developer
// is not necessary
This passage can probably be removed now.
Done
Excellent!
LGTM
LGTM
* cherry-pick of DeviceContext Split, test=develop (#23737)
* New feature: thread local allocator, test=develop (#23989)
* add the thread_local_allocator, test=develop
* refactor the thread_local_allocator, test=develop
* provides option setting strategy, test=develop
* add boost dependency to cuda_stream, test=develop
* declare the stream::Priority as enum class, test=develop
* deal with PADDLE_ENFORCE_CUDA_SUCCESS macro in pr #23816
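Current problems
The current error message format for Nvidia-related APIs has two problems:
1. Users have to look the error up themselves on the Nvidia website. That site is hard to reach for users in China, it holds so much content that the relevant entry is hard to find, and most users will not click the link anyway.
2. The various "cudaFunction failed!" messages, which only say that a function call failed, are of little reference value to users and hard to understand. The error messages are therefore unfriendly, and users who hit an Nvidia-related API problem cannot analyze it themselves. This is a significant problem and has produced many issues (more than ten in total).
Upgrade plan
1. PADDLE_ENFORCE_CUDA_SUCCESS has been refactored. Developers simply call PADDLE_ENFORCE_CUDA_SUCCESS(error), where error can be any one of five types: cudaError_t (CUDA API), curandStatus_t (cuRAND API), cudnnStatus_t (cuDNN API), cublasStatus_t (cuBLAS API), or ncclResult_t (NCCL API). This covers 314 Nvidia-related API calls in Paddle. The message is no longer typed in by the developer, because developers cannot give useful information about external errors from Nvidia-related APIs anyway. What they usually write is a call-failure message such as "cudaGetDeviceCount failed in paddle::platform::GetCUDADeviceCountImpl" or "cudaEventRecord raises unexpected exception". That information is already visible in the C++ call stack, so it does not need to be exposed to users in the Error Summary, the most important part of the report, where it gives no real help and only adds to what users must parse.
2. The new error message is filled in automatically by the system according to the error code. It is either crawled from the Nvidia website or obtained automatically through APIs such as ncclGetErrorString. In addition, the final error message format of the five Nvidia APIs has been unified.
Error preview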
Before
1. CUDA API:
--------------------------------------------
Error Message Summary:
--------------------------------------------
Error: cudaGetDeviceCount failed in paddle::platform::GetCUDADeviceCountImpl, error code : 35, Please see detail in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038 at (D:\1.6.2\paddle\paddle\fluid\platform\gpu_info.cc:67)
After
1. CUDA API:
----------------------
Error Message Summary:
----------------------
ExternalError: Cuda error(35), CUDA driver version is insufficient for CUDA runtime version.
[Advise: This indicates that the installed NVIDIA CUDA driver is older than the CUDA runtime library. This is not a supported configuration.Users should install an updated NVIDIA display driver to allow the application to run.] at (/Paddle/paddle/fluid/pybind/pybind.cc:1244)
2. CURAND API:
----------------------
Error Message Summary:
----------------------
ExternalError: Curand error, CURAND_STATUS_OUT_OF_RANGE : unspecified launch failure at (/Paddle/paddle/fluid/pybind/pybind.cc:1247)
3. CUDNN API:
----------------------
Error Message Summary:
----------------------
ExternalError: Cudnn error, CUDNN_STATUS_INTERNAL_ERROR at (/Paddle/paddle/fluid/pybind/pybind.cc:1250)
4. CUBLAS API:
----------------------
Error Message Summary:
----------------------
ExternalError: Cublas error, CUBLAS_STATUS_LICENSE_ERROR at (/Paddle/paddle/fluid/pybind/pybind.cc:1253)
5. NCCL API:
----------------------
Error Message Summary:
----------------------
ExternalError: Nccl error, unhandled system error at (/Paddle/paddle/fluid/pybind/pybind.cc:1256)
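Note: a crawler was added for the first API (CUDA), which has more than 100 error codes, and the detailed information crawled from the Nvidia website is provided with each error. The other four APIs have fewer error codes, so no crawler was introduced and only brief information is provided. Competing frameworks currently provide only the briefest information; whether Paddle should also crawl detailed information for these four can be decided later.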
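One way to picture the crawled data is a table keyed by error code that holds the short message and the longer advice text. The sketch below is an assumption about the shape of such a table, not the PR's actual storage format: `ErrorEntry`, `CudaErrorTable`, and `DescribeCudaError` are hypothetical names, and only the entry for code 35 is taken from the preview above.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical record for one crawled error code.
struct ErrorEntry {
  std::string message;  // short description of the error
  std::string advice;   // longer explanation from the vendor documentation
};

// Hypothetical table; a real one would hold every crawled error code.
inline const std::unordered_map<int, ErrorEntry>& CudaErrorTable() {
  static const std::unordered_map<int, ErrorEntry> table = {
      {35,
       {"CUDA driver version is insufficient for CUDA runtime version.",
        "This indicates that the installed NVIDIA CUDA driver is older than "
        "the CUDA runtime library."}},
  };
  return table;
}

// Builds the key message for an error code, falling back to a generic
// text when the code is missing from the table.
std::string DescribeCudaError(int code) {
  std::string out = "Cuda error(" + std::to_string(code) + ")";
  auto it = CudaErrorTable().find(code);
  if (it == CudaErrorTable().end()) {
    return out + ", unknown error code.";
  }
  out += ", " + it->second.message;
  if (!it->second.advice.empty()) {
    out += "\n  [Advise: " + it->second.advice + "]";
  }
  return out;
}
```

A lookup of this kind keeps the detailed text out of every call site, which is why the other four APIs can start with brief messages and gain detail later without touching callers.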
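Usage example:
The error message is generated automatically and is fully encapsulated; developers only need to pass in the return value of the Nvidia API, so the call is simple: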
PADDLE_ENFORCE_CUDA_SUCCESS(cudaGetDeviceCount(&count));
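To illustrate how a single macro can accept several status types and still produce a type-specific message, here is a minimal sketch based on a traits template. It is not the PR's implementation: `FakeCudaStatus` and `FakeCudnnStatus` are stand-in enums so the example compiles without the CUDA toolkit, and `ExternalApiTraits` and `SKETCH_ENFORCE_EXTERNAL_SUCCESS` are hypothetical names.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Stand-in status types (assumptions) replacing cudaError_t / cudnnStatus_t.
enum class FakeCudaStatus { kSuccess = 0, kInsufficientDriver = 35 };
enum class FakeCudnnStatus { kSuccess = 0, kInternalError = 4 };

// Hypothetical traits: one specialization per supported status type.
template <typename T>
struct ExternalApiTraits;

template <>
struct ExternalApiTraits<FakeCudaStatus> {
  static constexpr FakeCudaStatus kSuccess = FakeCudaStatus::kSuccess;
  static std::string Describe(FakeCudaStatus s) {
    return "Cuda error(" + std::to_string(static_cast<int>(s)) + ")";
  }
};

template <>
struct ExternalApiTraits<FakeCudnnStatus> {
  static constexpr FakeCudnnStatus kSuccess = FakeCudnnStatus::kSuccess;
  static std::string Describe(FakeCudnnStatus s) {
    return "Cudnn error(" + std::to_string(static_cast<int>(s)) + ")";
  }
};

// Evaluates COND once, selects the traits from its type, and throws with
// an automatically built message when the status is not the success value.
#define SKETCH_ENFORCE_EXTERNAL_SUCCESS(COND)                            \
  do {                                                                   \
    auto status_ = (COND);                                               \
    using Traits_ = ExternalApiTraits<decltype(status_)>;                \
    if (status_ != Traits_::kSuccess) {                                  \
      throw std::runtime_error("ExternalError: " +                       \
                               Traits_::Describe(status_) + " at (" +    \
                               __FILE__ + ":" +                          \
                               std::to_string(__LINE__) + ")");          \
    }                                                                    \
  } while (0)
```

Passing a type without a specialization fails at compile time, so the set of supported APIs stays explicit while call sites remain a one-argument macro.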