Optimize the error messages of paddle CUDA API #23816

zhwesky2010 · 2020-04-13T15:19:12Z

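Current problems

Problems with the current error message format for Nvidia-related APIs:
1. Users have to look the error up on the Nvidia website themselves. That site is hard to reach from within China, it has so much content that the relevant entry is hard to find, and most users will not click the link anyway.
2. Messages of the form "cudaFunction failed!" only say that a function call failed. They are of little reference value to users and are hard to understand.

As a result the error messages are unfriendly, and users who hit an Nvidia-related API problem cannot analyze it themselves. This is a significant problem and has produced many issues (more than ten in total).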

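Upgrade plan

1. PADDLE_ENFORCE_CUDA_SUCCESS has been refactored. Developers now simply call PADDLE_ENFORCE_CUDA_SUCCESS(error), where error may be any of five status types: cudaError_t (CUDA API), curandStatus_t (cuRAND API), cudnnStatus_t (cuDNN API), cublasStatus_t (cuBLAS API), or ncclResult_t (NCCL API). This involves 314 Nvidia-related APIs in Paddle. Developers no longer type the error message by hand, because for external errors from Nvidia APIs they cannot provide useful information anyway. What they typically write is: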

cudaGetDeviceCount failed in paddle::platform::GetCUDADeviceCountImpl
cudaEventRecord raises unexpected exception

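and similar "function call failed" messages. That information is already visible in the C++ call stack, so it does not need to be exposed in the Error Summary, the most prominent part of the report. There it gives users no real help and only adds to what they have to read.

2. The new error message is filled in automatically by the system based on the error code. The text is either crawled from the Nvidia website or obtained from APIs such as ncclGetErrorString. The final message format has also been unified across the five Nvidia APIs.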
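
The code-to-message lookup described in item 2 can be sketched roughly as below. This is a minimal illustration and not the PR's implementation: `ErrorEntry`, `ErrorTable`, and `BuildCudaErrorMessage` are hypothetical names, and the table holds a single entry whose text is copied from the preview in this description.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical record: one entry per error code, as a crawler might emit it.
struct ErrorEntry {
  std::string summary;  // short description of the error code
  std::string advice;   // longer explanation or suggested fix
};

// Hypothetical table. Only code 35 is filled in, using text quoted in this PR.
inline const std::unordered_map<int, ErrorEntry>& ErrorTable() {
  static const std::unordered_map<int, ErrorEntry> table = {
      {35,
       {"CUDA driver version is insufficient for CUDA runtime version.",
        "This indicates that the installed NVIDIA CUDA driver is older than "
        "the CUDA runtime library."}},
  };
  return table;
}

// Builds the message from the error code alone; the caller supplies no text.
inline std::string BuildCudaErrorMessage(int code) {
  std::string msg = "Cuda error(" + std::to_string(code) + ")";
  auto it = ErrorTable().find(code);
  if (it == ErrorTable().end()) {
    return msg + ".";  // unknown code: report the number only
  }
  return msg + ", " + it->second.summary + "\n[Advise: " +
         it->second.advice + "]";
}
```

Because the message depends only on the code, every call site reports the same text for the same failure, which is the point of removing hand-written messages.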

Error message preview

Before

1. CUDA API:
--------------------------------------------
Error Message Summary:
--------------------------------------------
Error: cudaGetDeviceCount failed in paddle::platform::GetCUDADeviceCountImpl, error code : 35, Please see detail in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038 at (D:\1.6.2\paddle\paddle\fluid\platform\gpu_info.cc:67)

After

1. CUDA API:
----------------------
Error Message Summary:
----------------------
ExternalError: Cuda error(35), CUDA driver version is insufficient for CUDA runtime version.
[Advise: This indicates that the installed NVIDIA CUDA driver is older than the CUDA runtime library. This is not a supported configuration.Users should install an updated NVIDIA display driver to allow the application to run.] at (/Paddle/paddle/fluid/pybind/pybind.cc:1244)

2. CURAND API:
----------------------
Error Message Summary:
----------------------
ExternalError: Curand error, CURAND_STATUS_OUT_OF_RANGE : unspecified launch failure at (/Paddle/paddle/fluid/pybind/pybind.cc:1247)

3. CUDNN API:
----------------------
Error Message Summary:
----------------------
ExternalError: Cudnn error, CUDNN_STATUS_INTERNAL_ERROR at (/Paddle/paddle/fluid/pybind/pybind.cc:1250)

4. CUBLAS API:
----------------------
Error Message Summary:
----------------------
ExternalError: Cublas error, CUBLAS_STATUS_LICENSE_ERROR at (/Paddle/paddle/fluid/pybind/pybind.cc:1253)

5. NCCL API:
----------------------
Error Message Summary:
----------------------
ExternalError: Nccl error, unhandled system error at (/Paddle/paddle/fluid/pybind/pybind.cc:1256)

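Note: a crawler was added for the first API (CUDA), which has more than 100 error codes, so detailed information crawled from the Nvidia website is provided. The other four APIs have fewer error codes, so no crawler was introduced and only brief information is given. Competing frameworks currently provide only the briefest information. Later we can decide whether Paddle should crawl the detailed information for these APIs as well.

Example usage:

The error message is generated automatically and fully encapsulated. The developer only passes in the return value of the Nvidia API, so the call is simple: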
PADDLE_ENFORCE_CUDA_SUCCESS(cudaGetDeviceCount(&count));

chenwhql

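Error messages now all follow a unified format:

ErrorType: key error summary.
  [Additional hint: XXX] at (file:line)
  [Failing Op (if any)]

So I think the following format could be unified as well: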

--------------------------------------------
Error Message Summary:
--------------------------------------------
ExternalError: CUDA runtime error(35): CUDA driver version is insufficient for CUDA runtime version.

Recommended Solution: This indicates that the installed NVIDIA CUDA driver is older than the CUDA runtime library. This is not a supported configuration.Users should install an updated NVIDIA display driver to allow the application to run. at (/Paddle/paddle/fluid/pybind/pybind.cc:1243)

should become

--------------------------------------------
Error Message Summary:
--------------------------------------------
ExternalError: CUDA runtime error(35): CUDA driver version is insufficient for CUDA runtime version.
  [Recommended Solution: This indicates that the installed NVIDIA CUDA driver is older than the CUDA runtime library. This is not a supported configuration.Users should install an updated NVIDIA display driver to allow the application to run.] at (/Paddle/paddle/fluid/pybind/pybind.cc:1243)

Also, I feel the title "Recommended Solution" is a bit long. Could it be just Solution, Advice, or some other single word?
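
The unified layout discussed in this comment could be produced by a small formatter like the one below. This is only a sketch under assumptions: `FormatErrorSummary` and its parameters are hypothetical and are not taken from Paddle's code base.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper that assembles "ErrorType: summary", an optional
// bracketed hint on an indented second line, and the source location.
inline std::string FormatErrorSummary(const std::string& error_type,
                                      const std::string& summary,
                                      const std::string& hint,
                                      const std::string& file, int line) {
  std::string out = error_type + ": " + summary;
  if (!hint.empty()) {
    out += "\n  [" + hint + "]";  // hint is bracketed and indented
  }
  out += " at (" + file + ":" + std::to_string(line) + ")";
  return out;
}
```

Keeping the hint inside brackets on its own indented line means the location suffix always closes the message, whether or not a hint is present.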

chenwhql · 2020-04-15T13:33:50Z

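This touches on an interface-boundary problem that has to be solved sooner or later: the conceptual boundary of the PADDLE_ENFORCE_CUDA_SUCCESS macro has become ambiguous. By design, an error-checking macro should belong unambiguously to one of the following kinds:

- Macros where the developer must ensure that the error type and error message are accurate, for example PADDLE_THROW and PADDLE_ENFORCE_EQ/NOT_NULL.
- Macros where the developer does not need to care about the error type or message. The macro guarantees both are correct and the developer only fills in the blanks as required, for example OP_INOUT_CHECK and GET_DATA_SAFELY.

However, once the excellent automatic error-filling feature in this PR is integrated, PADDLE_ENFORCE_CUDA_SUCCESS guarantees the error type and message for cuda errors, but for cudnn, cublas, and curand the developer still has to ensure they are correct. That is not a design with a clear boundary.

The direction of the improvement is settled: the boundary of the checking macro must be made clear. There are two options:

1. Add a new checking macro that wraps this automatic cuda error-filling logic. PADDLE_ENFORCE_CUDA_SUCCESS keeps its original boundary, meaning the developer is responsible for a compliant error, and can eventually be removed.
2. Support automatic filling of compliant error types and messages for cudnn, cublas, and curand as well, so that PADDLE_ENFORCE_CUDA_SUCCESS becomes a macro whose error type and content the developer never has to think about.

I personally lean toward option 1, defining a new macro, for the following reasons:

Defining a new macro does introduce a new check, but such a macro is extremely easy to use, so rolling it out costs little, and there are no compatibility problems along the way.
Problems with option 2:
- Currently only cuda is supported; cudnn, cublas, and curand are still pending. For some period the meaning of this macro will therefore be unclear to developers. If developers get confused when using it, we are likely to be criticized. And if a developer writes code the way this PR does, which is compliant, but gets blocked by CI, we will be criticized again.
- Once everything is supported, we still have to explain to developers that the usage of this macro has changed, which is also a hassle. It would be better to introduce a new macro directly.

With a new macro, the implementation would look roughly like this (the names could be improved):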

// To be implemented
inline std::string get_nvida_error_msg(cudnnStatus_t e) {
  return GetCudnnErrorMessage(e);
}

// To be implemented
inline std::string get_nvida_error_msg(curandStatus_t e) {
  return GetCurandErrorMessage(e);
}

// To be implemented
inline std::string get_nvida_error_msg(cublasStatus_t e) {
  return GetCublasErrorMessage(e);
}

// Already implemented
inline std::string get_nvida_error_msg(cudaError_t e) {
  return GetCudaErrorMessage(e);
}

#ifdef PADDLE_WITH_CUDA
#define CUDA_SUCCESS_CHECK(COND)                                             \
  do {                                                                       \
    auto __cond__ = (COND);                                                  \
    using __CUDA_STATUS_TYPE__ = decltype(__cond__);                         \
    constexpr auto __success_type__ =                                        \
        ::paddle::platform::details::CudaStatusType<                         \
            __CUDA_STATUS_TYPE__>::kSuccess;                                 \
    if (UNLIKELY(__cond__ != __success_type__)) {                            \
      try {                                                                  \
        ::paddle::platform::throw_on_error(                                  \
            __cond__, ::paddle::platform::errors::External(                  \
                          ::paddle::platform::get_nvida_error_msg(__cond__)) \
                          .ToString());                                      \
      } catch (...) {                                                        \
        HANDLE_THE_ERROR                                                     \
        throw ::paddle::platform::EnforceNotMet(std::current_exception(),    \
                                                __FILE__, __LINE__);         \
        END_HANDLE_THE_ERROR                                                 \
      }                                                                      \
    }                                                                        \
  } while (0)
#endif  // PADDLE_WITH_CUDA
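
The same idea can be tried without the CUDA toolkit. The sketch below is an assumption-laden stand-in, not Paddle code: the `Mock*` enums, `StatusTraits`, `ErrorMessage`, `MOCK_ENFORCE_SUCCESS`, and `Describe` are all hypothetical names, and a plain `std::runtime_error` replaces Paddle's exception types. It shows the two pieces the macro above relies on: a trait that maps each status type to its success value, and overloads that pick the message builder from the status type.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Stand-in status enums so the sketch compiles without the CUDA toolkit.
enum MockCudaError { kMockCudaSuccess = 0, kMockCudaInsufficientDriver = 35 };
enum MockCudnnStatus { kMockCudnnSuccess = 0, kMockCudnnFailure = 1 };

// Trait mapping each status type to its "success" value.
template <typename T>
struct StatusTraits;
template <>
struct StatusTraits<MockCudaError> {
  static constexpr MockCudaError kSuccess = kMockCudaSuccess;
};
template <>
struct StatusTraits<MockCudnnStatus> {
  static constexpr MockCudnnStatus kSuccess = kMockCudnnSuccess;
};

// Overloads select the message builder from the status type alone.
inline std::string ErrorMessage(MockCudaError e) {
  return "Cuda error(" + std::to_string(static_cast<int>(e)) + ")";
}
inline std::string ErrorMessage(MockCudnnStatus e) {
  return "Cudnn error(" + std::to_string(static_cast<int>(e)) + ")";
}

// The caller passes only the returned status; type and text are derived.
#define MOCK_ENFORCE_SUCCESS(COND)                                \
  do {                                                            \
    auto status_ = (COND);                                        \
    if (status_ != StatusTraits<decltype(status_)>::kSuccess) {   \
      throw std::runtime_error(ErrorMessage(status_));            \
    }                                                             \
  } while (0)

// Small driver: returns "ok" on success, otherwise the generated message.
template <typename T>
std::string Describe(T status) {
  try {
    MOCK_ENFORCE_SUCCESS(status);
    return "ok";
  } catch (const std::runtime_error& ex) {
    return ex.what();
  }
}
```

Adding support for another library then means adding one trait specialization and one `ErrorMessage` overload, with no change at the call sites.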

luotao1 · 2020-04-15T14:30:00Z

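"Currently only cuda is supported; cudnn, cublas, and curand are still pending"

For cudnn and cublas to reach the same level as the cuda errors in this PR, that is, with the error type and message guaranteed correct, is it enough to unify the function wrappers and the message wording, without necessarily needing the detailed information from the NVIDIA website?

If so, could we solve this with option 2 in one go?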

chenwhql · 2020-04-15T14:35:13Z

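Do you mean wrapping the other libraries in a simple way first, and converting PADDLE_ENFORCE_CUDA_SUCCESS in one go into a macro that takes no error type or message? That probably cannot make 2.0, because every PADDLE_ENFORCE_CUDA_SUCCESS in paddle would have to be changed, and the CI rules would need updating too.

If it ships in 2.1 instead, I think that is fine too.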

zhwesky2010 · 2020-04-16T03:30:27Z

Outcome of the discussion: implement option 2 and try to make it into 2.0.

chenwhql

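Actually, build_ex_string could be kept for now and a new build_nvidia_error_msg function created instead. If build_ex_string is deleted, existing code in paddle that uses PADDLE_ENFORCE to check cuda errors may break, and we do not currently know how many such legacy checks remain in paddle.

See this PR: #21994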

chenwhql · 2020-04-17T04:05:23Z

paddle/fluid/platform/errors.h

@@ -33,6 +33,9 @@ class ErrorSummary {
  // Note(chenweihang): Final deprecated constructor
  //   This constructor is only used to be compatible with
  //   current existing no error message PADDLE_ENFORCE_*
+  // Note(zhouwei): PADDLE_ENFORCE_CUDA_SUCCESS error message
+  //   can be get automatically, error message from developer
+  //   is not necessary


This part can probably be removed now.

chenwhql

Excellent!

liupluswei

LGTM

raindrops2sea

LGTM

* cherry-pick of DeviceContext Split, test=develop (#23737)
* New feature: thread local allocator, test=develop (#23989)
* add the thread_local_allocator, test=develop
* refactor the thread_local_allocator, test=develop
* provides option setting strategy, test=develop
* add boost dependency to cuda_stream, test=develop
* declare the stream::Priority as enum class, test=develop
* deal with PADDLE_ENFORCE_CUDA_SUCCESS macro in pr #23816

zhwesky2010 force-pushed the cuda_error1 branch 4 times, most recently from 070c6ad to f786612 Compare April 14, 2020 07:10

zhwesky2010 mentioned this pull request Apr 14, 2020

Optimize the error messages of paddle CUDA API #23844

Closed

zhwesky2010 force-pushed the cuda_error1 branch 2 times, most recently from 507cb31 to 7cff8fc Compare April 14, 2020 09:11

zhwesky2010 mentioned this pull request Apr 14, 2020

[cherry-pick2.0]Optimize the error messages of paddle CUDA API #23849

Merged

Optimize the error messages of paddle CUDA API, test=develop

aa4ba28

zhwesky2010 force-pushed the cuda_error1 branch from 7cff8fc to aa4ba28 Compare April 15, 2020 07:47

chenwhql reviewed Apr 15, 2020

fix the error messages of paddle CUDA API, test=develop

cd0e7ba

zhwesky2010 closed this Apr 15, 2020

zhwesky2010 reopened this Apr 15, 2020

Refactoring PADDLE_ENFORCE_CUDA_SUCCESS, and apply to curand/cudnn/cu…

69c4796

…blas/NCCL,test=develop

zhwesky2010 force-pushed the cuda_error1 branch from 5a01350 to 69c4796 Compare April 16, 2020 13:39

merge develop

5f39488

zhwesky2010 force-pushed the cuda_error1 branch from f09b0f7 to 5f39488 Compare April 16, 2020 14:55

chenwhql reviewed Apr 17, 2020

chenwhql previously approved these changes Apr 17, 2020

liupluswei previously approved these changes Apr 17, 2020

zhwesky2010 dismissed stale reviews from liupluswei and chenwhql via 5789104 April 17, 2020 14:44

zhwesky2010 force-pushed the cuda_error1 branch from 5789104 to 3c2889a Compare April 17, 2020 15:33

remove build_ex_string,test=develop

8641817

zhwesky2010 force-pushed the cuda_error1 branch from 3c2889a to 81abef7 Compare April 18, 2020 15:08

zhwesky2010 added 2 commits April 18, 2020 15:38

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

05d718d

… cuda_error

merge conflict,test=develop

adfc8ec

zhwesky2010 force-pushed the cuda_error1 branch from 81abef7 to adfc8ec Compare April 18, 2020 15:44

chenwhql approved these changes Apr 19, 2020

liupluswei approved these changes Apr 20, 2020

raindrops2sea approved these changes Apr 20, 2020

liupluswei merged commit 7817003 into PaddlePaddle:develop Apr 20, 2020

This was referenced Apr 20, 2020

fix conv_fusion_op conflict,test=develop #24020

Merged

Revert "Optimize the error messages of paddle CUDA API" #24058

Closed

zhwesky2010 added a commit to zhwesky2010/Paddle that referenced this pull request Apr 21, 2020

Revert 'Optimize the error messages of paddle CUDA API (PaddlePaddle#…

6f59331

…23816)',test=develop

Shixiaowei02 added a commit to Shixiaowei02/Paddle that referenced this pull request Apr 22, 2020

deal with PADDLE_ENFORCE_CUDA_SUCCESS macro in pr PaddlePaddle#23816

bb6690a
