[ROCM] update fluid platform for rocm35 (part1), test=develop #30639

Merged
qili93 merged 2 commits into PaddlePaddle:develop from rocm_platform_part1 on Jan 28, 2021

Conversation

Contributor

@qili93 qili93 commented Jan 21, 2021

PR types

New features

PR changes

Others

Describe

Update paddle fluid platform for rocm35 - part1

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@qili93 qili93 force-pushed the rocm_platform_part1 branch from 50b5659 to 6710055 Compare January 25, 2021 13:27
@qili93 qili93 requested review from chenwhql and removed request for luotao1 January 26, 2021 02:15
chenwhql previously approved these changes Jan 26, 2021
Contributor

@chenwhql chenwhql left a comment

LGTM for PADDLE_ENFORCE change


inline const char* rocblasGetErrorString(rocblas_status stat) {
  switch (stat) {
    case rocblas_status_invalid_handle:
Contributor

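Users have reported that errors from our third-party libraries show only a status, with no specific cause. If they also cannot quickly find the official explanation through a search engine, the experience is poor. @zhouwei25 will be making further enhancements in this area, so keep an eye on it.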

Contributor

@zhwesky2010 zhwesky2010 left a comment

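For now, a simplified version of the error message is fine. Later, if the official site provides error documentation, we may consolidate these AMD error types into cudaerrormessage.pb as well. That file currently contains only the NVIDIA GPU error content.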
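As a rough illustration of the simplified error reporting discussed in this thread, the sketch below maps a status code to a short explanation with a `switch`. It is not the code from this PR: `DemoStatus`, its enumerators, `DemoStatusToString`, and the message wording are all hypothetical stand-ins, chosen so the example compiles without ROCm installed.

```cpp
#include <string>

// Hypothetical stand-in for a library status enum such as rocblas_status.
// The fixed underlying type makes out-of-range values well defined.
enum DemoStatus : int {
  kDemoStatusSuccess = 0,
  kDemoStatusInvalidHandle = 1,
  kDemoStatusInvalidPointer = 2,
  kDemoStatusMemoryError = 3,
};

// Map a status code to a short explanation. The wording is illustrative
// only and is not taken from any official documentation.
inline const char* DemoStatusToString(DemoStatus stat) {
  switch (stat) {
    case kDemoStatusSuccess:
      return "success";
    case kDemoStatusInvalidHandle:
      return "invalid or uninitialized library handle";
    case kDemoStatusInvalidPointer:
      return "invalid pointer argument";
    case kDemoStatusMemoryError:
      return "memory allocation, copy, or deallocation failed";
    default:
      // Keep unknown codes from falling off the end of the function.
      return "unknown status";
  }
}
```

A caller would pair the returned text with the numeric code, so the user sees both the status and a reason they can search for.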

  return webstr.str();
}

inline std::string build_nvidia_error_msg(hipError_t e) {
Contributor

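This is the unified interface for the five types of NVIDIA errors. It maps the information from the official site into an "error code + error message" form and compresses it into a single file, cudaerrormessage.pb. The AMD GPU version could be called build_amd_error_msg. That cudaerrormessage.pb currently has only the NVIDIA part and nothing for AMD, so this lookup logic can be skipped for now, since the lookup is certain to find nothing.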

Contributor Author

Renamed it to build_rocm_error_msg.
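The lookup described in this thread (error code mapped to a longer explanation, with nothing found for AMD codes) can be sketched roughly as follows. This is a minimal sketch under stated assumptions: `DemoErrorTable` is a hypothetical in-memory stand-in for whatever is decoded from cudaerrormessage.pb, and `DemoBuildErrorMsg` is an illustrative name, not Paddle's API.

```cpp
#include <map>
#include <string>

// Hypothetical in-memory error table: error code -> detailed explanation.
// The real data lives in a compressed .pb file; its format is not shown here.
using DemoErrorTable = std::map<int, std::string>;

// Build "error(code), runtime string." and append the detailed explanation
// only when the table has an entry for this code.
inline std::string DemoBuildErrorMsg(const DemoErrorTable& table, int code,
                                     const std::string& runtime_str) {
  std::string msg = "error(" + std::to_string(code) + "), " + runtime_str + ".";
  auto it = table.find(code);
  if (it != table.end()) {
    msg += " " + it->second;
  }
  // On a miss the short message is returned unchanged.
  return msg;
}
```

With an empty table, every call takes the miss path, which is the situation the reviewer describes for AMD codes and the reason the lookup can be skipped for now.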

/***** HIP ERROR *****/
inline bool is_error(hipError_t e) { return e != hipSuccess; }

inline std::string GetCudaErrorWebsite(int32_t cuda_version) {
Contributor

@zhwesky2010 zhwesky2010 Jan 26, 2021

This one can also be left out for now.

Contributor Author

Removed GetCudaErrorWebsite.

  int32_t cuda_version = -1;
#endif
  std::ostringstream sout;
  sout << " Hip error(" << e << "), " << hipGetErrorString(e) << ".";
Contributor

@zhwesky2010 zhwesky2010 Jan 26, 2021

Print just the hipGetErrorString(e) part for now. The logic after it cannot be triggered at present, so it does not need to be written yet.

Contributor Author

Removed the error-string logic that came after hipGetErrorString(e).
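The reduced message builder agreed on in this thread might look roughly like the sketch below: it prints only the code and the runtime's own error string. This is an assumption-laden illustration, not the merged code. `DemoHipError`, `DemoHipGetErrorString`, `DemoIsError`, and `DemoBuildRocmErrorMsg` are hypothetical stand-ins for `hipError_t`, `hipGetErrorString`, `is_error`, and `build_rocm_error_msg`, so the example builds without the HIP headers.

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-in for hipError_t; values are illustrative.
enum DemoHipError : int {
  kDemoHipSuccess = 0,
  kDemoHipErrorOutOfMemory = 2,
};

// Hypothetical stand-in for hipGetErrorString.
inline const char* DemoHipGetErrorString(DemoHipError e) {
  switch (e) {
    case kDemoHipSuccess:
      return "no error";
    case kDemoHipErrorOutOfMemory:
      return "out of memory";
    default:
      return "unknown error";
  }
}

// Mirrors the shape of is_error in the diff: anything but success is an error.
inline bool DemoIsError(DemoHipError e) { return e != kDemoHipSuccess; }

// Emit only "Hip error(code), string." with no website or table lookup.
inline std::string DemoBuildRocmErrorMsg(DemoHipError e) {
  std::ostringstream sout;
  sout << " Hip error(" << static_cast<int>(e) << "), "
       << DemoHipGetErrorString(e) << ".";
  return sout.str();
}
```

Keeping the builder this small means a later table lookup can be appended to the same stream without changing callers.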

Contributor

@zhwesky2010 zhwesky2010 left a comment

LGTM

@qili93 qili93 merged commit f89da4a into PaddlePaddle:develop Jan 28, 2021
@qili93 qili93 deleted the rocm_platform_part1 branch January 28, 2021 12:32
3 participants