[Kunlun]fix multi xpu dygraph hang, test=kunlun #32662
Thanks for your contribution!
paddle/fluid/imperative/reducer.cc
Outdated
@@ -762,7 +762,7 @@ void Reducer::MarkGroupReady(size_t group_index) {
 // TODO(liuyuhui): Add try catch to deal with exception later,
 // otherwise the main thread will continue to run when an exception is
 // thrown in comm_pool_.
-comm_pool_->enqueue([&] {
+comm_pool_->enqueue([=, &group] {
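Specify each captured variable one by one, including this and run_order. Also, next_group_ is a member variable of this and needs to be captured by value: with C++14 you can write [next_group = next_group_], or define a local copy first and capture that by value.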
done.
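For readers following the capture discussion: inside a member function, both [&] and [=] reach next_group_ through this, so the task sees whatever value the member holds when it finally runs. A minimal sketch of the difference, assuming a hypothetical FakeReducer with a plain vector standing in for comm_pool_ (neither is Paddle code):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for Reducer; only the capture behaviour is modelled.
struct FakeReducer {
  std::size_t next_group_ = 0;
  // Plays the role of comm_pool_: tasks are stored now and run later.
  std::vector<std::function<std::size_t()>> pending_;

  void MarkGroupReady() {
    // Captures `this`: next_group_ is read when the task runs, not here.
    pending_.push_back([this] { return next_group_; });
    // C++14 init-capture: copies the current value into the closure.
    pending_.push_back([next_group = next_group_] { return next_group; });
    ++next_group_;  // the caller moves on before the tasks run
  }
};

// Runs the queued tasks after the member has changed and
// returns the value each task observed.
std::vector<std::size_t> RunDeferred() {
  FakeReducer reducer;
  reducer.MarkGroupReady();
  std::vector<std::size_t> seen;
  for (auto& task : reducer.pending_) seen.push_back(task());
  return seen;
}
```

In this sketch the first task observes the incremented member while the second still holds the value from enqueue time, which is the behaviour the by-value capture is meant to guarantee.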
paddle/fluid/imperative/reducer.cc
Outdated
@@ -762,10 +762,11 @@ void Reducer::MarkGroupReady(size_t group_index) {
 // TODO(liuyuhui): Add try catch to deal with exception later,
 // otherwise the main thread will continue to run when an exception is
 // thrown in comm_pool_.
-comm_pool_->enqueue([&] {
+auto next_group = next_group_;
+comm_pool_->enqueue([&, run_order, next_group] {
this, run_order, next_group, &group
Name each variable explicitly; don't use a bare &.
done.
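The capture list requested above could look roughly like the following. This is a sketch only: MiniReducer, Group, the std::queue used in place of comm_pool_, and the arithmetic inside the task are all hypothetical, chosen so the result of each capture is visible.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical group record; the real Group type is not modelled.
struct Group {
  int reduced = 0;
};

class MiniReducer {
 public:
  explicit MiniReducer(std::size_t n) : groups_(n) {}

  void MarkGroupReady(std::size_t group_index) {
    Group& group = groups_[group_index];  // owned by *this, outlives the task
    const int run_order = static_cast<int>(group_index);  // local: by value
    auto next_group = next_group_;  // snapshot of the member: by value
    // Every capture is named; nothing is pulled in implicitly.
    tasks_.push([this, run_order, next_group, &group] {
      group.reduced = 100 + run_order * 10 + static_cast<int>(next_group);
      ++completed_;
    });
    ++next_group_;
  }

  // Runs the queued tasks in order, after MarkGroupReady has returned.
  void Drain() {
    while (!tasks_.empty()) {
      tasks_.front()();
      tasks_.pop();
    }
  }

  int reduced(std::size_t i) const { return groups_[i].reduced; }
  int completed() const { return completed_; }

 private:
  std::vector<Group> groups_;
  std::queue<std::function<void()>> tasks_;  // stands in for comm_pool_
  std::size_t next_group_ = 0;
  int completed_ = 0;
};
```

Locals such as run_order are gone by the time the task runs, so they are copied; group is taken by reference only because the object it refers to is owned by the reducer and stays alive.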
LGTM
PR types
Bug fixes
PR changes
Others
Describe
Fix a hang when running dygraph on multiple Kunlun XPU cards.
Refer to PaddleClas PR 690: PaddlePaddle/PaddleClas#690