[Kunlun]fix multi xpu dygraph hang, test=kunlun #32662

Merged 3 commits on Apr 29, 2021

@vslyu vslyu commented Apr 28, 2021

PR types

Bug fixes

PR changes

Others

Describe

Fix a hang when running dygraph training on multiple Kunlun XPU cards.
Reference: PaddleClas PR 690 (PaddlePaddle/PaddleClas#690).

python3.7 -m paddle.distributed.launch --xpus=2,3 --log_dir log tools/train.py -c ./configs/quick_start/ResNet50_vd_finetune_kunlun.yaml
2021-04-29 08:21:19,988 - INFO - epoch:19 , train step:0   , top1: 0.93750, top5: 1.00000, loss: 0.92145, lr: 0.000023, batch_cost: 0.96966 s, reader_cost: 0.23445 s, ips: 33.00113 images/sec, eta: 0:00:14
2021-04-29 08:21:26,911 - INFO - epoch:19 , train step:10  , top1: 0.93750, top5: 1.00000, loss: 1.03136, lr: 0.000003, batch_cost: 0.69550 s, reader_cost: 0.00019 s, ips: 46.01037 images/sec, eta: 0:00:03
2021-04-29 08:21:29,695 - INFO - END epoch:19  train top1: 0.92292, top5: 0.98542, loss: 0.95281,  batch_cost: 0.69007 s, reader_cost: 0.00025 s, batch_cost_sum: 3.45033 s, ips: 46.37236 images/sec.

@paddle-bot-old
Thanks for your contribution!
Please wait for the CI results first. See Paddle CI Manual for details.

@@ -762,7 +762,7 @@ void Reducer::MarkGroupReady(size_t group_index) {
     // TODO(liuyuhui): Add try catch to deal with exception later,
     // otherwise the main thread will continue to run when an exception is
     // thrown in comm_pool_.
-    comm_pool_->enqueue([&] {
+    comm_pool_->enqueue([=, &group] {
Contributor

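Specify each captured variable explicitly, including this and run_order. Also, next_group_ is a member variable of this and needs to be captured by value. With C++14 you can capture it as [next_group = next_group_], or define a local copy first and capture that by value.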
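
The difference the reviewer describes can be shown with a minimal sketch that is independent of Paddle's code. `Scheduler` and its three methods are hypothetical names used only for illustration. Inside a member function, a lambda that captures `this` (which `[&]` and `[=]` also do implicitly) reads the member at the time the task runs, whereas a C++14 init-capture or a captured local copy stores the value at the time the task is created.

```cpp
#include <cstddef>
#include <functional>

// Hypothetical stand-in for a class whose member may change between
// creating a task and running it. Not Paddle's actual Reducer.
struct Scheduler {
  std::size_t next_group_ = 0;

  // Captures `this`: the task reads next_group_ when it runs.
  std::function<std::size_t()> TaskReadingMember() {
    return [this] { return next_group_; };
  }

  // C++14 init-capture: the task stores a copy made at creation time.
  std::function<std::size_t()> TaskWithInitCapture() {
    return [next_group = next_group_] { return next_group; };
  }

  // Pre-C++14 alternative: copy into a local, then capture it by value.
  std::function<std::size_t()> TaskWithLocalCopy() {
    auto next_group = next_group_;
    return [next_group] { return next_group; };
  }
};
```

If `next_group_` is modified after the tasks are created, only the first task observes the new value; the other two keep the value they were created with.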

Contributor Author

done.

@@ -762,10 +762,11 @@ void Reducer::MarkGroupReady(size_t group_index) {
     // TODO(liuyuhui): Add try catch to deal with exception later,
     // otherwise the main thread will continue to run when an exception is
     // thrown in comm_pool_.
-    comm_pool_->enqueue([&] {
+    auto next_group = next_group_;
+    comm_pool_->enqueue([&, run_order, next_group] {
Contributor

this, run_order, next_group, &group
List each variable explicitly; do not just use a bare &.
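
As a rough illustration of the capture list requested here, the sketch below uses a plain deferred task queue in place of the thread pool. `Group`, `MiniReducer`, `pending_`, `log_`, and `RunPending` are hypothetical and greatly simplified; they are assumptions for the example, not Paddle's actual types or the merged code.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins; not Paddle's actual types.
struct Group {
  std::vector<int> values;
  int sum = 0;
};

struct MiniReducer {
  std::size_t next_group_ = 0;
  std::vector<std::function<void()>> pending_;            // deferred tasks
  std::vector<std::pair<std::size_t, std::size_t>> log_;  // (run_order, next_group)

  void MarkGroupReady(Group& group, std::size_t run_order) {
    auto next_group = next_group_;  // snapshot before the task is queued
    // Every captured name is listed: `this` for log_, two values by copy,
    // and the group by reference. The group must outlive the task.
    pending_.push_back([this, run_order, next_group, &group] {
      for (int v : group.values) group.sum += v;
      log_.push_back({run_order, next_group});
    });
    ++next_group_;  // does not affect the copy held by the queued task
  }

  // Runs the queued tasks later, as a pool worker would.
  void RunPending() {
    for (auto& task : pending_) task();
    pending_.clear();
  }
};
```

Listing the captures makes it visible at the call site which values are copied and which objects must stay alive until the task runs, which a bare `&` hides.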

Contributor Author

done.

Contributor

@wangxicoding wangxicoding left a comment

LGTM
