
[Kunlun] add gen_bkcl_id_op, support multi XPU cards training using multiprocess #30858

Merged (19 commits) on Feb 5, 2021

Conversation

@vslyu (Contributor) commented Feb 3, 2021

PR types

New features

PR changes

Others

Describe

  • add gen_bkcl_id_op for multi-card Baidu Kunlun training
  • support the fleet API for multi-card Baidu Kunlun training
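For readers unfamiliar with the gen-comm-id family of ops: conceptually, before collective training starts, rank 0 generates a unique communicator id and distributes it over TCP to the other ranks, which then initialize the collective library (NCCL/BKCL) with it. The sketch below is a conceptual illustration only, not the Paddle implementation; the helper names `serve_comm_id` and `fetch_comm_id` are made up, and plain sockets plus a UUID stand in for the real endpoint list and unique id.

```python
# Conceptual sketch only (NOT the Paddle implementation) of what a
# gen-comm-id op such as gen_bkcl_id_op does: rank 0 creates a unique
# communicator id; every other rank fetches it over TCP before the
# collective library (BKCL/NCCL) is initialized with it.
import socket
import threading
import uuid

def serve_comm_id(srv: socket.socket, comm_id: bytes, nranks: int) -> None:
    # Rank 0: hand the id to each of the nranks - 1 peer ranks.
    for _ in range(nranks - 1):
        conn, _addr = srv.accept()
        conn.sendall(comm_id)
        conn.close()
    srv.close()

def fetch_comm_id(port: int, id_len: int) -> bytes:
    # Rank > 0: connect to rank 0 and read the full id.
    with socket.create_connection(("127.0.0.1", port)) as cli:
        buf = b""
        while len(buf) < id_len:
            chunk = cli.recv(id_len - len(buf))
            if not chunk:
                break
            buf += chunk
    return buf

nranks = 3
comm_id = uuid.uuid4().bytes  # stands in for a real BKCL/NCCL unique id

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))  # ephemeral port, as if taken from an endpoint list
port = srv.getsockname()[1]
srv.listen(nranks - 1)

server = threading.Thread(target=serve_comm_id, args=(srv, comm_id, nranks))
server.start()
fetched = [fetch_comm_id(port, len(comm_id)) for _ in range(nranks - 1)]
server.join()
assert fetched == [comm_id, comm_id]  # every rank now holds the same id
```

In the real ops the rendezvous endpoints come from the trainer endpoint list, and the id payload is the opaque unique-id struct produced by the collective library rather than a UUID.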

@paddle-bot-old (bot) commented Feb 3, 2021

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@wangxicoding (Contributor) left a comment

Follow-up TODO suggestions @vslyu
1. Unify gen_nccl_id, c_gen_nccl_id, gen_bkcl_id, and c_gen_bkcl_id into a single c_gen_comm_id, implemented as a kernel op.
2. Implement c_comm_init_op as a kernel op as well.

@@ -0,0 +1,194 @@
/* Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
Contributor: 2021 (the copyright year should be updated to 2021)

Contributor Author: Will fix in the next PR.

@@ -566,12 +574,12 @@ def run_gpu_fleet_api_trainer(self, args):
args.trainer_id = paddle.distributed.get_rank()

# 3. init parallel env
-        if args.update_method == "nccl2":
+        if args.update_method == "nccl2" or "bkcl":
Contributor:

This fleet is paddle.fluid.incubate.fleet; it may need a separate test.

Contributor Author:

The new paddle.distributed.fleet API is imported again inside this run_use_fleet_api_trainer function:

import paddle.distributed.fleet as fleet
import paddle.distributed.fleet.base.role_maker as role_maker

so the unit test ultimately runs with the new paddle.distributed.fleet API.
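As an editorial aside on the diff quoted above: in Python, `args.update_method == "nccl2" or "bkcl"` parses as `(args.update_method == "nccl2") or "bkcl"`, and the non-empty string `"bkcl"` is truthy, so the condition holds for every value of `update_method`; the usual idiom for the intended check is a membership test. A quick illustration (the variable name here is just for the demo):

```python
# The condition parses as (update_method == "nccl2") or "bkcl"; the
# non-empty string "bkcl" is truthy, so the branch is ALWAYS taken.
update_method = "local"
assert (update_method == "nccl2" or "bkcl") == "bkcl"
assert bool(update_method == "nccl2" or "bkcl") is True

# The intended check is a membership test:
assert (update_method in ("nccl2", "bkcl")) is False
assert ("bkcl" in ("nccl2", "bkcl")) is True
```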

import os
import copy
import sys
sys.path.append("..")
Contributor:

from ..launch_function_helper

Contributor Author:

Let's keep sys.path.append("..") for consistency with the other XPU unit tests.

@chenwhql (Contributor) left a comment

LGTM for PADDLE_ENFORCE; the comment can be fixed in the next PR.

platform::errors::PreconditionNotMet(
-        "CCommInitOp can run on gpu place only."));
+        "CCommInitOp can run on gpu or xpu place only."));

auto var = scope.FindVar(Input("X"));
PADDLE_ENFORCE_NOT_NULL(
var, platform::errors::InvalidArgument("Input con not be empty."));
@chenwhql (Contributor) commented Feb 5, 2021:

con -> can?

Contributor Author:

Got it, will fix in the next PR.

PADDLE_THROW(platform::errors::PreconditionNotMet(
"PaddlePaddle should compile with GPU."));
PADDLE_THROW(platform::errors::PreconditionNotMet(
"PaddlePaddle should compile with GPU."));
Contributor:

should compile -> should be compiled?

Contributor Author:

Got it, will fix in the next PR.

bkcl_id, nranks, rank_id, device_id, rid);
#else
PADDLE_THROW(platform::errors::PreconditionNotMet(
"PaddlePaddle should compile with XPU."));
Contributor:

Same as above.

Contributor Author:

Got it, will fix in the next PR.

@vslyu (Contributor Author) commented Feb 5, 2021

Follow-up TODO suggestions @vslyu
1. Unify gen_nccl_id, c_gen_nccl_id, gen_bkcl_id, and c_gen_bkcl_id into a single c_gen_comm_id, implemented as a kernel op.
2. Implement c_comm_init_op as a kernel op as well.

Got it; we'll consider streamlining this code when optimizing XPU dynamic-graph and static-graph performance.

@luotao1 (Contributor) left a comment

LGTM for framework.py

@wangxicoding wangxicoding merged commit 4a8b8b4 into PaddlePaddle:develop Feb 5, 2021