[Kunlun] add gen_bkcl_id_op, support multi XPU cards training using multiprocess #30858
* fix some uts' interfaces * add todos and fix err_messages
fix,test=notest fix,test=notest fix,test=notest fix,test=notest fix
Thanks for your contribution!
Suggested follow-up TODOs @vslyu
1. Unify gen_nccl_id, c_gen_nccl_id, gen_bkcl_id, and c_gen_bkcl_id into a single op named c_gen_comm_id, implemented as a kernel op.
2. Implement c_comm_init_op as a kernel op as well.
@@ -0,0 +1,194 @@
/* Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
2021
Will fix it in the next PR.
@@ -566,12 +574,12 @@ def run_gpu_fleet_api_trainer(self, args):
args.trainer_id = paddle.distributed.get_rank()

# 3. init parallel env
if args.update_method == "nccl2":
if args.update_method == "nccl2" or "bkcl":
This fleet is paddle.fluid.incubate.fleet; it may need a separate test.
The new paddle.distributed.fleet API is imported again inside the run_use_fleet_api_trainer function:
import paddle.distributed.fleet as fleet
import paddle.distributed.fleet.base.role_maker as role_maker
so the unit test ultimately runs with the new paddle.distributed.fleet API.
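A side note on the condition added in the run_gpu_fleet_api_trainer hunk: in Python, `x == "nccl2" or "bkcl"` parses as `(x == "nccl2") or "bkcl"`, and a non-empty string is always truthy, so the branch is taken for every update method. The sketch below contrasts that form with a membership test. The helper names and the "pserver" value are illustrative assumptions, not code from this PR.

```python
def always_true_check(update_method):
    # Mirrors the condition as written in the hunk. The right-hand operand
    # "bkcl" is a non-empty string, so the expression is truthy for any input.
    return bool(update_method == "nccl2" or "bkcl")


def uses_collective_backend(update_method):
    # Hypothetical helper: true only for the two backends named in the hunk.
    return update_method in ("nccl2", "bkcl")


if __name__ == "__main__":
    for method in ("nccl2", "bkcl", "pserver"):
        print(method, always_true_check(method), uses_collective_backend(method))
```

If the intent is to initialize the parallel environment only for NCCL2 and BKCL, the membership test expresses that directly.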
import os
import copy
import sys
sys.path.append("..")
from ..launch_function_helper
Let's keep sys.path.append("..") so this stays consistent with the other XPU unit tests.
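For context on the trade-off discussed here: a relative entry such as ".." in sys.path is resolved against the current working directory at import time, not against the test file's location. A minimal sketch of a working-directory-independent variant follows. The helper name add_parent_dir_to_path is hypothetical and is not used anywhere in this PR.

```python
import os
import sys


def add_parent_dir_to_path(file_path):
    """Append the parent of file_path's directory to sys.path and return it."""
    # abspath removes the dependence on the current working directory.
    parent = os.path.dirname(os.path.dirname(os.path.abspath(file_path)))
    if parent not in sys.path:
        sys.path.append(parent)
    return parent


if __name__ == "__main__":
    # In a test file this would typically be called with __file__.
    print(add_parent_dir_to_path(__file__))
```

This keeps the same effect as sys.path.append("..") when the tests are launched from their own directory, and it also works when they are launched from elsewhere.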
LGTM for PADDLE_ENFORCE, comment can be fixed in next PR
platform::errors::PreconditionNotMet(
"CCommInitOp can run on gpu place only."));
"CCommInitOp can run on gpu or xpu place only."));

auto var = scope.FindVar(Input("X"));
PADDLE_ENFORCE_NOT_NULL(
var, platform::errors::InvalidArgument("Input con not be empty."));
con -> can?
Got it, will fix it in the next PR.
PADDLE_THROW(platform::errors::PreconditionNotMet(
"PaddlePaddle should compile with GPU."));
PADDLE_THROW(platform::errors::PreconditionNotMet(
"PaddlePaddle should compile with GPU."));
should compile -> should be compiled?
Got it, will fix it in the next PR.
bkcl_id, nranks, rank_id, device_id, rid);
#else
PADDLE_THROW(platform::errors::PreconditionNotMet(
"PaddlePaddle should compile with XPU."));
Same as above.
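Got it, will fix it in the next PR.
Got it. We will consider simplifying the code when we optimize XPU dynamic-graph and static-graph performance.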
LGTM for framework.py
PR types
New features
PR changes
Others
Describe