
[Kunlun] add gen_bkcl_id_op, support multi XPU cards training using multiprocess #30858

Merged (19 commits) on Feb 5, 2021

Conversation

@vslyu (Contributor) commented Feb 3, 2021

PR types

New features

PR changes

Others

Describe

  • add gen_bkcl_id_op for multi-card Baidu Kunlun training
  • support the fleet API for multi-card Baidu Kunlun training
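For readers unfamiliar with the gen-comm-id family of ops: conceptually, before collective training starts, rank 0 generates a unique communicator id and distributes it over TCP to the other ranks, which then initialize the collective library (NCCL/BKCL) with it. The sketch below is a conceptual illustration only, not the Paddle implementation; the helper names `serve_comm_id` and `fetch_comm_id` are made up, and plain sockets plus a UUID stand in for the real endpoint list and unique id.

```python
# Conceptual sketch only (NOT the Paddle implementation) of what a
# gen-comm-id op such as gen_bkcl_id_op does: rank 0 creates a unique
# communicator id; every other rank fetches it over TCP before the
# collective library (BKCL/NCCL) is initialized with it.
import socket
import threading
import uuid

def serve_comm_id(srv: socket.socket, comm_id: bytes, nranks: int) -> None:
    # Rank 0: hand the id to each of the nranks - 1 peer ranks.
    for _ in range(nranks - 1):
        conn, _addr = srv.accept()
        conn.sendall(comm_id)
        conn.close()
    srv.close()

def fetch_comm_id(port: int, id_len: int) -> bytes:
    # Rank > 0: connect to rank 0 and read the full id.
    with socket.create_connection(("127.0.0.1", port)) as cli:
        buf = b""
        while len(buf) < id_len:
            chunk = cli.recv(id_len - len(buf))
            if not chunk:
                break
            buf += chunk
    return buf

nranks = 3
comm_id = uuid.uuid4().bytes  # stands in for a real BKCL/NCCL unique id

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))  # ephemeral port, as if taken from an endpoint list
port = srv.getsockname()[1]
srv.listen(nranks - 1)

server = threading.Thread(target=serve_comm_id, args=(srv, comm_id, nranks))
server.start()
fetched = [fetch_comm_id(port, len(comm_id)) for _ in range(nranks - 1)]
server.join()
assert fetched == [comm_id, comm_id]  # every rank now holds the same id
```

In the real ops the rendezvous endpoints come from the trainer endpoint list, and the id payload is the opaque unique-id struct produced by the collective library rather than a UUID.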

@paddle-bot-old (bot) commented Feb 3, 2021

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@wangxicoding (Contributor) left a comment

Follow-up TODO suggestions @vslyu
1. Unify gen_nccl_id, c_gen_nccl_id, gen_bkcl_id, and c_gen_bkcl_id into a single c_gen_comm_id, implemented as a kernel op.
2. Implement c_comm_init_op as a kernel op as well.

@@ -0,0 +1,194 @@
/* Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
Contributor: 2021 (the copyright year should be updated to 2021)

Contributor Author: Will fix in the next PR.

@@ -566,12 +574,12 @@ def run_gpu_fleet_api_trainer(self, args):
args.trainer_id = paddle.distributed.get_rank()

# 3. init parallel env
-        if args.update_method == "nccl2":
+        if args.update_method == "nccl2" or "bkcl":
Contributor:

This fleet is paddle.fluid.incubate.fleet; it may need a separate test.

Contributor Author:

The new paddle.distributed.fleet API is imported again inside this run_use_fleet_api_trainer function:

import paddle.distributed.fleet as fleet
import paddle.distributed.fleet.base.role_maker as role_maker

so the unit test ultimately runs with the new paddle.distributed.fleet API.
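As an editorial aside on the diff quoted above: in Python, `args.update_method == "nccl2" or "bkcl"` parses as `(args.update_method == "nccl2") or "bkcl"`, and the non-empty string `"bkcl"` is truthy, so the condition holds for every value of `update_method`; the usual idiom for the intended check is a membership test. A quick illustration (the variable name here is just for the demo):

```python
# The condition parses as (update_method == "nccl2") or "bkcl"; the
# non-empty string "bkcl" is truthy, so the branch is ALWAYS taken.
update_method = "local"
assert (update_method == "nccl2" or "bkcl") == "bkcl"
assert bool(update_method == "nccl2" or "bkcl") is True

# The intended check is a membership test:
assert (update_method in ("nccl2", "bkcl")) is False
assert ("bkcl" in ("nccl2", "bkcl")) is True
```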

import os
import copy
import sys
sys.path.append("..")
Contributor:

from ..launch_function_helper

Contributor Author:

Let's keep sys.path.append("..") for consistency with the other XPU unit tests.

@chenwhql (Contributor) left a comment

LGTM for PADDLE_ENFORCE; the comment can be fixed in the next PR.

platform::errors::PreconditionNotMet(
-        "CCommInitOp can run on gpu place only."));
+        "CCommInitOp can run on gpu or xpu place only."));

auto var = scope.FindVar(Input("X"));
PADDLE_ENFORCE_NOT_NULL(
var, platform::errors::InvalidArgument("Input con not be empty."));
@chenwhql (Contributor) commented Feb 5, 2021:

con -> can?

Contributor Author:

Got it, will fix in the next PR.

PADDLE_THROW(platform::errors::PreconditionNotMet(
"PaddlePaddle should compile with GPU."));
PADDLE_THROW(platform::errors::PreconditionNotMet(
"PaddlePaddle should compile with GPU."));
Contributor:

should compile -> should be compiled?

Contributor Author:

Got it, will fix in the next PR.

bkcl_id, nranks, rank_id, device_id, rid);
#else
PADDLE_THROW(platform::errors::PreconditionNotMet(
"PaddlePaddle should compile with XPU."));
Contributor:

Same as above.

Contributor Author:

Got it, will fix in the next PR.

@vslyu (Contributor Author) commented Feb 5, 2021

Follow-up TODO suggestions @vslyu
1. Unify gen_nccl_id, c_gen_nccl_id, gen_bkcl_id, and c_gen_bkcl_id into a single c_gen_comm_id, implemented as a kernel op.
2. Implement c_comm_init_op as a kernel op as well.

Got it; we'll consider streamlining this code when optimizing XPU dynamic-graph and static-graph performance.

@luotao1 (Contributor) left a comment

LGTM for framework.py

@wangxicoding wangxicoding merged commit 4a8b8b4 into PaddlePaddle:develop Feb 5, 2021