overlap rpc op memcpy in distributed training #11221
Conversation
@@ -145,6 +145,7 @@ bool MultiDevSSAGraphBuilder::IsDistTrainOp(

std::unique_ptr<SSAGraph> MultiDevSSAGraphBuilder::Build(
    const ProgramDesc &program) const {
  VLOG(3) << "Building ....";
Can remove this debugging log
Done.
@@ -187,15 +188,53 @@ std::unique_ptr<SSAGraph> MultiDevSSAGraphBuilder::Build(
  };

  bool is_forwarding = true;
  int rpc_op_device_id = 0;
  auto schedule_rpc_op = [&]() -> void {
Can this be changed to use get_appropriate_dev, so that when using the parallel executor strategy "Reduce", the variable to send is the same as the variable reduced on that device? @chengduoZH am I right?
After discussing with @typhoonzero @chengduoZH @panyx0718, we use get_appropriate_dev to schedule the rpc op.
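For reference, a minimal standalone sketch of what a get_appropriate_dev-style picker could look like. The names PickDevice, var_numel, and balance_vars are illustrative, not the PR's exact code; only the "charge the variables' element count to the least-loaded device" behavior is taken from the balance_vars_ fragment quoted further down in this diff.

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch: choose the device with the smallest accumulated element count,
// then charge the new variables' numel to it (mirrors the
// "balance_vars_[dev_id] += numel_sum; return dev_id;" fragment below).
size_t PickDevice(const std::vector<std::string> &var_names,
                  const std::unordered_map<std::string, int64_t> &var_numel,
                  std::vector<int64_t> *balance_vars) {
  int64_t numel_sum = 0;
  for (const auto &name : var_names) {
    auto it = var_numel.find(name);
    if (it != var_numel.end()) numel_sum += it->second;
  }
  size_t dev_id = std::distance(
      balance_vars->begin(),
      std::min_element(balance_vars->begin(), balance_vars->end()));
  (*balance_vars)[dev_id] += numel_sum;
  return dev_id;
}

Because such a picker is stateful, the device a variable lands on depends on the sequence of calls, which is the ordering concern raised later in this thread for reduce and split_op.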
    }
    CreateDistTrainOp(&result, *op, rpc_op_device_id);
  }
  if (op->Type() == "concat") {
else if? Otherwise CreateDistTrainOp will be called again in the else branch below?
Sorry, this should be else if.
  auto got = remote_vars_devices_.find(op->InputArgumentNames()[0]);
  if (got == remote_vars_devices_.end()) {
    schedule_rpc_op();
  } else {
When is the else branch triggered? I guess you want to round-robin the send devices?
I don't remember exactly: in Reduce mode, is each gradient calculated on one device instead of all devices? Should the send device match that device?
DistributedTranspiler splits a parameter into several parameter blocks, but not for every parameter: if a parameter is small enough, we don't split it. So the op pipeline is:
(split_byref)->send->recv->(concat)
For the true branch of the if statement: there is no split_byref operator before send, so we need to schedule send to the right device (see the sketch after this comment).
For the false branch: split_byref has already been scheduled to a device, and send should be scheduled to the same device as split_byref.
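A small self-contained sketch of that rule, with illustrative names rather than the PR's exact code: reuse the device already recorded for the send input when a split_byref op pinned it there, otherwise ask some picker (e.g. a get_appropriate_dev-style balancer) for a device.

#include <functional>
#include <string>
#include <unordered_map>

// Sketch only: decide which device a send op should run on.
int ScheduleSendOp(const std::string &send_input,
                   const std::unordered_map<std::string, int> &var_device,
                   const std::function<int()> &pick_device) {
  auto got = var_device.find(send_input);
  if (got == var_device.end()) {
    return pick_device();  // no split_byref before send: choose a device
  }
  return got->second;      // follow the device split_byref was scheduled on
}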
    rpc_op_device_id = got->second;
  }
  CreateRPCOp(&result, *op, rpc_op_device_id);
} else if (op->Type() == "recv") {
Will only one device perform the broadcast in Reduce mode? So should recv be done on that device before the broadcast? Perhaps take a look at get_appropriate_dev? I'm not quite sure about the details.
I'll take a look at get_appropriate_dev and figure out how it relates to this PR.
platform::dynload::ncclBcast(buffer, numel, data_type, 0,
                             nccl_ctx.comm_, nccl_ctx.stream());

if (builder_.get() != nullptr &&
Not sure this change is needed. It seems BCastParamsToGPUs is only called once at the beginning, to broadcast the parameters to each device. It's not used during training?
It is used during training, at the end of each mini-batch, as in the following code snippet:
Paddle/benchmark/fluid/fluid_benchmark.py, lines 363 to 365 (at 831909c):
if args.update_method == "pserver":
    exe.bcast_params()
if args.use_reader_op:
if (op->Type() == "send_vars") {
  int op_dev_id = GetVarDeviceID(op->InputArgumentNames()[0]);
  if (op_dev_id == -1) {
    op_dev_id = get_appropriate_dev(op->InputArgumentNames());
@Yancey1989 As we discussed, one concern: the order of the get_appropriate_dev calls must be the same for reduce and split_op, otherwise the device id chosen for the variable may differ.
Thanks, done.
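To make the ordering concern concrete with the PickDevice sketch from earlier in this thread (hypothetical variable names): because the picker mutates its balance state, the device assigned to a variable depends on the call order.

// Hypothetical usage of the PickDevice sketch above: w@GRAD would land on a
// different device if the two calls happened in the opposite order, which is
// why reduce and split_op must walk the variables in the same order.
std::vector<int64_t> balance(4, 0);
std::unordered_map<std::string, int64_t> numel = {{"w@GRAD", 1000},
                                                  {"b@GRAD", 10}};
size_t dev_w = PickDevice({"w@GRAD"}, numel, &balance);  // -> device 0
size_t dev_b = PickDevice({"b@GRAD"}, numel, &balance);  // -> device 1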
 private:
  BuildStrategy strategy_;
  mutable std::unordered_map<std::string, VarDesc *> all_vars_;
  mutable std::unordered_map<std::string, int> var_name_on_devices_;
We should not use an unordered_map to record var_name on devices, because the same var_name may appear on different devices.
It may not be a problem: this map does not record all variables; it is only used for the Reduce strategy and for distributed training.
For the Reduce strategy, we schedule the Reduce op on different devices and record the gradient variable name in var_name_on_devices_, so each name appears on only one device.
For distributed training, as with the Reduce strategy, we schedule send_op and recv_op on different devices, so a variable name will not appear on more than one device either. A sketch of this invariant follows below.
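A standalone sketch of that invariant, with a stub class and illustrative names; the real builder keeps the same kind of map and returns -1 for variables that were never pinned, as the GetVarDeviceID calls elsewhere in this diff show.

#include <string>
#include <unordered_map>

// Sketch: each gradient name is recorded once, so one name maps to exactly
// one device; later ops (send, broadcast) only look the name up.
class DevRecord {
 public:
  void Record(const std::string &var_name, int dev_id) {
    var_name_on_devices_.emplace(var_name, dev_id);  // inserted once per name
  }
  // Returns -1 when the variable was never pinned to a device.
  int GetVarDeviceID(const std::string &var_name) const {
    auto got = var_name_on_devices_.find(var_name);
    return got == var_name_on_devices_.end() ? -1 : got->second;
  }

 private:
  std::unordered_map<std::string, int> var_name_on_devices_;
};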
I feel there are some things we could implement more cleanly, but I guess we can do that as a follow-up.
for (auto *var : program.Block(0).AllVars()) {
  all_vars[var->Name()] = var;
  all_vars_.emplace(var->Name(), var);
Does emplace have different semantics from []? If it's not necessary, let's keep it the same.
Not much; emplace can avoid an unnecessary copy, but it makes no difference here.
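For completeness, a minimal standalone example of the semantic difference in general C++ (nothing PaddlePaddle-specific): operator[] default-constructs a missing value and overwrites on assignment, while emplace constructs in place only when the key is absent and is a no-op otherwise. Since each name is inserted only once here, the two behave the same.

#include <cassert>
#include <string>
#include <unordered_map>

int main() {
  std::unordered_map<std::string, int> m;
  m["x"] = 1;          // inserts a default value, then assigns
  m["x"] = 2;          // overwrites: m["x"] == 2
  m.emplace("x", 3);   // key exists, so this is a no-op
  assert(m["x"] == 2);
  m.emplace("y", 4);   // key absent: constructed in place
  assert(m["y"] == 4);
  return 0;
}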
platform::dynload::ncclBcast(buffer, numel, data_type, 0,
                             nccl_ctx.comm_, nccl_ctx.stream());

if (builder_.get() != nullptr && builder_->GetVarDeviceID(var) != -1) {
builder_.get() != nullptr can be simplified to builder_.
platform::dynload::ncclBcast(buffer, numel, data_type, 0,
                             nccl_ctx.comm_, nccl_ctx.stream());

if (builder_.get() != nullptr && builder_->GetVarDeviceID(var) != -1) {
I feel that GetVarDeviceID should probably be a method of the built graph or of the executor; that would avoid making builder_ a private member. But I guess it's OK to leave it as a TODO for now.
  balance_vars_[dev_id] += numel_sum;
  return dev_id;
}

std::unique_ptr<SSAGraph> MultiDevSSAGraphBuilder::Build(
After this change, can Build() only be called once? Do we want to clear balance_vars_, all_vars_, etc. at the beginning of Build()?
In general, let's pass variables to methods instead of making them private class members. When variables are private members, we need to be careful about when to clear them.
It's a good idea. I made them private members just because I want to expose GetVarDeviceID(), and it can also fix #11593. A sketch of the suggested reset is below.
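A hypothetical standalone sketch of the guard being suggested, assuming the bookkeeping stays in mutable members for now; the member names mirror the diff, while the stub class, value types, and reset placement are assumptions.

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch: clear per-build state at the top of Build() so the builder can be
// reused even though the bookkeeping lives in members.
class BuilderStub {
 public:
  void Build() {
    all_vars_.clear();
    var_name_on_devices_.clear();
    balance_vars_.assign(num_places_, 0);
    // ... graph construction would follow ...
  }

 private:
  size_t num_places_ = 4;  // stand-in for places_.size()
  std::unordered_map<std::string, int> all_vars_;
  std::unordered_map<std::string, int> var_name_on_devices_;
  std::vector<int64_t> balance_vars_;
};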
Some accuracy test results:

Pass = 0, Elapsed = 41, Training performance = 146.812973 imgs/s, Train accuracy = 0.026151, Test accuracy = 0.010417
Pass = 1, Elapsed = 39, Training performance = 152.322642 imgs/s, Train accuracy = 0.041118, Test accuracy = 0.007292
Pass = 2, Elapsed = 39, Training performance = 153.547463 imgs/s, Train accuracy = 0.043750, Test accuracy = 0.008333

local=0, batch_size=80, trainers=2, pservers=2:

Pass = 0, Elapsed = 127, Training performance = 47.504918 imgs/s, Train accuracy = 0.023849, Test accuracy = 0.009375
Pass = 1, Elapsed = 121, Training performance = 50.227834 imgs/s, Train accuracy = 0.024013, Test accuracy = 0.010417
Pass = 2, Elapsed = 120, Training performance = 50.489441 imgs/s, Train accuracy = 0.026974, Test accuracy = 0.010417
Pass = 3, Elapsed = 120, Training performance = 50.426222 imgs/s, Train accuracy = 0.028125, Test accuracy = 0.011458
Fixed #11143
After the parallel bcast, this PR improves performance by about 15% on vgg + flowers with 2 trainers + 2 pservers.
overlap memcpy branch
develop branch
The improvement should be even better on resnet, because its parameter size is smaller than vgg's.