overlap rpc op memcpy in distributed training #11221

Merged: 14 commits merged from overlap_memcpy_with_dist into PaddlePaddle:develop on Jun 20, 2018

Conversation

@Yancey1989 (Contributor) commented Jun 6, 2018

Fixed #11143

After parallel bcast, this PR improves performance by about 15% on VGG + flowers with 2 trainers + 2 pservers.

overlap memcpy branch

Pass = 0, Elapsed = 166, Training performance = 36.934574 imgs/s, Train accuracy = 0.027809, Test accuracy = 0.009874
Pass = 1, Elapsed = 163, Training performance = 37.542919 imgs/s, Train accuracy = 0.038393, Test accuracy = 0.009804

develop branch

Pass = 0, Elapsed = 195, Training performance = 31.470824 imgs/s, Train accuracy = 0.028873, Test accuracy = 0.008731
Pass = 1, Elapsed = 177, Training performance = 34.738449 imgs/s, Train accuracy = 0.044999, Test accuracy = 0.015925

The improvement should be even better on ResNet, because its parameter size is smaller than VGG's.

@@ -145,6 +145,7 @@ bool MultiDevSSAGraphBuilder::IsDistTrainOp(

std::unique_ptr<SSAGraph> MultiDevSSAGraphBuilder::Build(
const ProgramDesc &program) const {
VLOG(3) << "Building ....";
Contributor:

Can remove this debugging log

Contributor Author:

Done.

@@ -187,15 +188,53 @@ std::unique_ptr<SSAGraph> MultiDevSSAGraphBuilder::Build(
};

bool is_forwarding = true;
int rpc_op_device_id = 0;
auto schedule_rpc_op = [&]() -> void {
Contributor:

Can this be changed to use get_appropriate_dev so that when using the parallel executor strategy "Reduce", the variable to send is the same as the variable reduced on that device? @chengduoZH am I right?

Contributor Author:

After discussing with @typhoonzero, @chengduoZH, and @panyx0718, get_appropriate_dev is now used to schedule the rpc op.
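For reference, the balancing idea behind get_appropriate_dev can be sketched as follows. This is a hedged reconstruction based on the "balance_vars_[dev_id] += numel_sum; return dev_id;" fragment further down, not the exact PaddlePaddle implementation, and the helper name is hypothetical.

#include <cstdint>
#include <vector>

// Hypothetical sketch: pick the device with the smallest accumulated element
// count, then charge the new variable's size (numel_sum) to that device.
size_t GetAppropriateDevSketch(std::vector<int64_t> *balance_vars,
                               int64_t numel_sum) {
  size_t dev_id = 0;
  for (size_t i = 1; i < balance_vars->size(); ++i) {
    if ((*balance_vars)[i] < (*balance_vars)[dev_id]) dev_id = i;
  }
  (*balance_vars)[dev_id] += numel_sum;
  return dev_id;
}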

}
CreateDistTrainOp(&result, *op, rpc_op_device_id);
}
if (op->Type() == "concat") {
Contributor:

Should this be else if? Otherwise CreateDistTrainOp will be called again in the else below.

Contributor Author:

Sorry, this should be else if.

auto got = remote_vars_devices_.find(op->InputArgumentNames()[0]);
if (got == remote_vars_devices_.end()) {
schedule_rpc_op();
} else {
Contributor:

When is the else branch triggered? I guess you want to round-robin the send devices?

I don't quite remember: in Reduce mode, are gradients calculated on one device instead of all devices? Should the send device match that device?

Contributor Author:

DistributedTranspiler splits a parameter into several parameter blocks, but not every parameter is split: if a parameter is small enough, we leave it whole. So the op pipeline looks like:

(split_byref)->send->recv->(concat)

For the true branch of the if statement, there is no split_byref operator before send, so we need to schedule send to the right device.
For the false branch, split_byref has already been scheduled to a device, and send should be scheduled to the same device as split_byref.
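Putting the surrounding diff fragments together, the send-scheduling decision described above looks roughly like this (a consolidated sketch of the change, not new behavior):

// No split_byref before this send: choose a device ourselves.
// Otherwise reuse the device that split_byref was already scheduled on.
auto got = remote_vars_devices_.find(op->InputArgumentNames()[0]);
if (got == remote_vars_devices_.end()) {
  schedule_rpc_op();
} else {
  rpc_op_device_id = got->second;
}
CreateRPCOp(&result, *op, rpc_op_device_id);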

rpc_op_device_id = got->second;
}
CreateRPCOp(&result, *op, rpc_op_device_id);
} else if (op->Type() == "recv") {
Contributor:

Will only one device perform the broadcast in Reduce mode? So recv should be done on that device before the broadcast? Perhaps take a look at get_appropriate_dev? I'm not quite sure about the details.

Contributor Author:

I'll take a look at get_appropriate_dev and figure out how it relates to this PR.

platform::dynload::ncclBcast(buffer, numel, data_type, 0,
nccl_ctx.comm_, nccl_ctx.stream());

if (builder_.get() != nullptr &&
Contributor:

Not sure this change is needed. It seems BCastParamsToGPUs is only called once at the beginning to bcast parameters to each device. Is it used during training?

Contributor Author:

It's used during training at the end of each mini-batch, as in the following code snippet:

if args.update_method == "pserver":
    exe.bcast_params()
if args.use_reader_op:

if (op->Type() == "send_vars") {
int op_dev_id = GetVarDeviceID(op->InputArgumentNames()[0]);
if (op_dev_id == -1) {
op_dev_id = get_appropriate_dev(op->InputArgumentNames());
Contributor:

@Yancey1989 as we discussed, one concern: the order of calls to get_appropriate_dev must be the same for reduce and split_op, or the device id chosen for a variable may end up different.

Contributor Author:

Thanks, done.


private:
BuildStrategy strategy_;
mutable std::unordered_map<std::string, VarDesc *> all_vars_;
mutable std::unordered_map<std::string, int> var_name_on_devices_;
Contributor:

We should not use unordered_map to record the var_name on devices, because the same var_name may be on different devices.

Contributor Author:

Maybe not: this map does not record all variables; it is only used for the Reduce strategy and for distributed training.

For the Reduce strategy, we schedule each Reduce op on a different device and record the gradient variable name in var_name_on_devices_, so each gradient appears on only one device.

For distributed training, just as with the Reduce strategy, we schedule send_op and recv_op on different devices, so a variable name will not appear on more than one device either.
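As an illustration of that bookkeeping, a minimal sketch; the recording helper is hypothetical, while GetVarDeviceID returning -1 for unpinned variables matches the op_dev_id == -1 check shown earlier:

mutable std::unordered_map<std::string, int> var_name_on_devices_;

// Hypothetical helper: pin a gradient / sent variable to exactly one device.
void RecordVarDevice(const std::string &name, int dev_id) const {
  var_name_on_devices_.emplace(name, dev_id);  // first writer wins
}

int GetVarDeviceID(const std::string &name) const {
  auto it = var_name_on_devices_.find(name);
  return it == var_name_on_devices_.end() ? -1 : it->second;  // -1: not pinned
}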

panyx0718 previously approved these changes Jun 20, 2018
@panyx0718 (Contributor) left a comment:

I feel there are some things we could implement more cleanly, but I guess we can do them as a follow-up.

for (auto *var : program.Block(0).AllVars()) {
all_vars[var->Name()] = var;
all_vars_.emplace(var->Name(), var);
Contributor:

Does emplace have different semantics from []? If it's not necessary, let's keep it the same.

Contributor Author:

Not much; it just avoids an unnecessary copy, but it makes no difference here.
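For the record, the difference between emplace and operator[] on an unordered_map, as a small standalone example unrelated to the PaddlePaddle code:

#include <cassert>
#include <string>
#include <unordered_map>

int main() {
  std::unordered_map<std::string, int> m;
  m["w"] = 1;         // default-constructs the mapped value, then assigns
  m.emplace("w", 2);  // key already exists, so emplace does nothing
  assert(m["w"] == 1);
  m.emplace("b", 3);  // constructs the value in place, no extra assignment
  assert(m.at("b") == 3);
  return 0;
}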

platform::dynload::ncclBcast(buffer, numel, data_type, 0,
nccl_ctx.comm_, nccl_ctx.stream());

if (builder_.get() != nullptr && builder_->GetVarDeviceID(var) != -1) {
Contributor:

builder_.get() != nullptr -> builder_

platform::dynload::ncclBcast(buffer, numel, data_type, 0,
nccl_ctx.comm_, nccl_ctx.stream());

if (builder_.get() != nullptr && builder_->GetVarDeviceID(var) != -1) {
Contributor:

I feel that GetVarDeviceID should probably be a method of the built graph or the executor; that avoids making builder_ a private member. But I guess it's OK to leave it as a TODO for now.

balance_vars_[dev_id] += numel_sum;
return dev_id;
}

std::unique_ptr<SSAGraph> MultiDevSSAGraphBuilder::Build(
Contributor:

After this change, can Build() only be called once? Do we want to clear balance_vars_, all_vars_, etc. at the beginning of Build()?
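One way to address that concern would be to reset the mutable bookkeeping members at the top of Build(). A rough sketch only, assuming the members are standard containers; the actual types in the builder may need different resetting, e.g. re-filling balance_vars_ with one zero per device:

std::unique_ptr<SSAGraph> MultiDevSSAGraphBuilder::Build(
    const ProgramDesc &program) const {
  // Sketch: clear per-build state so Build() can be called repeatedly.
  all_vars_.clear();
  var_name_on_devices_.clear();
  balance_vars_.clear();  // or reset to zeros, one entry per device
  // ... build the SSA graph as before ...
}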

@panyx0718 (Contributor):

In general, let's pass variables to methods instead of making them class private members. When variables are private members, we need to be careful about when to clear them.

@Yancey1989 (Contributor Author) commented Jun 20, 2018:

In general, let's pass variables to methods instead of making them class private members. When variables are private members, we need to be careful about when to clear them.

It's a good idea. I made them private members only because I want to expose the GetVarDeviceID interface to ParallelExecutor::BCastParamsToGPUs; another approach that doesn't make them private members is to insert a broadcast_op_handle while building the SSA graph.

That would also fix #11593.

@Yancey1989 (Contributor Author):

Some accuracy test results:
local=1, batch_size=160, 4*GPU

Pass = 0, Elapsed = 41, Training performance = 146.812973 imgs/s, Train accuracy = 0.026151, Test accuracy = 0.010417
Pass = 1, Elapsed = 39, Training performance = 152.322642 imgs/s, Train accuracy = 0.041118, Test accuracy = 0.007292
Pass = 2, Elapsed = 39, Training performance = 153.547463 imgs/s, Train accuracy = 0.043750, Test accuracy = 0.008333

local=0,batch_size=80,trainers=2,pservers=2

Pass = 0, Elapsed = 127, Training performance = 47.504918 imgs/s, Train accuracy = 0.023849, Test accuracy = 0.009375
Pass = 1, Elapsed = 121, Training performance = 50.227834 imgs/s, Train accuracy = 0.024013, Test accuracy = 0.010417
Pass = 2, Elapsed = 120, Training performance = 50.489441 imgs/s, Train accuracy = 0.026974, Test accuracy = 0.010417
Pass = 3, Elapsed = 120, Training performance = 50.426222 imgs/s, Train accuracy = 0.028125, Test accuracy = 0.011458

@Yancey1989 Yancey1989 merged commit 9cc1eb4 into PaddlePaddle:develop Jun 20, 2018
@Yancey1989 Yancey1989 deleted the overlap_memcpy_with_dist branch June 20, 2018 16:00