overlap rpc op memcpy in distributed training #11221
Conversation
@@ -145,6 +145,7 @@ bool MultiDevSSAGraphBuilder::IsDistTrainOp(

std::unique_ptr<SSAGraph> MultiDevSSAGraphBuilder::Build(
    const ProgramDesc &program) const {
  VLOG(3) << "Building ....";
Can remove this debugging log
Done.
@@ -187,15 +188,53 @@ std::unique_ptr<SSAGraph> MultiDevSSAGraphBuilder::Build(
  };

  bool is_forwarding = true;
  int rpc_op_device_id = 0;
  auto schedule_rpc_op = [&]() -> void {
Can this be changed to use get_appropriate_dev, so that when using the parallel executor strategy "Reduce", the variable to send is the same as the variable reduced on that device? @chengduoZH am I right?
After discussing with @typhoonzero @chengduoZH @panyx0718, we use get_appropriate_dev to schedule the rpc op.
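For reference, a minimal standalone sketch of what a get_appropriate_dev-style picker could look like. The names PickDevice, var_numel, and balance_vars are illustrative, not the PR's exact code; only the "charge the variables' element count to the least-loaded device" behavior is taken from the balance_vars_ fragment quoted further down in this diff.

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch: choose the device with the smallest accumulated element count,
// then charge the new variables' numel to it (mirrors the
// "balance_vars_[dev_id] += numel_sum; return dev_id;" fragment below).
size_t PickDevice(const std::vector<std::string> &var_names,
                  const std::unordered_map<std::string, int64_t> &var_numel,
                  std::vector<int64_t> *balance_vars) {
  int64_t numel_sum = 0;
  for (const auto &name : var_names) {
    auto it = var_numel.find(name);
    if (it != var_numel.end()) numel_sum += it->second;
  }
  size_t dev_id = std::distance(
      balance_vars->begin(),
      std::min_element(balance_vars->begin(), balance_vars->end()));
  (*balance_vars)[dev_id] += numel_sum;
  return dev_id;
}

Because such a picker is stateful, the device a variable lands on depends on the sequence of calls, which is the ordering concern raised later in this thread for reduce and split_op.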
    }
    CreateDistTrainOp(&result, *op, rpc_op_device_id);
  }
  if (op->Type() == "concat") {
else if? Otherwise CreateDistTrainOp will be called again in the else branch below?
Sorry, this should be else if.
  auto got = remote_vars_devices_.find(op->InputArgumentNames()[0]);
  if (got == remote_vars_devices_.end()) {
    schedule_rpc_op();
  } else {
When is the else branch triggered? I guess you want to round-robin the send devices?
I don't remember exactly: in Reduce mode, is each gradient calculated on one device instead of all devices? Should the send device match that device?
DistributedTranspiler splits a parameter into several parameter blocks, but not for every parameter: if a parameter is small enough, we don't split it. So the op pipeline is:
(split_byref)->send->recv->(concat)
For the true branch of the if statement: there is no split_byref operator before send, so we need to schedule send to the right device (see the sketch after this comment).
For the false branch: split_byref has already been scheduled to a device, and send should be scheduled to the same device as split_byref.
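A small self-contained sketch of that rule, with illustrative names rather than the PR's exact code: reuse the device already recorded for the send input when a split_byref op pinned it there, otherwise ask some picker (e.g. a get_appropriate_dev-style balancer) for a device.

#include <functional>
#include <string>
#include <unordered_map>

// Sketch only: decide which device a send op should run on.
int ScheduleSendOp(const std::string &send_input,
                   const std::unordered_map<std::string, int> &var_device,
                   const std::function<int()> &pick_device) {
  auto got = var_device.find(send_input);
  if (got == var_device.end()) {
    return pick_device();  // no split_byref before send: choose a device
  }
  return got->second;      // follow the device split_byref was scheduled on
}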
    rpc_op_device_id = got->second;
  }
  CreateRPCOp(&result, *op, rpc_op_device_id);
} else if (op->Type() == "recv") {
Will only one device perform the broadcast in Reduce mode? So should recv be done on that device before the broadcast? Perhaps take a look at get_appropriate_dev? I'm not quite sure about the details.
I'll take a look at get_appropriate_dev and figure out how it relates to this PR.
platform::dynload::ncclBcast(buffer, numel, data_type, 0,
                             nccl_ctx.comm_, nccl_ctx.stream());

if (builder_.get() != nullptr &&
Not sure this change is needed. It seems BCastParamsToGPUs is only called once at the beginning, to broadcast the parameters to each device. It's not used during training?
It is used during training, at the end of each mini-batch, as in the following code snippet:
Paddle/benchmark/fluid/fluid_benchmark.py, lines 363 to 365 (at 831909c):
if args.update_method == "pserver":
    exe.bcast_params()
if args.use_reader_op:
if (op->Type() == "send_vars") {
  int op_dev_id = GetVarDeviceID(op->InputArgumentNames()[0]);
  if (op_dev_id == -1) {
    op_dev_id = get_appropriate_dev(op->InputArgumentNames());
@Yancey1989 As we discussed, one concern: the order of the get_appropriate_dev calls must be the same for reduce and split_op, otherwise the device id chosen for the variable may differ.
Thanks, done.
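To make the ordering concern concrete with the PickDevice sketch from earlier in this thread (hypothetical variable names): because the picker mutates its balance state, the device assigned to a variable depends on the call order.

// Hypothetical usage of the PickDevice sketch above: w@GRAD would land on a
// different device if the two calls happened in the opposite order, which is
// why reduce and split_op must walk the variables in the same order.
std::vector<int64_t> balance(4, 0);
std::unordered_map<std::string, int64_t> numel = {{"w@GRAD", 1000},
                                                  {"b@GRAD", 10}};
size_t dev_w = PickDevice({"w@GRAD"}, numel, &balance);  // -> device 0
size_t dev_b = PickDevice({"b@GRAD"}, numel, &balance);  // -> device 1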
 private:
  BuildStrategy strategy_;
  mutable std::unordered_map<std::string, VarDesc *> all_vars_;
  mutable std::unordered_map<std::string, int> var_name_on_devices_;
We should not use an unordered_map to record var_name on devices, because the same var_name may appear on different devices.
It may not be a problem: this map does not record all variables; it is only used for the Reduce strategy and for distributed training.
For the Reduce strategy, we schedule the Reduce op on different devices and record the gradient variable name in var_name_on_devices_, so each name appears on only one device.
For distributed training, as with the Reduce strategy, we schedule send_op and recv_op on different devices, so a variable name will not appear on more than one device either. A sketch of this invariant follows below.
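A standalone sketch of that invariant, with a stub class and illustrative names; the real builder keeps the same kind of map and returns -1 for variables that were never pinned, as the GetVarDeviceID calls elsewhere in this diff show.

#include <string>
#include <unordered_map>

// Sketch: each gradient name is recorded once, so one name maps to exactly
// one device; later ops (send, broadcast) only look the name up.
class DevRecord {
 public:
  void Record(const std::string &var_name, int dev_id) {
    var_name_on_devices_.emplace(var_name, dev_id);  // inserted once per name
  }
  // Returns -1 when the variable was never pinned to a device.
  int GetVarDeviceID(const std::string &var_name) const {
    auto got = var_name_on_devices_.find(var_name);
    return got == var_name_on_devices_.end() ? -1 : got->second;
  }

 private:
  std::unordered_map<std::string, int> var_name_on_devices_;
};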
I feel there are some things we could implement more cleanly, but I guess we can do that as a follow-up.
for (auto *var : program.Block(0).AllVars()) {
  all_vars[var->Name()] = var;
  all_vars_.emplace(var->Name(), var);
Does emplace have different semantics from []? If it's not necessary, let's keep it the same.
Not much; emplace can avoid an unnecessary copy, but it makes no difference here.
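For completeness, a minimal standalone example of the semantic difference in general C++ (nothing PaddlePaddle-specific): operator[] default-constructs a missing value and overwrites on assignment, while emplace constructs in place only when the key is absent and is a no-op otherwise. Since each name is inserted only once here, the two behave the same.

#include <cassert>
#include <string>
#include <unordered_map>

int main() {
  std::unordered_map<std::string, int> m;
  m["x"] = 1;          // inserts a default value, then assigns
  m["x"] = 2;          // overwrites: m["x"] == 2
  m.emplace("x", 3);   // key exists, so this is a no-op
  assert(m["x"] == 2);
  m.emplace("y", 4);   // key absent: constructed in place
  assert(m["y"] == 4);
  return 0;
}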
platform::dynload::ncclBcast(buffer, numel, data_type, 0,
                             nccl_ctx.comm_, nccl_ctx.stream());

if (builder_.get() != nullptr && builder_->GetVarDeviceID(var) != -1) {
builder_.get() != nullptr can be simplified to builder_.
platform::dynload::ncclBcast(buffer, numel, data_type, 0,
                             nccl_ctx.comm_, nccl_ctx.stream());

if (builder_.get() != nullptr && builder_->GetVarDeviceID(var) != -1) {
I feel that GetVarDeviceID should probably be a method of the built graph or of the executor; that would avoid making builder_ a private member. But I guess it's OK to leave it as a TODO for now.
  balance_vars_[dev_id] += numel_sum;
  return dev_id;
}

std::unique_ptr<SSAGraph> MultiDevSSAGraphBuilder::Build(
After this change, can Build() only be called once? Do we want to clear balance_vars_, all_vars_, etc. at the beginning of Build()?
In general, let's pass variables to methods instead of making them private class members. When variables are private members, we need to be careful about when to clear them.
It's a good idea. I made them private members just because I want to expose GetVarDeviceID(), and it can also fix #11593. A sketch of the suggested reset is below.
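A hypothetical standalone sketch of the guard being suggested, assuming the bookkeeping stays in mutable members for now; the member names mirror the diff, while the stub class, value types, and reset placement are assumptions.

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch: clear per-build state at the top of Build() so the builder can be
// reused even though the bookkeeping lives in members.
class BuilderStub {
 public:
  void Build() {
    all_vars_.clear();
    var_name_on_devices_.clear();
    balance_vars_.assign(num_places_, 0);
    // ... graph construction would follow ...
  }

 private:
  size_t num_places_ = 4;  // stand-in for places_.size()
  std::unordered_map<std::string, int> all_vars_;
  std::unordered_map<std::string, int> var_name_on_devices_;
  std::vector<int64_t> balance_vars_;
};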
Some accuracy test results:

Pass = 0, Elapsed = 41, Training performance = 146.812973 imgs/s, Train accuracy = 0.026151, Test accuracy = 0.010417
Pass = 1, Elapsed = 39, Training performance = 152.322642 imgs/s, Train accuracy = 0.041118, Test accuracy = 0.007292
Pass = 2, Elapsed = 39, Training performance = 153.547463 imgs/s, Train accuracy = 0.043750, Test accuracy = 0.008333

local=0, batch_size=80, trainers=2, pservers=2:

Pass = 0, Elapsed = 127, Training performance = 47.504918 imgs/s, Train accuracy = 0.023849, Test accuracy = 0.009375
Pass = 1, Elapsed = 121, Training performance = 50.227834 imgs/s, Train accuracy = 0.024013, Test accuracy = 0.010417
Pass = 2, Elapsed = 120, Training performance = 50.489441 imgs/s, Train accuracy = 0.026974, Test accuracy = 0.010417
Pass = 3, Elapsed = 120, Training performance = 50.426222 imgs/s, Train accuracy = 0.028125, Test accuracy = 0.011458
Fixed #11143
After the parallel bcast, this PR improves performance by about 15% on vgg + flowers with 2 trainers + 2 pservers.
overlap memcpy branch
develop branch
The improvement should be even better on resnet, because its parameter size is smaller than vgg's.