Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cpp parallel executor #9080

Merged
merged 163 commits into from
Mar 30, 2018
Merged

Conversation

reyoung
Copy link
Collaborator

@reyoung reyoung commented Mar 14, 2018

I just add a dependency engine to parse the dependencies of operators. There are still a lot of jobs need to be done.

  1. Complete Broadcast parameters.
  2. Use thread pool to invoke operator parallelly.
  3. Complete NCCL AllReduce OpHandle in this implementation.
  4. Add fetch methods.

I just use VarHandle and OpHandle to parse Program as a SSA form graph. A variable is assigned by only one OpHandle. When all inputs of OpHandle is ready, the OpHandle can be run.

The speed of ResNeXt152 is

Number of GPUs 1 2 3 4
Image/Sec 18.639 27.8863 39.3787 52.9688
Speed Up N/A 1.4961264 2.11270454 2.84182628

opt = fluid.optimizer.SGDOptimizer()
opt.minimize(avg_cost)

# change Executor -> ParallelExecutor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can let the user still use the Executor interface, and add an optional argument "gpu_list", and underlying if there are multiple GPUs available (either len(gpu_list) > 0, or gpu_list == None and multiple GPUs initialized), create and return the parallel executor instance.

// e.g. sgd should wait for allreduce to be finished
CallBack->BeforeOp(op);

op->Run(*local_scope, place_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my understanding, the reason we need callback is ParallelExecutor will call Executor::Run, but need to be notified before and after each OP::Run. Do we even need the Executor implementation anymore? Maybe we can consolidate them into a single executor, so that we don't need the callback anymore.

And it will be easier for the Python side, Python always create the same executor.

std::vector<OpHandle *> to_run;
for (auto *var : to_remove) {
for (auto *op : var->pending_ops_) {
if (var->name_ == "mean_0.tmp_0@GRAD") {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the purpose of this special case?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just debug code... Sorry

struct OpHandle {
std::vector<VarHandle *> inputs_;
std::vector<VarHandle *> outputs_;
platform::DeviceContext *dev_ctx_;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add framework::Scope* scope_?

}

std::vector<LoDTensor> ParallelExecutor::Run(
const std::vector<std::string> &fetch_tensors) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instantiate Variables here?

Scope* local_scope = scope;
if (create_vars) {
if (create_local_scope) {
local_scope = &scope->NewScope();
for (auto& var : block.AllVars()) {
if (var->Name() == framework::kEmptyVarName) {
continue;
}
if (var->Persistable()) {
auto* ptr = scope->Var(var->Name());
CreateTensor(ptr, var->GetType());
VLOG(3) << "Create Variable " << var->Name()
<< " global, which pointer is " << ptr;
} else {
auto* ptr = local_scope->Var(var->Name());
CreateTensor(ptr, var->GetType());
VLOG(3) << "Create Variable " << var->Name()
<< " locally, which pointer is " << ptr;
}
}
} else {
for (auto& var : block.AllVars()) {
auto* ptr = local_scope->Var(var->Name());
CreateTensor(ptr, var->GetType());
VLOG(3) << "Create variable " << var->Name() << ", which pointer is "
<< ptr;
}
} // if (create_local_scope)
} // if (create_vars)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. We need to instantiate variables here. We might extract this routine to a global function.

}
};

member_->pool_.Run(op_run);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we should add a callback after we push operator run job to memory pool. In this callback, we change pending_var state.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pending_var state has been changed during L404-L406

@chengduoZH chengduoZH self-requested a review March 15, 2018 08:48
member_->local_scopes_.size() != 1) { // Is CUDA
BuildNCCLCommunicator();
BCastParamsToGPUs(startup_program);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do not initialize parameters on the respective devices?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since randomize result might not same when seed = 0

void RunOp(std::unordered_map<VarHandleBase*, bool>& pending_vars,
OpHandle* op) const;

void PolishGraphToSupportDataHarzaeds() const;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataHarzaeds --> DataHazards

result.vars_.resize(places_.size());

bool is_forwarding = true;
for (auto *op : program.Block(0).AllOps()) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Since the operators in a sub-block will be executed by a control flow operator, e.g., While. The behaviour between control flow operators and computational operators should be same.

Copy link
Contributor

@panyx0718 panyx0718 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have finished my review. @chengduoZH Let's verify the correctness and speed of transfomer and resnext. If they are OK, let's merge it soon so that everyone can start improving it

namespace details {

struct FetchOpHandle : public OpHandleBase {
FeedFetchList *data_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a "private:" then?

for (auto *op : program.Block(0).AllOps()) {
bool change_forward = false;
if (!is_forwarding) {
// FIXME(yy): Do not hard code like this
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Can you add your reply as comments in the code?

@@ -79,7 +79,18 @@ void* GPUAllocator::Alloc(size_t& index, size_t size) {
// if size is 0. We just make sure it does.
if (size <= 0) return nullptr;
void* p;
int prev_id;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest this to be a PADDLE_ENFORCE. if the current behavior works. Otherwise, reader will think the Allocator currently works on multiple GPUs.

PADDLE_THROW("Nobody should wait FetchOp. Unexpceted Error");
}

void FetchOpHandle::WaitAndMergeCPUTensors() const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean really wait on anything? Maybe just MergeCPUTensors?

// FIXME: Currently ScaleLossGradOp only use device_count as scale
// factor. So it does not depend on any other operators.
// VarHandle *loss = GetVarHandle(loss_var_name, place);
// loss->pending_ops_.emplace_back(op_handle);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this op doesn't not depend on anything, when will it be scheduled?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be run at the first time.

*
* https://en.wikipedia.org/wiki/Hazard_(computer_architecture)#Write_after_read_(WAR)
*/
static void PolishGraphToSupportDataHazards(SSAGraph *graph);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add more comments to describe when does data hazard happens?

std::unordered_set<OpHandleBase *> ready_ops;

auto InsertPendingVar = [&pending_vars, &ready_vars](VarHandleBase &var) {
pending_vars.insert(&var);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be skipped if generated_op_ is nullptr?

@@ -29,7 +29,7 @@ void FileReader::ReadNext(std::vector<LoDTensor> *out) {

PADDLE_ENFORCE_EQ(actual.size(), expect.size());
for (int j = 0; j < actual.size(); ++j) {
PADDLE_ENFORCE(actual[i] == expect[i] || expect[i] == -1);
// PADDLE_ENFORCE(actual[i] == expect[i] || expect[i] == -1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update this line?

// Create local scopes.
for (auto &scope : local_scopes_) {
auto &local_scope = scope->NewScope();
*scope->Var("@TMP_SCOPE@")->GetMutable<Scope *>() = &local_scope;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are all the scopes using the same var name?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When Op::Run, there are some temporary variables will be created in local scopes. So, here just use Variable @TMP_SCOPE@ to holds these temporary variables. They will be destroied after a period.

auto *dep_var = new DummyVarHandle();
read_op->AddOutput(dep_var);
write_op->AddInput(dep_var);
graph->dep_vars_.emplace(dep_var);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be called data_hazard_vars_?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments have been added.

panyx0718
panyx0718 previously approved these changes Mar 29, 2018
@reyoung reyoung merged commit fa21436 into PaddlePaddle:develop Mar 30, 2018
@reyoung reyoung deleted the cpp_parallel_executor branch March 30, 2018 02:44
@chengduoZH chengduoZH added the parallel_exe parallel executor label Apr 6, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parallel_exe parallel executor
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants