Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge selected rows with dynamic variable count #8023

Merged

Conversation

Yancey1989
Copy link
Contributor

Fixed #7913

@Yancey1989 Yancey1989 changed the title dynamic send/recv selected rows Merge selected rows with dynamic variable count Feb 1, 2018
@@ -155,6 +158,11 @@ class ListenAndServOp : public framework::OperatorBase {
} catch (std::exception &e) {
LOG(ERROR) << "run sub program error " << e.what();
}

// TODO(Yancey1989): use an operator to reset sprase variable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about add more comments about how to achieve sparse updating?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, I will add some details about remote sparse update in the dist desig doc.

@@ -141,10 +142,12 @@ class ListenAndServOp : public framework::OperatorBase {
PADDLE_THROW("Can not find server side var");
}
detail::DeserializeFromMessage(v.second, dev_ctx, var);
if (var->IsType<framework::SelectedRows>()) {
sparse_vars.push_back(var);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to reset sparse_vars somewhere, or stop increase its size under some condition? Otherwise it looks like the sparse_vars size will keep increasing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sparse_vars would a new one before start to receive vars for each mini-batch, but you're right, I think we could move sparse_vars for this function block and reset it after each mini-batch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

VLOG(3) << "sending " << ins[i] << " to " << epmap[i];
rpc_client->AsyncSendVariable(epmap[i], ctx, scope, ins[i]);
} else {
VLOG(3) << "don't send no-initialied variable: " << ins[i];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I don't understand why it's possible that the send OP will execute when its inputs are not initialized, is it a special thing for the select row OP?

Copy link
Contributor Author

@Yancey1989 Yancey1989 Feb 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we split a var(selected_rows) into multiple vars(selected_rows) by split_selected_rows operator, not all the outputs have the value, for example:

The Tnput var(X):

X.rows = [1,2,5]
X.height = 10
X.value = {{1,1}, {2,2}, {3,3}}
attr.height_sections=[4, 4, 2]

After split_selected_rows:
The outputs vars(Out)

Out[0].rows = [1,2]
Out[0].height= 4
Out[0].value = {{1,1}, {2,2}}

Out[1].rows = [3]
Out[1].height = 4
Out[1].value = {{3,3}}

Out[2].rows = []
Out[2].height = 2
Out[2].value = {}

the value of output var(Out[2]) is empty, which means it hasn't called out[i].mutable_value().mutable_data() to allocate memory.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@@ -101,6 +101,9 @@ class ListenAndServOp : public framework::OperatorBase {

// TODO(typhoonzero): change this to a while_op for every cluster-batch.
bool exit_flag = false;
// Recored received sparse variables, so that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recored -> Record

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -155,9 +160,18 @@ class ListenAndServOp : public framework::OperatorBase {
} catch (std::exception &e) {
LOG(ERROR) << "run sub program error " << e.what();
}

// Reset the received sparse variables, the sum operator would not
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

不好意思,这里不是很懂,reset好像是把rows这个vector给重置了,但是comment里说“the sum operator would not sum the input selectedrows variables which rows is empty“,但重置之后的rows不就是空的吗?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

因为SumOp其实是在上面执行optimize sub program时执行的,这里置空是为了保证下一轮optimize的之前所有sparse variable的rows都为空,这样就只针对收到的sparse variable做sum了。

@@ -269,6 +271,7 @@ def _create_var_for_trainers(self, block, var, trainers):
name="%s.trainer_%d" % (var.name, i),
psersistable=var.persistable,
dtype=var.dtype,
type=var.type,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems we are merging splited SelectedRows using concat_op and it's not supported?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sum_op implement the same feature, so I reuse it.

@CLAassistant
Copy link

CLAassistant commented Feb 11, 2018

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@typhoonzero typhoonzero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to merge this first and fix the develop branch dist train bugs.

@Yancey1989 Yancey1989 merged commit caf9a09 into PaddlePaddle:develop Feb 11, 2018
@Yancey1989 Yancey1989 deleted the dynamic_merge_selectedrows branch February 11, 2018 07:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants