Merge selected rows with dynamic variable count #8023
Conversation
…_merge_selectedrows
@@ -155,6 +158,11 @@ class ListenAndServOp : public framework::OperatorBase {
      } catch (std::exception &e) {
        LOG(ERROR) << "run sub program error " << e.what();
      }

      // TODO(Yancey1989): use an operator to reset sparse variable
How about adding more comments about how to achieve sparse updating?
Done, I will add some details about remote sparse update in the dist design doc.
@@ -141,10 +142,12 @@ class ListenAndServOp : public framework::OperatorBase {
          PADDLE_THROW("Can not find server side var");
        }
        detail::DeserializeFromMessage(v.second, dev_ctx, var);
        if (var->IsType<framework::SelectedRows>()) {
          sparse_vars.push_back(var);
Do we need to reset sparse_vars somewhere, or stop increasing its size under some condition? Otherwise it looks like the sparse_vars size will keep increasing.
sparse_vars would be created anew before starting to receive vars for each mini-batch, but you're right, I think we could move sparse_vars into this function block and reset it after each mini-batch.
Done.
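For reference, a minimal self-contained sketch of the idea discussed above, with a toy struct and a hypothetical serving loop standing in for the real ListenAndServOp and framework::SelectedRows: sparse_vars is scoped to a single mini-batch and the received rows are cleared afterwards.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Toy stand-in for framework::SelectedRows (hypothetical, only what's needed).
struct ToySelectedRows {
  std::vector<int64_t> rows;
  std::vector<std::vector<float>> value;
};

int main() {
  // Hypothetical serving loop: one iteration per mini-batch.
  for (int batch = 0; batch < 2; ++batch) {
    // sparse_vars is scoped to the mini-batch, so it cannot keep growing
    // across batches.
    std::vector<ToySelectedRows*> sparse_vars;

    // Stand-in for receiving and deserializing a sparse variable.
    ToySelectedRows grad{{1, 2, 5}, {{1, 1}, {2, 2}, {3, 3}}};
    sparse_vars.push_back(&grad);

    // ... the optimize sub-program (e.g. SumOp) would run here ...

    // Reset the received sparse variables after the batch, mirroring the
    // rows-clearing discussed in this thread.
    for (auto* var : sparse_vars) {
      var->rows.clear();
    }
    std::cout << "batch " << batch << " done, rows left: " << grad.rows.size()
              << "\n";
  }
  return 0;
}
```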
paddle/operators/send_op.cc (Outdated)
VLOG(3) << "sending " << ins[i] << " to " << epmap[i]; | ||
rpc_client->AsyncSendVariable(epmap[i], ctx, scope, ins[i]); | ||
} else { | ||
VLOG(3) << "don't send no-initialied variable: " << ins[i]; |
Sorry, I don't understand why it's possible that the send OP will execute when its inputs are not initialized. Is it a special thing for the SelectedRows op?
If we split a var (SelectedRows) into multiple vars (SelectedRows) with the split_selected_rows operator, not all of the outputs have a value. For example:
The input var (X):
X.rows = [1,2,5]
X.height = 10
X.value = {{1,1}, {2,2}, {3,3}}
attr.height_sections = [4, 4, 2]
After split_selected_rows, the output vars (Out):
Out[0].rows = [1,2]
Out[0].height = 4
Out[0].value = {{1,1}, {2,2}}
Out[1].rows = [3]
Out[1].height = 4
Out[1].value = {{3,3}}
Out[2].rows = []
Out[2].height = 2
Out[2].value = {}
The value of the output var Out[2] is empty, which means out[i].mutable_value().mutable_data() has never been called to allocate its memory.
Thanks!
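As a side note, here is a minimal self-contained sketch of the splitting behaviour described in the comment above. A toy struct stands in for SelectedRows, and this sketch assumes rows are re-indexed relative to each section start, so the concrete numbers may differ from the example in the comment; the point it illustrates is that a section receiving no rows stays uninitialized and is skipped when sending.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Toy stand-in for a SelectedRows variable (hypothetical, not the real class).
struct ToySelectedRows {
  std::vector<int64_t> rows;              // row indices
  int64_t height;
  std::vector<std::vector<float>> value;  // one value row per entry in rows
};

// Split x into one output per height section. A section that receives no rows
// keeps an empty value, i.e. it is never initialized.
std::vector<ToySelectedRows> SplitByHeightSections(
    const ToySelectedRows& x, const std::vector<int64_t>& sections) {
  std::vector<ToySelectedRows> outs(sections.size());
  for (size_t i = 0; i < sections.size(); ++i) outs[i].height = sections[i];

  for (size_t r = 0; r < x.rows.size(); ++r) {
    int64_t start = 0;
    for (size_t s = 0; s < sections.size(); ++s) {
      if (x.rows[r] < start + sections[s]) {
        // Assumption of this sketch: rows are re-indexed relative to the
        // section start; the real operator's indexing may differ.
        outs[s].rows.push_back(x.rows[r] - start);
        outs[s].value.push_back(x.value[r]);
        break;
      }
      start += sections[s];
    }
  }
  return outs;
}

int main() {
  ToySelectedRows x{{1, 2, 5}, 10, {{1, 1}, {2, 2}, {3, 3}}};
  auto outs = SplitByHeightSections(x, {4, 4, 2});
  for (size_t i = 0; i < outs.size(); ++i) {
    if (outs[i].value.empty()) {
      // Mirrors the send_op guard above: an empty output was never allocated,
      // so it should not be sent.
      std::cout << "Out[" << i << "] is empty, skip sending\n";
    } else {
      std::cout << "Out[" << i << "] has " << outs[i].rows.size() << " rows\n";
    }
  }
  return 0;
}
```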
@@ -101,6 +101,9 @@ class ListenAndServOp : public framework::OperatorBase {

    // TODO(typhoonzero): change this to a while_op for every cluster-batch.
    bool exit_flag = false;
    // Recored received sparse variables, so that
Recored -> Record
Done.
@@ -155,9 +160,18 @@ class ListenAndServOp : public framework::OperatorBase {
      } catch (std::exception &e) {
        LOG(ERROR) << "run sub program error " << e.what();
      }

      // Reset the received sparse variables, the sum operator would not
Sorry, I don't quite understand this part. The reset seems to clear the rows vector, but the comment says "the sum operator would not sum the input selectedrows variables which rows is empty" — yet after the reset, aren't the rows exactly the empty ones?
Because SumOp actually runs when the optimize sub-program above is executed. Clearing the rows here guarantees that, before the next optimize round, the rows of all sparse variables are empty, so the sum only covers the sparse variables that have actually been received.
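A minimal self-contained sketch of that behaviour, with a toy struct standing in for SelectedRows (the real SumOp's handling of duplicate rows and other details is not modeled here): an input whose rows are empty contributes nothing to the sum, so variables cleared after the previous mini-batch are effectively ignored until they are received again.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Toy stand-in for SelectedRows (hypothetical, not the real framework class).
struct ToySelectedRows {
  std::vector<int64_t> rows;
  std::vector<std::vector<float>> value;
};

// Sketch: summing SelectedRows inputs by appending rows and values, while
// skipping inputs whose rows are empty.
ToySelectedRows SumSelectedRows(const std::vector<ToySelectedRows>& ins) {
  ToySelectedRows out;
  for (const auto& in : ins) {
    if (in.rows.empty()) continue;  // skip cleared / not-received variables
    out.rows.insert(out.rows.end(), in.rows.begin(), in.rows.end());
    out.value.insert(out.value.end(), in.value.begin(), in.value.end());
  }
  return out;
}

int main() {
  ToySelectedRows received{{1, 2}, {{1, 1}, {2, 2}}};
  ToySelectedRows cleared;  // rows were reset after the previous mini-batch
  ToySelectedRows merged = SumSelectedRows({received, cleared});
  std::cout << "merged rows: " << merged.rows.size() << "\n";  // prints 2
  return 0;
}
```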
@@ -269,6 +271,7 @@ def _create_var_for_trainers(self, block, var, trainers):
                name="%s.trainer_%d" % (var.name, i),
                psersistable=var.persistable,
                dtype=var.dtype,
                type=var.type,
Seems we are merging the split SelectedRows using concat_op, and it's not supported?
sum_op implements the same feature, so I reuse it.
…_merge_selectedrows
We need to merge this first and fix the develop branch dist train bugs.
Fixed #7913