rsp push and rsp pull for comm device, used in kvstore('device') #8732
Conversation
ping @reminisce @rahul003 for further reviews
src/kvstore/comm.h
Outdated
int type = std::get<2>(sorted_key_attrs_[i]);
NDArrayStorageType stype = std::get<3>(sorted_key_attrs_[i]);
nit: const?
src/kvstore/comm.h
Outdated
  const_vars[i] = reduce[i].var();
}
auto result = buf.merged;
Engine::Get()->PushAsync(
Should this be moved to ndarray.cc instead?
Why should this move into ndarray.cc? I think it is fine here, i.e. pushing the operation to the engine in comm.h.
Can we extend the ElementwiseSum function in https://github.com/apache/incubator-mxnet/blob/master/src/ndarray/ndarray.cc#L574 to handle row-sparse cases?
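A minimal sketch of what that extension could look like, assuming a storage-type dispatch inside ElementwiseSum (the row-sparse branch here is illustrative, not the merged implementation):

```cpp
// Hypothetical dispatch: keep the dense path as-is and add a row-sparse branch.
void ElementwiseSum(const std::vector<NDArray>& source, NDArray* out, int priority) {
  const NDArrayStorageType stype = out->storage_type();
  if (stype == kDefaultStorage) {
    // existing dense path: push the summation op to the engine unchanged
  } else if (stype == kRowSparseStorage) {
    // new path: push an engine op that requests temp space, computes the
    // union of row ids, and sums the retained rows of each source into *out
  } else {
    LOG(FATAL) << "Not implemented for storage_type " << stype;
  }
}
```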
src/kvstore/comm.h
Outdated
case gpu::kDevMask: {
  mxnet::ndarray::ElementwiseSum(rctx.get_stream<gpu>(), rsc, reduce, &out);
  break;
}
Is ctx.get_stream<gpu>()->Wait(); missing when CUDA is used?
MXNET_USE_CUDA is already checked at line 575.
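For reference, the pattern under discussion is roughly the following (a sketch combining the snippet above with the suggested wait):

```cpp
#if MXNET_USE_CUDA
    case gpu::kDevMask: {
      mxnet::ndarray::ElementwiseSum(rctx.get_stream<gpu>(), rsc, reduce, &out);
      // Block until the asynchronous GPU kernels finish before the engine
      // callback marks this op as complete.
      rctx.get_stream<gpu>()->Wait();
      break;
    }
#endif
```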
src/kvstore/comm.h
Outdated
mxnet::common::SparseRetainOpForwardRspWrapper<gpu>(rctx.get_stream<gpu>(),
    src_gpu, indices, kWriteTo, &temp);
break;
}
Is Stream->Wait() missing?
tests/python/gpu/test_kvstore_gpu.py
Outdated
# single
kv.init('a', mx.nd.zeros(shape, stype=stype))
# list
kv.init(str_keys, [mx.nd.zeros(shape=shape, stype=stype)] * len(keys))
return kv

@unittest.skip("Test fails intermittently. Temporarily disabled until fixed. Tracked at https://github.com/apache/incubator-mxnet/issues/8262")
The test should be fixed in #8838
tests/python/gpu/test_kvstore_gpu.py
Outdated
# single
kv.init('a', mx.nd.zeros(shape, stype=stype))
# list
kv.init(str_keys, [mx.nd.zeros(shape=shape, stype=stype)] * len(keys))
return kv

def test_row_sparse_pull():
    kv = init_kv_with_str('row_sparse')
def test_row_sparse_pull(kv_type='local'):
I think nosetest will not run the code in __main__ but rather look for functions starting with test_. Shall we test both local and device in test_row_sparse_pull?
On second thought, we should have a separate test for kv=device with rsp values on different contexts. Using gpu ctxes is fine; CI should have more than one GPU.
src/kvstore/comm.h
Outdated
} else {
  CHECK_EQ(out->storage_type(), kRowSparseStorage)
    << "BroadcastRowSparse expects row_sparse dst NDArray";
  const bool is_diff_ctx = out->ctx() != src.ctx();
Are we assuming src is always on GPU?
If so, should we perform retain first before copying it to other devices?
src is not assumed to be on gpu. Actually, src is always on cpu. As you can see in https://github.com/apache/incubator-mxnet/blob/master/src/kvstore/kvstore_local.h#L233, src is local_[key], and local_[key] is initialized on pinned_ctx_, which is always cpu: https://github.com/apache/incubator-mxnet/blob/master/src/kvstore/kvstore_local.h#L152.
That's true at the beginning. But as soon as you push some gradients on GPU, it copies the weight from pinned_ctx to GPU. See
https://github.com/apache/incubator-mxnet/blob/master/src/kvstore/kvstore_local.h#L173
Nonetheless, I think performing sparse retain before the copy makes more sense since the source array is usually very large.
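A sketch of the suggested order of operations, reusing the retain wrapper already shown above (the retained temporary is illustrative):

```cpp
// Retain only the requested rows while still on src's context, then copy the
// much smaller retained result across devices instead of the whole array.
NDArray retained(kRowSparseStorage, src.shape(), src.ctx(), true, src.dtype());
mxnet::common::SparseRetainOpForwardRspWrapper<cpu>(
    rctx.get_stream<cpu>(), src, indices, kWriteTo, &retained);
CopyFromTo(retained, out, priority);  // cross-device copy of retained rows only
```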
@eric-haibin-lin Yes, I think so.
Can resource_request also be added to ndarray.cc? Others may use ElementwiseSum as a black box. Also see https://github.com/apache/incubator-mxnet/blob/master/src/ndarray/ndarray.cc#L667, which requests a temp resource.
Got it. I think it is OK. Thanks for the reference!
src/common/utils.h
Outdated
@@ -215,6 +215,13 @@ void CheckFormatImpl(const RunContext &rctx, const NDArray &input,
}

template<typename xpu>
Let's make sure functions in .h files are documented. We should add some description for CastStorageDispatch too.
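For example, a doxygen-style description along these lines would do (the wording and the exact signature are suggestions only):

```cpp
/*!
 * \brief Cast the storage type of `input` and write the result into `output`,
 *        dispatched on the device type `xpu`.
 * \param ctx    context holding the device stream to run on
 * \param input  source NDArray
 * \param output destination NDArray with the target storage type
 */
template<typename xpu>
void CastStorageDispatch(const OpContext& ctx,
                         const NDArray& input, const NDArray& output);
```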
src/kvstore/comm.h
Outdated
CHECK_EQ(src.storage_type(), kRowSparseStorage)
    << "BroadcastRowSparse expects row-sparse src NDArray";

bool is_same_rowid = true;
Please add some brief description explaining the optimization
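Something short would suffice, e.g. (a suggested comment mirroring the explanation in this thread):

```cpp
// Optimization: when every destination requests the same row ids, perform the
// sparse retain only once and broadcast the retained result to all outputs,
// instead of retaining separately per output.
bool is_same_rowid = true;
```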
src/kvstore/comm.h
Outdated
if (is_diff_ctx) {
  CopyFromTo(src, &src_gpu, priority);
}
NDArray row_id_gpu = NDArray(row_id.shape(), out->ctx(), false, mshadow::kInt64);
Does it still work if the user provides outputs on the cpu device?
Yes. More unit tests are added.
tests/python/gpu/test_kvstore_gpu.py
Outdated
check_rsp_pull(kv, 1, [mx.gpu(0)])
check_rsp_pull(kv, 4, [mx.gpu(i//2) for i in range(4)])
check_rsp_push_pull('local')
check_rsp_push_pull('device')
I think we should have test cases that at least cover the following for kvstore=device:
- push cpu then rsp_pull cpu
- push gpu then rsp_pull gpu
src/kvstore/utils.h
Outdated
})
}

void CopyRetainedRows(RunContext rctx,
Please add a brief comment.
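For example (suggested wording; the remaining parameters are elided as in the snippet above):

```cpp
/*!
 * \brief Copy the retained rows of a row-sparse source array into the
 *        destination array, using the stream held by `rctx`.
 */
void CopyRetainedRows(RunContext rctx, /* src, indices, dst, ... */);
```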
tests/python/gpu/test_kvstore_gpu.py
Outdated
check_rsp_pull(kv, 4, [mx.gpu(i//2) for i in range(4)])
check_rsp_pull(kv, 4, [mx.cpu(i) for i in range(4)])

check_rsp_push_pull('local')
Do we have the test case where the same row_id is used for rsp_pull?
Added
src/kvstore/comm.h
Outdated
CHECK_EQ(src.storage_type(), kRowSparseStorage)
    << "BroadcastRowSparse expects row-sparse src NDArray";

// whether the indices are the same
This code is duplicated in comm.h and kvstore_local.h. Shall we move it to utils.h?
Moved
GPU unique. Same row id for every GPU.
Replace: 1 gpu, 2 gpu, 4 gpu, 8 gpu.
src/kvstore/utils.cc
Outdated
const TBlob& rowid_i = val_rowids[i].second.data();
if (rowid_i.dptr<IType>() != first_dptr
    || rowid_i.Size() != first_size) {
  is_same_rowid = false;
nit: we can return false directly if they don't match
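That is, inside the type-switched loop shown above (sketch):

```cpp
for (size_t i = 1; i < val_rowids.size(); ++i) {
  const TBlob& rowid_i = val_rowids[i].second.data();
  if (rowid_i.dptr<IType>() != first_dptr || rowid_i.Size() != first_size) {
    return false;  // mismatch found; no flag-and-break needed
  }
}
return true;
```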
src/ndarray/ndarray.cc
Outdated
}, ret.ctx(), const_vars, {ret.var(), rsc.var},
FnProperty::kNormal, priority, PROFILER_MESSAGE("RowSparseElementwiseSum"));
} else {
  LOG(FATAL) << "Not implemented for storage_type " << stype;
<< common::stype_string(stype);
check_row_sparse_pull(kv, 1, mx.gpu(0))
check_row_sparse_pull(kv, 4, mx.gpu(0))
check_rsp_pull(kv, 1, [mx.gpu(0)])
Can we also add support for passing a list of values with a single rowid?
Added
for i in range(count):
    row_id = np.random.randint(num_rows, size=num_rows)
    row_ids.append(mx.nd.array(row_id, dtype='int64'))
row_ids_to_pull = row_ids[0] if (len(row_ids) == 1 or is_same_rowid) else row_ids
test here for single rowid with multiple vals
Anything to address? @reminisce
src/kvstore/comm.h
Outdated
} else {
  LOG(FATAL) << "storage type " << stype << " not implemented for device yet";
}
sorted_key_attrs_.push_back(std::make_tuple(key, shape, dtype, stype));
Using emplace_back(key, shape, dtype, stype) can avoid constructing a temporary tuple object.
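That is, a drop-in replacement for the push_back line above; the tuple is constructed in place from the arguments:

```cpp
sorted_key_attrs_.emplace_back(key, shape, dtype, stype);
```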
src/kvstore/comm.h
Outdated
@@ -681,8 +749,9 @@ class CommDevice : public Comm {
}
for (size_t i = 0; i < sorted_key_attrs_.size(); ++i) {
  int key = std::get<0>(sorted_key_attrs_[i]);
const int
src/kvstore/comm.h
Outdated
@@ -681,8 +749,9 @@ class CommDevice : public Comm {
}
for (size_t i = 0; i < sorted_key_attrs_.size(); ++i) {
  int key = std::get<0>(sorted_key_attrs_[i]);
  TShape s = std::get<1>(sorted_key_attrs_[i]);
  TShape shape = std::get<1>(sorted_key_attrs_[i]);
const TShape&
src/kvstore/comm.h
Outdated
@@ -681,8 +749,9 @@ class CommDevice : public Comm {
}
for (size_t i = 0; i < sorted_key_attrs_.size(); ++i) {
  int key = std::get<0>(sorted_key_attrs_[i]);
  TShape s = std::get<1>(sorted_key_attrs_[i]);
  TShape shape = std::get<1>(sorted_key_attrs_[i]);
  int type = std::get<2>(sorted_key_attrs_[i]);
const int
src/kvstore/utils.cc
Outdated
bool CheckSameRowid(
    const std::vector<std::pair<NDArray*, NDArray>>& val_rowids) {
  MSHADOW_TYPE_SWITCH(val_rowids[0].second.dtype(), IType, {
    const TBlob& rowid_first = val_rowids[0].second.data();
Accessing data() outside the engine is dangerous. We can compare NDArray::ptr_ and the offset instead.
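A sketch of the idea; since ptr_ is a private member, such a comparison would have to live inside NDArray, and the member names here are illustrative (the PR ultimately moved the same-rowid check to the frontend instead):

```cpp
// Hypothetical NDArray member: two arrays alias the same data iff they share
// the storage chunk and start at the same offset.
bool IsSameArray(const NDArray& other) const {
  return ptr_ == other.ptr_ && offset_ == other.offset_;
}
```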
python/mxnet/kvstore.py
Outdated
row_ids = [row_ids]
assert(isinstance(row_ids, list)), \
    "row_ids should be NDArray or list of NDArray"
out_val = out
I'd prefer first_out to out_val. I also recommend documenting the optimization upfront instead of at the end of the function:
"When there is only one row_id, we can invoke KVStoreRowSparsePull just once and broadcast the result to all the rest of the outputs."
Comments are added to the docstring.
python/mxnet/kvstore.py
Outdated
"row_ids should be NDArray or list of NDArray" | ||
out_val = out | ||
# whether row_ids are the same | ||
is_same_rowid = False |
Prefer single_rowid to is_same_rowid.
…che#8732)
* comm device for rsp push and pull
* update
* update test
* optimization for same row_ids
* add stream->wait
* remove using space
* fix race of rsc and extend ElementwiseSum to rsp cases
* add log fatal in ElementwiseSum
* direct copy rows if full rsp and put all outputs on ctx of src
* trigger
* fix
* simplify copy
* move check same rowids to utils and add test for same rowids case
* remove direct copy row by row
* fix checkSameRowid
* gpu unique impl draft
* unique
* update
* fix windows build
* trigger windows build
* support single rowid with multiple vals
* address comments
* check same row_ids and copy in fronted
* revise names and disable test for local kvstore
Description
Although the added test passes on my machine (CentOS 7, 8G 1080 GPU), the test is currently skipped in master, waiting to be fixed.
cc @eric-haibin-lin
Checklist
Essentials
Passed code style checking (make lint)
Changes
Comments