This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

rsp push and rsp pull for comm device, used in kvstore('device') #8732

Merged — 35 commits merged into apache:master on Jan 15, 2018

Conversation

ZiyueHuang (Member)

Description

Although the added test passes on my machine (CentOS 7, GTX 1080 GPU with 8 GB), the corresponding test in master is currently skipped and waiting to be fixed.

cc @eric-haibin-lin
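
For context, a minimal usage sketch of the workflow this PR enables (assuming MXNet's Python API and two visible GPUs; all names and shapes are illustrative):

```python
import mxnet as mx

shape = (100, 50)
kv = mx.kv.create('device')  # communication happens on GPU with kvstore('device')
kv.init('w', mx.nd.zeros(shape, stype='row_sparse'))

# each device pushes a row_sparse gradient
grads = [mx.nd.ones(shape, ctx=mx.gpu(i)).tostype('row_sparse') for i in range(2)]
kv.push('w', grads)

# pull back only the rows each device actually needs
row_ids = mx.nd.array([0, 2, 5], dtype='int64')
outs = [mx.nd.zeros(shape, ctx=mx.gpu(i), stype='row_sparse') for i in range(2)]
kv.row_sparse_pull('w', out=outs, row_ids=[row_ids, row_ids])
```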

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • For user-facing API changes, API doc string has been updated. For new C++ functions in header files, their functionalities and arguments are well-documented.
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • rsp push and rsp pull for comm device

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@eric-haibin-lin eric-haibin-lin self-assigned this Nov 21, 2017
@ZiyueHuang ZiyueHuang changed the title [WIP] rsp push and rsp pull for comm device rsp push and rsp pull for comm device Nov 27, 2017
@ZiyueHuang ZiyueHuang changed the title rsp push and rsp pull for comm device rsp push and rsp pull for comm device, used in kvstore('device') Nov 27, 2017
@eric-haibin-lin eric-haibin-lin (Member) left a comment:

ping @reminisce @rahul003 for further reviews

int type = std::get<2>(sorted_key_attrs_[i]);
NDArrayStorageType stype = std::get<3>(sorted_key_attrs_[i]);
Member:

nit: const?

const_vars[i] = reduce[i].var();
}
auto result = buf.merged;
Engine::Get()->PushAsync(
Member:

Should this be moved to ndarray.cc instead?

@ZiyueHuang ZiyueHuang (Member Author), Nov 28, 2017:

Why should this be moved into ndarray.cc? I think it is fine here, i.e. pushing the operation into the engine in comm.h.

Member:

Can we extend the ElementwiseSum function in https://github.com/apache/incubator-mxnet/blob/master/src/ndarray/ndarray.cc#L574 to handle row-sparse cases?

case gpu::kDevMask: {
mxnet::ndarray::ElementwiseSum(rctx.get_stream<gpu>(), rsc, reduce, &out);
break;
}
Member:

Is ctx.get_stream<gpu>()->Wait(); missing when CUDA is used?

Member Author:

MXNET_USE_CUDA is already checked on line 575.

mxnet::common::SparseRetainOpForwardRspWrapper<gpu>(rctx.get_stream<gpu>(),
src_gpu, indices, kWriteTo, &temp);
break;
}
Member:

Is Stream->Wait() missing?

# single
kv.init('a', mx.nd.zeros(shape, stype=stype))
# list
kv.init(str_keys, [mx.nd.zeros(shape=shape, stype=stype)] * len(keys))
return kv


@unittest.skip("Test fails intermittently. Temporarily disabled until fixed. Tracked at https://github.com/apache/incubator-mxnet/issues/8262")
Member:

The test should be fixed in #8838


# single
kv.init('a', mx.nd.zeros(shape, stype=stype))
# list
kv.init(str_keys, [mx.nd.zeros(shape=shape, stype=stype)] * len(keys))
return kv


def test_row_sparse_pull():
kv = init_kv_with_str('row_sparse')
def test_row_sparse_pull(kv_type='local'):
Member:

I think nosetests will not run the code in __main__ but rather looks for functions starting with test_. Shall we test both local and device in test_row_sparse_pull?

Member:

On second thought, we should have a separate test for kv=device with rsp values on different contexts. Using GPU ctxs is fine; CI should have more than one GPU.
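
A minimal sketch of such a test (the helper name and sizes are hypothetical; it assumes at least two visible GPUs):

```python
import mxnet as mx
import numpy as np

def check_rsp_pull_device(num_rows=100, num_cols=10):
    shape = (num_rows, num_cols)
    kv = mx.kv.create('device')
    kv.init('w', mx.nd.ones(shape).tostype('row_sparse'))
    # outputs live on different GPU contexts
    ctxs = [mx.gpu(i // 2) for i in range(4)]
    outs = [mx.nd.zeros(shape, ctx=ctx, stype='row_sparse') for ctx in ctxs]
    row_id = mx.nd.array(np.random.randint(num_rows, size=num_rows), dtype='int64')
    kv.row_sparse_pull('w', out=outs, row_ids=[row_id] * len(outs))
    # only the requested rows should be retained; the rest stay zero
    expected = np.ones(shape)
    expected[np.setdiff1d(np.arange(num_rows), row_id.asnumpy())] = 0
    for out in outs:
        assert np.allclose(out.asnumpy(), expected)
```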

} else {
CHECK_EQ(out->storage_type(), kRowSparseStorage)
<< "BroadcastRowSparse expects row_sparse dst NDArray";
const bool is_diff_ctx = out->ctx() != src.ctx();
Member:

Are we assuming src is always on GPU?
If so, should we perform retain first before copying it to other devices?

Member Author:

src is not assumed to be on GPU; actually, src is always on CPU. As you can see in https://github.com/apache/incubator-mxnet/blob/master/src/kvstore/kvstore_local.h#L233, src is local_[key], and local_[key] is initialized on pinned_ctx_, which is always CPU: https://github.com/apache/incubator-mxnet/blob/master/src/kvstore/kvstore_local.h#L152.

Member:

That's true at the beginning. But as soon as you push some gradients on GPU, it copies the weight from pinned_ctx to GPU. See
https://github.com/apache/incubator-mxnet/blob/master/src/kvstore/kvstore_local.h#L173

Member:

Nonetheless, I think performing sparse retain before the copy makes more sense since the source array is usually very large.

@ZiyueHuang ZiyueHuang (Member Author) commented Nov 28, 2017:

@eric-haibin-lin Yes, I think ElementwiseSum in ndarray.cc can be extended to handle rsp ndarrays, but this line, Resource rsc = ResourceManager::Get()->Request(rctx.ctx, ResourceRequest(ResourceRequest::kTempSpace));, should be inserted into ElementwiseSum since temp space is needed for the rsp cases. Is this OK?

@eric-haibin-lin (Member):

Can the resource request also be added in ndarray.cc? Others may use ElementwiseSum as a black box. Also see https://github.com/apache/incubator-mxnet/blob/master/src/ndarray/ndarray.cc#L667, which requests a temp resource.

@ZiyueHuang (Member Author):

Got it. I think it is OK. Thanks for the reference!

@@ -215,6 +215,13 @@ void CheckFormatImpl(const RunContext &rctx, const NDArray &input,
}


template<typename xpu>
Member:

Let's make sure functions in .h are documented. Should add some description for CastStorageDispatch too...

CHECK_EQ(src.storage_type(), kRowSparseStorage)
<< "BroadcastRowSparse expects row-sparse src NDArray";

bool is_same_rowid = true;
Member:

Please add some brief description explaining the optimization


if (is_diff_ctx) {
CopyFromTo(src, &src_gpu, priority);
}
NDArray row_id_gpu = NDArray(row_id.shape(), out->ctx(), false, mshadow::kInt64);
Member:

Does it still work if the user provides outputs on the CPU device?

Member Author:

Yes. More unit tests have been added.

check_rsp_pull(kv, 1, [mx.gpu(0)])
check_rsp_pull(kv, 4, [mx.gpu(i//2) for i in range(4)])
check_rsp_push_pull('local')
check_rsp_push_pull('device')
Member:

I think we should have test cases that at least cover the following for kvstore=device (see the sketch below):
push cpu then rsp_pull cpu
push gpu then rsp_pull gpu
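
A rough sketch of what such cases could look like (the helper name is hypothetical; it assumes no optimizer is attached to the kvstore, so a pull returns the summed pushed value):

```python
import mxnet as mx

def check_push_then_rsp_pull(push_ctx, pull_ctx, shape=(8, 4)):
    kv = mx.kv.create('device')
    kv.init('w', mx.nd.zeros(shape, stype='row_sparse'))
    # push a row_sparse gradient from push_ctx ...
    kv.push('w', mx.nd.ones(shape, ctx=push_ctx).tostype('row_sparse'))
    # ... then pull all rows back onto pull_ctx
    out = mx.nd.zeros(shape, ctx=pull_ctx, stype='row_sparse')
    row_ids = mx.nd.arange(0, shape[0], dtype='int64')
    kv.row_sparse_pull('w', out=out, row_ids=row_ids)
    assert (out.asnumpy() == 1).all()

check_push_then_rsp_pull(mx.cpu(), mx.cpu())    # push cpu then rsp_pull cpu
check_push_then_rsp_pull(mx.gpu(0), mx.gpu(0))  # push gpu then rsp_pull gpu
```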

})
}

void CopyRetainedRows(RunContext rctx,
Member:

Please add a brief comment.

check_rsp_pull(kv, 4, [mx.gpu(i//2) for i in range(4)])
check_rsp_pull(kv, 4, [mx.cpu(i) for i in range(4)])

check_rsp_push_pull('local')
Member:

Do we have a test case where the same row_id is used for rsp_pull?

Member Author:

Added

CHECK_EQ(src.storage_type(), kRowSparseStorage)
<< "BroadcastRowSparse expects row-sparse src NDArray";

// whether the indices are the same
Member:

This code is duplicated in comm.h and kvstore_local.h. Shall we move it to util.h?

Member Author:

Moved

@ZiyueHuang ZiyueHuang (Member Author) commented Dec 17, 2017:

Benchmark with the GPU unique implementation; the same row id is used on every GPU.

| batch size | Device | samples/sec |
| --- | --- | --- |
| 16384 | 1 gpu | 2.8 M |
| 16384 * 2 | 2 gpu | 3.3 M |
| 16384 * 4 | 4 gpu | 4.3 M |
| 16384 * 8 | 8 gpu | 5.6 M |

@ZiyueHuang ZiyueHuang (Member Author) commented Dec 17, 2017:

profile.zip

After replacing mx.nd.waitall with wait_to_read():

| batch size | Device | samples/sec |
| --- | --- | --- |
| 16384 | 1 gpu | 3 M |
| 16384 * 2 | 2 gpu | 3.7 M |
| 16384 * 4 | 4 gpu | 5 M |
| 16384 * 8 | 8 gpu | 6.5 M |

Profiler timelines for 1, 2, 4, and 8 GPUs (screenshots omitted).

@ZiyueHuang (Member Author):

export MXNET_GPU_TEMP_COPY=4

| batch size | Device | samples/sec |
| --- | --- | --- |
| 16384 | 1 gpu | 3.4 M |
| 16384 * 2 | 2 gpu | 4.4 M |
| 16384 * 4 | 4 gpu | 6.7 M |
| 16384 * 8 | 8 gpu | 9.1 M |

Profiler timelines for 1, 2, 4, and 8 GPUs (screenshots omitted).

profile.zip

const TBlob& rowid_i = val_rowids[i].second.data();
if (rowid_i.dptr<IType>() != first_dptr
|| rowid_i.Size() != first_size) {
is_same_rowid = false;
Member:

nit: we can return false directly if they don't match

}, ret.ctx(), const_vars, {ret.var(), rsc.var},
FnProperty::kNormal, priority, PROFILER_MESSAGE("RowSparseElementwiseSum"));
} else {
LOG(FATAL) << "Not implemented for storage_type " << stype;
Member:

<< common::stype_string(stype);


check_row_sparse_pull(kv, 1, mx.gpu(0))
check_row_sparse_pull(kv, 4, mx.gpu(0))
check_rsp_pull(kv, 1, [mx.gpu(0)])
Member:

Can we also add support for passing a list of values with a single rowid?

Member Author:

Added
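
A usage sketch of that case (a single row_ids NDArray shared by a list of outputs; it assumes two visible GPUs and illustrative names):

```python
import mxnet as mx

shape = (100, 20)
kv = mx.kv.create('device')
kv.init('w', mx.nd.ones(shape).tostype('row_sparse'))

row_ids = mx.nd.array([0, 3, 7], dtype='int64')
outs = [mx.nd.zeros(shape, ctx=mx.gpu(i), stype='row_sparse') for i in range(2)]
# one row_ids NDArray shared by a list of output values
kv.row_sparse_pull('w', out=outs, row_ids=row_ids)
```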

for i in range(count):
row_id = np.random.randint(num_rows, size=num_rows)
row_ids.append(mx.nd.array(row_id, dtype='int64'))
row_ids_to_pull = row_ids[0] if (len(row_ids) == 1 or is_same_rowid) else row_ids
Member Author:

The test here covers a single rowid with multiple vals.

@ZiyueHuang (Member Author):

Anything to address? @reminisce

} else {
LOG(FATAL) << "storage type " << stype << " not implemented for device yet";
}
sorted_key_attrs_.push_back(std::make_tuple(key, shape, dtype, stype));
Contributor:

Using emplace_back(key, shape, dtype, stype) can avoid constructing a temporary tuple object.

@@ -681,8 +749,9 @@ class CommDevice : public Comm {
}
for (size_t i = 0; i < sorted_key_attrs_.size(); ++i) {
int key = std::get<0>(sorted_key_attrs_[i]);
Contributor:

const int

@@ -681,8 +749,9 @@ class CommDevice : public Comm {
}
for (size_t i = 0; i < sorted_key_attrs_.size(); ++i) {
int key = std::get<0>(sorted_key_attrs_[i]);
TShape s = std::get<1>(sorted_key_attrs_[i]);
TShape shape = std::get<1>(sorted_key_attrs_[i]);
Contributor:

const TShape&

@@ -681,8 +749,9 @@ class CommDevice : public Comm {
}
for (size_t i = 0; i < sorted_key_attrs_.size(); ++i) {
int key = std::get<0>(sorted_key_attrs_[i]);
TShape s = std::get<1>(sorted_key_attrs_[i]);
TShape shape = std::get<1>(sorted_key_attrs_[i]);
int type = std::get<2>(sorted_key_attrs_[i]);
Contributor:

const int

bool CheckSameRowid(
const std::vector<std::pair<NDArray*, NDArray>>& val_rowids) {
MSHADOW_TYPE_SWITCH(val_rowids[0].second.dtype(), IType, {
const TBlob& rowid_first = val_rowids[0].second.data();
Member:

Accessing data() outside the engine is dangerous. We can compare NDArray::ptr_ and offset instead.

row_ids = [row_ids]
assert(isinstance(row_ids, list)), \
"row_ids should be NDArray or list of NDArray"
out_val = out
Member:

I'd prefer first_out to out_val. I also recommend documenting the optimization upfront instead of at the end of the function:
"When there is only one row_id, we can invoke KVStoreRowSparsePull just once and broadcast the result to all the rest of the outputs."

Member Author:

Comments have been added to the doc string.

"row_ids should be NDArray or list of NDArray"
out_val = out
# whether row_ids are the same
is_same_rowid = False
Member:

prefer single_rowid to is_same_rowid

@eric-haibin-lin eric-haibin-lin merged commit 786e376 into apache:master Jan 15, 2018
CodingCat pushed a commit to CodingCat/mxnet that referenced this pull request Jan 16, 2018
rsp push and rsp pull for comm device, used in kvstore('device') (apache#8732)

* comm device for rsp push and pull

* update

* update test

* optimization for same row_ids

* add stream->wait

* remove using space

* fix race of rsc and extend ElementwiseSum to rsp cases

* add log fatal in ElementwiseSum

* direct copy rows if full rsp and put all outputs on ctx of src

* trigger

* fix

* simplify copy

* move check same rowids to utils and add test for same rowids case

* remove direct copy row by row

* fix checkSameRowid

* gpu unique impl draft

* unique

* update

* fix windows build

* trigger windows build

* support single rowid with multiple vals

* address comments

* check same row_ids and copy in fronted

* revise names and disable test for local kvstore
yuxiangw pushed a commit to yuxiangw/incubator-mxnet that referenced this pull request Jan 25, 2018
@ZiyueHuang ZiyueHuang deleted the comm_device branch January 30, 2018 11:30
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
3 participants