
Add model-parallel MNIST example #98

Merged: 53 commits into chainer:master on Aug 4, 2017

Conversation

@levelfour (Contributor) commented Jul 10, 2017

This PR adds user-friendly interfaces for implementing model-parallel neural nets, along with a model-parallel MNIST example.

New Features

MultiNodeChainGroup (chainermn.MultiNodeChainGroup)

This variant of chainer.ChainList represents multiple connected components of the entire computational graph.
In multi-node computation, the computational graph often becomes disconnected.
Each connected component is represented by a chainer.Chain, and MultiNodeChainGroup combines them.
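
For illustration, a rough sketch of the intended usage (the MLP0/MLP1 components, unit sizes, and class names below are placeholders, not code from this PR):

```
import chainer
import chainer.functions as F
import chainer.links as L
import chainermn


class MLP0(chainer.Chain):
    # First connected component: runs on rank 0 and owns the actual input.
    def __init__(self, n_units):
        super(MLP0, self).__init__(l1=L.Linear(784, n_units))

    def __call__(self, x):
        return F.relu(self.l1(x))


class MLP1(chainer.Chain):
    # Second connected component: runs on rank 1 and produces the output.
    def __init__(self, n_units, n_out):
        super(MLP1, self).__init__(l2=L.Linear(n_units, n_out))

    def __call__(self, h):
        return self.l2(h)


class ModelOnRank0(chainermn.MultiNodeChainGroup):
    def __init__(self, comm, n_units):
        super(ModelOnRank0, self).__init__(comm=comm)
        # rank_in=None: takes the real input; rank_out=1: sends its output to rank 1.
        self.add_link(MLP0(n_units), rank_in=None, rank_out=1)


class ModelOnRank1(chainermn.MultiNodeChainGroup):
    def __init__(self, comm, n_units, n_out):
        super(ModelOnRank1, self).__init__(comm=comm)
        # rank_in=0: receives its input from rank 0; rank_out=None: final output.
        self.add_link(MLP1(n_units, n_out), rank_in=0, rank_out=None)
```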

new utility function chainermn.functions.pseudo_connect

This function is used when we want a "delegate_variable" to imitate another variable.
This kind of pathological situation arises in multi-node environments.
Please see the documentation of this function for details.
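
For illustration, a minimal sketch of the situation it addresses (the model below is hypothetical, and the exact pseudo_connect signature should be checked against its documentation):

```
import chainer
import chainer.functions as F
import chainer.links as L
import chainermn.functions


class BranchingModel(chainer.Chain):
    """Hypothetical chain that sends an activation to rank 1 and also computes
    a local output, so the local graph has two loose ends."""

    def __init__(self, comm, n_units, n_out):
        super(BranchingModel, self).__init__(
            l1=L.Linear(None, n_units),
            l2=L.Linear(n_units, n_out))
        self.comm = comm

    def __call__(self, x):
        h = F.relu(self.l1(x))
        # send() returns a "delegate variable" carrying no data; its backward()
        # waits for gradients coming back from the destination process.
        delegate_variable = chainermn.functions.send(h, self.comm, 1)
        y = self.l2(h)
        # pseudo_connect() makes the returned variable imitate `y` in the
        # forward pass while keeping `delegate_variable` on the backward path,
        # so one backward() call covers both loose ends.
        return chainermn.functions.pseudo_connect(delegate_variable, y)
```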

empty dataset (chainermn.datasets.get_empty_dataset)

It is used for models that have no actual inputs and instead receive their inputs from another machine.
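
For illustration, a rough sketch of where this fits (a hypothetical snippet, not taken from the example; the batch size and the use of MNIST here are arbitrary):

```
import chainer
import chainermn
import chainermn.datasets

comm = chainermn.create_communicator('naive')  # CPU-only communicator
train, _ = chainer.datasets.get_mnist()

if comm.rank == 0:
    # Rank 0 feeds the real images into the first part of the model.
    train_iter = chainer.iterators.SerialIterator(train, 100)
else:
    # Other ranks receive their inputs over MPI, so their iterator only needs
    # to yield the right number of empty examples per epoch.
    empty_train = chainermn.datasets.get_empty_dataset(train)
    train_iter = chainer.iterators.SerialIterator(empty_train, 100)
```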

@levelfour changed the title from "[WIP] Add model-parallel MNIST example" to "Add model-parallel MNIST example" on Jul 28, 2017
README.md Outdated
@@ -38,7 +38,7 @@ Please refer to the [installation guide](https://chainermn.readthedocs.io/en/lat
You can invoke MNIST example with four workers by the following command:

```
mpiexec -n 4 python examples/mnist/train_mnist.py
mpiexec -n 4 python examples/mnist/train_mnist_data_parallel.py

Member:

I'm afraid we should not change this line.
The data-parallel model is an advanced feature, and users should first see the basic MNIST example.

Contributor Author:

I thought it would be confusing if train_mnist.py and train_mnist_model_parallel.py existed at the same time, so I renamed the original train_mnist.py to train_mnist_data_parallel.py to state explicitly that this example is data-parallel.

import chainermn


class SimpleModelInst(chainer.Chain):

Member:

Can we think of a better name? Inst sounds like an instance of a class.

Contributor Author:

Fixed. Thank you.

#!/usr/bin/env python
# coding: utf-8

import argparse

Member:

According to PEP8,

> Imports should be grouped in the following order:
>
> 1. standard library imports
> 2. related third party imports
> 3. local application/library specific imports
>
> You should put a blank line between each group of imports.

https://www.python.org/dev/peps/pep-0008/#imports

So we need a blank line after argparse.
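
Concretely, for this example script the grouping would look something like this (module list here is only illustrative):

```
# standard library
import argparse

# third party (Chainer and ChainerMN; whether these need a further blank
# line between them is discussed in the next comment)
import chainer
import chainer.functions as F
import chainer.links as L
import chainermn
```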

Member:

I am not sure whether we need a blank line between the Chainer and ChainerMN imports.

@@ -0,0 +1,29 @@
import numpy as np

Member:

unittest should come first because it is a standard library module, while numpy is a third-party library.

@@ -6,6 +6,7 @@
import chainer.testing.attr
import chainermn
import chainermn.functions
import copy

Member:

PEP8.

err = model()
err.backward()

def test_cross_model(self):

Member:

crossing_model ?

Contributor Author:

Fixed.

if backward_pointer is not None and _x.creator is not None:
_x.creator.rank = -1

x = _x if x is None else x + _x

Contributor:

I don't think it is a reasonable design to sum up received arrays when receiving from multiple workers. I think it would be more natural and useful to hand them to f as separate parameters.

import chainer


def get_empty_dataset(dataset):

Contributor:

How about changing its name to create_empty_dataset, which is consistent with create_multi_node_optimizer and create_multi_node_evaluator?

Contributor Author:

Fixed.

import chainermn.functions.point_to_point_communication


class MultiNodeChainGroup(chainer.ChainList):

Contributor:

Do you have any reason for this name? I don't have a strong opinion, but how about renaming it to MultiNodeChainList, which is consistent with chainer.ChainList?

Contributor Author:

Fixed.

import chainermn


class SimpleModelInst(chainer.Chain):

Contributor:

I didn't understand what Inst means. How about SimpleModelSub or something like that?

self.add_link(MLP0b(comm), rank_in=1, rank_out=None)


class MLP1inst(chainer.Chain):

Contributor:

ditto for inst

Contributor Author:

Fixed.

``chainermn.functions.send()``.

Args:
dataset(chainer.datasets.TupleDataset): Dataset to convert.

Contributor:

dataset does not need to be TupleDataset. Chainer accepts many kinds of datasets.

Contributor:

~chainer.datasets.TransformDataset:
Dataset consists of only patterns in the original one.
"""
return chainer.datasets.TransformDataset(dataset, lambda data: ())

Contributor:

Probably just `[()] * len(dataset)` is enough? (TransformDataset keeps a reference to the original dataset, so it will consume unnecessary memory.)
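
For comparison, a sketch of the two variants being discussed (the second function name is made up for this comparison):

```
import chainer


def create_empty_dataset(dataset):
    # Variant in the PR: wraps the original dataset, so the wrapper keeps the
    # original data alive even though every transformed example is ().
    return chainer.datasets.TransformDataset(dataset, lambda data: ())


def create_empty_dataset_as_list(dataset):
    # Suggested alternative: only len(dataset) is used, so the original data
    # is no longer referenced by the result.
    return [()] * len(dataset)
```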



class PseudoConnect(chainer.Function):
"""Connect a variable with delegating variable."""

Member:

"Connect two variables with a delegating variable"
or
"Connect a variable to a delegating variable" ?

"""Receive elements from target process.

This function returns data received from target process. If ``backward()``
is invoked, it will try to send gradients to the target process.

.. note::
If you define non-connected computational graph on one machine,

Member:

machine -> process
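
For context, a minimal sketch of the send/recv pairing this docstring describes, runnable with something like `mpiexec -n 2 python example.py` (the recv() argument order here is assumed, so check the final API):

```
import numpy as np

import chainer
import chainer.functions as F
import chainermn
import chainermn.functions

comm = chainermn.create_communicator('naive')  # CPU-only communicator

if comm.rank == 0:
    x = chainer.Variable(np.ones((1, 3), dtype=np.float32))
    # Forward: ship x to rank 1. The returned delegate variable carries no
    # data; calling backward() on it waits for the gradient from rank 1.
    delegate = chainermn.functions.send(x, comm, 1)
    delegate.backward()
elif comm.rank == 1:
    # Forward: receive x from rank 0. Backward sends grad(x) back to rank 0.
    x = chainermn.functions.recv(comm, 0)
    loss = F.sum(x)
    loss.backward()
```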

@@ -100,3 +102,31 @@ def test_communication(self):
err = chainermn.functions.send(
y, self.communicator, self.rank_send)
err.backward()

def test_retain(self):
if self.communicator.rank == 0:

Member:

Does this test work if more than 2 processes are invoked?
Should it be skipped if communicator.size > 2?

Contributor Author:

This test also works with more than 2 processes. It emulates test_cycle_model in test_link.py. FYI, we run the tests on 3 processes in the latest commit.

Member:

OK!

@iwiwi (Contributor) commented Aug 4, 2017

LGTM

@keisukefukuda (Member)

LGTM!

@keisukefukuda merged commit 1fa4021 into chainer:master on Aug 4, 2017
@levelfour deleted the model-parallel-mnist branch on August 5, 2017
@iwiwi added this to the v1.0.0 milestone on Aug 31, 2017