Support for TF Distributed Training - MirroredStrategy and ParameterServerStrategy (aws#290)

* pytest passing

* fix

* pass all tests

* framework-var

* hack to get tensor object

* all tests passing

* linter changes

* lint

* comment update

* close_writer ==> close_writers bug

* filter tensorname

* do not break parameter server

* tb worker resolve

* get unique tb writers

* modify check dir exists

* test fix

* verify s3 path without End of job

* parameter server training with tornasole example

* test tf device name serialization

* parse worker tests

* minor fix and documentation

* removed worker initialization from constructor

* removed worker name initialization from FileWriter

* removed tensor cache, refactored code for better abstraction

* cleaner check for hvd

* parameter-server worker based training

* test_tf_utils

* more comments

* cleanup env variable in tests

* refactor changes

* changes based on PR comments

* refactor, comments and docstrings

* refactor

* fixed parameter server bug, more comments

* fixed parameter server bug, more comments

* more documentation

* nit fix typo

* documentation

* missing del makes CI crash

* nit fixes
NihalHarish authored and rahul003 committed Nov 1, 2019
1 parent 4a18955 commit e7ed26e
Showing 29 changed files with 1,760 additions and 83 deletions.
@@ -0,0 +1,91 @@
# Horovod MNIST Example
We provide an example script `horovod_mnist_estimator.py`, a Tornasole-enabled Horovod training script
that uses the Estimator interface of TensorFlow.

This is an example of how you can log a distributed training job with Tornasole.

## Integrating Tornasole
Below we call out the changes made for Tornasole in the above script and describe them.

**Importing TornasoleTF**
```
import tornasole.tensorflow as ts
```
**Saving gradients**

We need to wrap our optimizer with `TornasoleOptimizer` and use the wrapped optimizer to minimize the loss.
This also lets us access the gradients during analysis without having to identify which of the saved tensors are the gradients.
```
opt = ts.TornasoleOptimizer(opt)
```
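
In a Horovod script, the optimizer is typically also wrapped with `hvd.DistributedOptimizer` so that gradients are averaged across workers. Below is a minimal sketch of combining the two wrappers; the base optimizer, learning-rate scaling, and wrapping order are illustrative assumptions rather than excerpts from the script:
```
import tensorflow as tf
import horovod.tensorflow as hvd
import tornasole.tensorflow as ts

hvd.init()
# Scale the learning rate by the number of workers, a common Horovod convention.
opt = tf.train.GradientDescentOptimizer(0.001 * hvd.size())
# Average gradients across all Horovod workers each step.
opt = hvd.DistributedOptimizer(opt)
# Wrap with Tornasole so gradients are identified and saved.
opt = ts.TornasoleOptimizer(opt)
```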


**Setting save interval**

You can set different save intervals for different modes by passing a dictionary as `save_config` to the hook, with a mode as the key and a `SaveConfig` object as the value. Here a single `SaveConfig` is passed, which applies to all modes; a per-mode example follows the snippet.
```
ts.TornasoleHook(...,
    save_config=ts.SaveConfig(save_interval=FLAGS.tornasole_frequency))
```
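
As described above, saving at different intervals per mode takes a dictionary keyed by mode; the interval values below are illustrative:
```
ts.TornasoleHook(...,
    save_config={ts.modes.TRAIN: ts.SaveConfig(save_interval=100),
                 ts.modes.EVAL: ts.SaveConfig(save_interval=10)})
```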
**Setting the right mode**

Notice the calls to `hook.set_mode` at various places in the code.
```
ts_hook.set_mode(ts.modes.TRAIN)
```

```
ts_hook.set_mode(ts.modes.EVAL)
```
**Passing the hook**

We need to pass this hook to the `train` and `evaluate` calls; the Estimator runs it as part of the monitored session it manages internally.
```
ts_hook = ts.TornasoleHook(...)
mnist_classifier.train(..., hooks=[...,ts_hook])
```

```
mnist_classifier.evaluate(..., hooks=[..., ts_hook])
```
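
Putting it together, the hook can be constructed directly from the script's command-line flags; a minimal sketch, assuming the hook's output-directory parameter is named `out_dir`:
```
ts_hook = ts.TornasoleHook(
    out_dir=FLAGS.tornasole_path,
    save_config=ts.SaveConfig(save_interval=FLAGS.tornasole_frequency))
```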
## Running the example
### Environment
Ensure you are in a Python environment with `tornasole_core` installed. If you followed the recommended instructions of using the Amazon Deep Learning AMI, you may want to activate the `tensorflow_p36` environment as follows.
```
source activate tensorflow_p36
```
### Tornasole Path
We recommend saving tornasole outputs on S3 by passing the
flag `--tornasole_path` in the format `s3://bucket_name/prefix`.
The commands below are shown with a local path, however, so you can
run them immediately without having to set up S3 permissions.

### Command

To run on a machine with 4 GPUs:
```
horovodrun -np 4 -H localhost:4 python horovod_mnist_estimator.py \
    --tornasole_path ~/ts_output/ts_horovod/logs \
    --steps 5000 \
    --tornasole_frequency 100 \
    --reductions False \
    --save_all True
```

To run on 4 machines with 4 GPUs each:
```
horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python horovod_mnist_estimator.py \
    --tornasole_path ~/ts_output/ts_horovod/logs \
    --steps 5000 \
    --tornasole_frequency 100 \
    --reductions False \
    --save_all True
```

### Analysis
Refer to [this page](../../../rules/README.md) for more details on analysis.

### More
Please refer to the [Tornasole TensorFlow page](../../README.md).
@@ -0,0 +1,84 @@
# MirroredStrategy MNIST Example
We provide an example script `mirrored_strategy_mnist.py`, a Tornasole-enabled TensorFlow training script
that uses MirroredStrategy to perform distributed training.

It uses the Estimator interface of TensorFlow.

This is an example of how you can log a distributed training job with Tornasole.
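
MirroredStrategy is enabled through the Estimator's `RunConfig`; below is a minimal sketch of the pattern (the model-function name `cnn_model_fn` is an assumption for illustration):
```
import tensorflow as tf

# Replicate the model on every local GPU and aggregate gradients each step.
strategy = tf.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(train_distribute=strategy)

mnist_classifier = tf.estimator.Estimator(
    model_fn=cnn_model_fn, config=config)
```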

## Integrating Tornasole
Below we call out the changes made for Tornasole in the above script and describe them.

**Importing TornasoleTF**
```
import tornasole.tensorflow as ts
```
**Saving gradients**

We need to wrap our optimizer with `TornasoleOptimizer` and use the wrapped optimizer to minimize the loss.
This also lets us access the gradients during analysis without having to identify which of the saved tensors are the gradients.
```
optimizer = ts.TornasoleOptimizer(optimizer)
```


**Setting save interval**

You can set different save intervals for different modes by passing a dictionary as `save_config` to the hook, with a mode as the key and a `SaveConfig` object as the value. Here a single `SaveConfig` is passed, which applies to all modes.
```
ts.TornasoleHook(...,
    save_config=ts.SaveConfig(save_interval=FLAGS.tornasole_frequency))
```
**Setting the right mode**

Notice the calls to `hook.set_mode` at various places in the code.
```
hook.set_mode(ts.modes.TRAIN)
```

```
hook.set_mode(ts.modes.EVAL)
```
**Passing the hook**

We need to pass this hook to the `train` and `evaluate` calls; the Estimator runs it as part of the monitored session it manages internally.
```
ts_hook = ts.TornasoleHook(...)
mnist_classifier.train(..., hooks=[ts_hook])
```

```
mnist_classifier.evaluate(..., hooks=[ts_hook])
```
## Running the example
### Environment
Ensure you are in a Python environment with `tornasole_core` installed. If you followed the recommended instructions of using the Amazon Deep Learning AMI, you may want to activate the `tensorflow_p36` environment as follows.
```
source activate tensorflow_p36
```
### Tornasole Path
We recommend saving tornasole outputs on S3 by passing the
flag `--tornasole_path` in the format `s3://bucket_name/prefix`.
The commands below are shown with a local path, however, so you can
run them immediately without having to set up S3 permissions.

### Command

To run on a machine with GPUs:
```
python mirrored_strategy_mnist.py \
    --tornasole_path ~/ts_outputs/mirrored_strategy_mnist \
    --steps 5000 \
    --tornasole_frequency 100 \
    --reductions False \
    --save_all True
```

### Analysis
Refer to [this page](../../../rules/README.md) for more details on analysis.

### More
Please refer to the [Tornasole TensorFlow page](../../README.md).
@@ -0,0 +1,146 @@
# ParameterServerStrategy MNIST Example
We provide an example script `parameter_server_mnist.py`, a Tornasole-enabled TensorFlow training script
that uses parameter servers to perform distributed training.

It uses the Estimator interface of TensorFlow.

This is an example of how you can log a distributed training job with Tornasole.
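
With `TF_CONFIG` set (see the Command section below), Estimator-based training distributes itself across workers and parameter servers without an explicit strategy object. One common pattern is `tf.estimator.train_and_evaluate`, sketched below with assumed input-function names; this is illustrative rather than an excerpt from the script:
```
import tensorflow as tf

# Each process reads its role (worker or ps) and its peers from TF_CONFIG.
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=1000)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)
tf.estimator.train_and_evaluate(mnist_classifier, train_spec, eval_spec)
```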

## Integrating Tornasole
Below we call out the changes made for Tornasole in the above script and describe them.

**Importing TornasoleTF**
```
import tornasole.tensorflow as ts
```
**Saving gradients**

We need to wrap our optimizer with `TornasoleOptimizer` and use the wrapped optimizer to minimize the loss.
This also lets us access the gradients during analysis without having to identify which of the saved tensors are the gradients.
```
optimizer = ts.TornasoleOptimizer(optimizer)
```


**Setting save interval**

You can set different save intervals for different modes by passing a dictionary as `save_config` to the hook, with a mode as the key and a `SaveConfig` object as the value. Here a single `SaveConfig` is passed, which applies to all modes.
```
ts.TornasoleHook(...,
    save_config=ts.SaveConfig(save_interval=FLAGS.tornasole_frequency))
```
**Setting the right mode**

Notice the calls to `hook.set_mode` at various places in the code.
```
hook.set_mode(ts.modes.TRAIN)
```

```
hook.set_mode(ts.modes.EVAL)
```
**Passing the hook**

We need to pass this hook to the `train` and `evaluate` calls; the Estimator runs it as part of the monitored session it manages internally.
```
ts_hook = ts.TornasoleHook(...)
mnist_classifier.train(..., hooks=[ts_hook])
```

```
mnist_classifier.evaluate(..., hooks=[ts_hook])
```
## Running the example
### Environment
Ensure you are in a Python environment with `tornasole_core` installed. If you followed the recommended instructions of using the Amazon Deep Learning AMI, you may want to activate the `tensorflow_p36` environment as follows.
```
source activate tensorflow_p36
```
### Tornasole Path
We recommend saving tornasole outputs on S3 by passing the
flag `--tornasole_path` in the format `s3://bucket_name/prefix`.
The commands below are shown with a local path, however, so you can
run them immediately without having to set up S3 permissions.

### Command

The example script performs distributed training with 2 workers and 1 parameter server by default.

The cluster config used by the demo can be overridden by setting the `TF_CONFIG` environment variable before running the script. Details can be found in the [TensorFlow distributed training guide](https://www.tensorflow.org/guide/distributed_training#setting_up_tf_config_environment_variable).


The default cluster specification used by the script is given below:

```
os.environ["TF_CONFIG"] = json.dumps(
    {
        "cluster": {"worker": [nodes[0], nodes[1]], "ps": [nodes[2]]},
        "task": {"type": FLAGS.node_type, "index": FLAGS.task_index},
    }
)
```
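
For example, with three illustrative nodes `host1:2222`, `host2:2222`, and `host3:2222`, the first worker (`--node_type worker --task_index 0`) would see a `TF_CONFIG` like:
```
{
  "cluster": {"worker": ["host1:2222", "host2:2222"], "ps": ["host3:2222"]},
  "task": {"type": "worker", "index": 0}
}
```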

The `nodes` list is populated by this snippet in the script:

```
try:
    f = open(FLAGS.hostfile)
    for line in f.readlines():
        nodes.append(line.strip())
except OSError as e:
    print(e.errno)
```

The script takes a hostfile as input, where each line corresponds to one node.

A sample hostfile can be found here: [hostfile.txt](../../../../../examples/tensorflow/scripts/distributed_training/parameter_server_training/hostfile.txt)
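
Since each line is placed directly into the cluster spec above, every entry should be a `host:port` address; an illustrative three-node hostfile (placeholder addresses) looks like:
```
host1:2222
host2:2222
host3:2222
```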

To set up the parameter server:

```
python parameter_server_mnist.py \
--hostfile hostfile.txt \
--steps 1000 \
--tornasole_path ~/ts_output/ps_training \
--tornasole_frequency 100 \
--node_type ps --task_index 0
```

To set up the first worker:

```
python parameter_server_mnist.py \
--hostfile hostfile.txt \
--steps 1000 \
--tornasole_path ~/ts_output/ps_training \
--tornasole_frequency 100 \
--node_type worker --task_index 0
```

To set up the second worker:

```
python parameter_server_mnist.py \
--hostfile hostfile.txt \
--steps 1000 \
--tornasole_path ~/ts_output/ps_training \
--tornasole_frequency 100 \
--node_type worker --task_index 1
```


Note: You can limit the GPUs visible to each worker by setting `CUDA_VISIBLE_DEVICES` (see [this Stack Overflow answer](https://stackoverflow.com/questions/39649102/how-do-i-select-which-gpu-to-run-a-job-on)):
```
export CUDA_VISIBLE_DEVICES=0,1
```


### Analysis
Refer to [this page](../../../../rules/README.md) for more details on analysis.

### More
Please refer to the [Tornasole TensorFlow page](../../../README.md).