Commit: Update readme and example script (#19)

ZeyaWang authored Jul 16, 2020
1 parent 412b450 commit 9f489bf
Showing 12 changed files with 1,061 additions and 219 deletions.
1 change: 0 additions & 1 deletion .gitignore
@@ -87,7 +87,6 @@ tags

 # project-specifc
 **/data
-**/logs*
 docs/_build/
 docs/api/
 docs/symlink*
9 changes: 5 additions & 4 deletions README.md
@@ -1,8 +1,9 @@

 <p align="center"><img src="docs/_static/img/logo.png" width=400 /></p>

-[![pipeline status](https://img.shields.io/badge/dynamic/json?url=https://jenkins.petuum.io/job/AutoDist/job/master/lastCompletedBuild/api/json&label=build&query=$.result&color=informational)](https://github.com/petuum/autodist/commits/master)
-[![coverage report](https://img.shields.io/badge/dynamic/json?url=https://jenkins.petuum.io/job/AutoDist/job/master/lastSuccessfulBuild/artifact/coverage-report/jenkinscovdata.json&label=coverage&query=$.total_coverage_pct&color=9cf)](https://github.com/petuum/autodist/commits/master)
+[![pipeline status](https://img.shields.io/badge/dynamic/json?url=https://jenkins.petuum.io/job/AutoDist/job/master/lastCompletedBuild/api/json&label=build&query=$.result&color=informational)](https://jenkins.petuum.io/job/AutoDist/job/master/)
+[![coverage report](https://img.shields.io/badge/dynamic/json?url=https://jenkins.petuum.io/job/AutoDist/job/master/lastSuccessfulBuild/artifact/coverage-report/jenkinscovdata.json&label=coverage&query=$.total_coverage_pct&color=green)](https://jenkins.petuum.io/job/AutoDist/job/master/lastSuccessfulBuild/artifact/)
+[![pypi version](https://img.shields.io/pypi/v/autodist?color=9cf)](https://pypi.org/project/autodist/)

 [Documentation](https://petuum.github.io/autodist) |
 [Examples](https://github.com/petuum/autodist/tree/master/examples/benchmark)
@@ -11,8 +12,6 @@
 AutoDist provides a user-friendly interface to distribute the training of a wide variety of deep learning models
 across many GPUs with scalability and minimal code change.

-AutoDist has been tested with TensorFlow versions 1.15 through 2.1.
-
 ## Introduction
 Different from specialized distributed ML systems, AutoDist is created to speed up a broad range of DL models with excellent all-around performance.
 AutoDist achieves this goal by:
@@ -29,6 +28,8 @@ for all-level users.

 <p float="left"><img src="docs/_static/img/Figure1.png" width=400 /><img src="docs/_static/img/Figure2.png" width=400 /></p>

+For a closer look at the performance, please refer to our [doc](https://petuum.github.io/autodist/usage/performance.html).
+
 ## Using AutoDist

 Installation:
2 changes: 1 addition & 1 deletion docs/usage/tutorials/getting-started.md
@@ -7,7 +7,7 @@ We recommended reviewing the [TensorFlow Quickstart Guide](https://www.tensorflo
 particularly the difference between eager and graph mode.
 If you can run the [Quickstart](https://www.tensorflow.org/tutorials/quickstart/advanced) properly, you can use the same environment to follow this tutorial.

-AutoDist currently supports `Python>=3.6` with `tensorflow>=1.15, <=2.1`. Install the downloaded wheel file by
+AutoDist currently supports `Python>=3.6` with `tensorflow>=1.15, <=2.2`. Install the downloaded wheel file by

 ```bash
 pip install autodist
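Once the wheel is installed, a minimal sketch of an AutoDist training script looks like the following. This is illustrative and not part of the commit: it assumes a `resource_spec.yml` describing your machines and a toy model, and follows the `AutoDist` / `autodist.scope()` / `create_distributed_session()` pattern from the AutoDist getting-started documentation.

```python
import numpy as np
import tensorflow as tf
from autodist import AutoDist

# Assumes a resource_spec.yml that lists the nodes/GPUs to train on.
autodist = AutoDist(resource_spec_file='resource_spec.yml')

with tf.Graph().as_default(), autodist.scope():
    # Build an ordinary in-graph TF model; AutoDist rewrites it for
    # distributed execution when the session is created.
    x = tf.convert_to_tensor(np.random.rand(64, 10).astype(np.float32))
    y = tf.convert_to_tensor(np.random.rand(64, 1).astype(np.float32))
    w = tf.Variable(tf.zeros([10, 1]))
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))
    train_op = tf.compat.v1.train.GradientDescentOptimizer(0.1).minimize(loss)

    sess = autodist.create_distributed_session()
    for _ in range(10):
        _, loss_val = sess.run([train_op, loss])
        print(loss_val)
```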
8 changes: 4 additions & 4 deletions examples/benchmark/README.md
@@ -9,19 +9,19 @@ The instruction for generating the tfrecord data for ImageNet can be found follo
 ```
 # You can set cnn models from vgg16, resnet101, densenet121, inceptionv3
 export CNN_MODEL=resnet101
-python ${REAL_PATH}/imagenet.py --data_dir=${REAL_PATH}/train --train_epochs=10 --cnn_model=$CNN_MODEL --autodist_strategy=$AUTODIST_STRATEGY
-# ${REAL_PATH} is the real path you place the code and dataset
+python ${REAL_SCRIPT_PATH}/imagenet.py --data_dir=${REAL_DATA_PATH}/train --train_epochs=10 --cnn_model=$CNN_MODEL --autodist_strategy=$AUTODIST_STRATEGY
+# ${REAL_SCRIPT_PATH} and ${REAL_DATA_PATH} are the real paths you place the code and dataset
 ```

 #### Bidirectional Encoder Representations from Transformers (BERT)
 The instruction for generating the training data and setting up the pre-trained model with the config file can be found following [this link](https://github.com/tensorflow/models/tree/master/official/nlp/bert).
 ```
-python ${REAL_PATH}/bert.py -input_files=${REAL_PATH}/sample_data_tfrecord/*.tfrecord --bert_config_file=${REAL_PATH}/uncased_L-24_H-1024_A-16/bert_config --num_train_epochs=1 --learning_rate=5e-5 --steps_per_loop=20 --autodist_strategy=$AUTODIST_STRATEGY
+python ${REAL_SCRIPT_PATH}/bert.py -input_files=${REAL_DATA_PATH}/sample_data_tfrecord/*.tfrecord --bert_config_file=${REAL_DATA_PATH}/uncased_L-24_H-1024_A-16/bert_config --num_train_epochs=1 --learning_rate=5e-5 --steps_per_loop=20 --autodist_strategy=$AUTODIST_STRATEGY
 ```

 #### Neural Collaborative Filtering (NCF)
 The instruction for generating the training data can be found following [this link](https://github.com/tensorflow/models/tree/master/official/recommendation).
 ```
-python ${REAL_PATH}/ncf.py --default_data_dir=${REAL_PATH}/movielens --autodist_strategy=$AUTODIST_STRATEGY
+python ${REAL_SCRIPT_PATH}/ncf.py --default_data_dir=${REAL_DATA_PATH}/movielens --autodist_strategy=$AUTODIST_STRATEGY
 ```
17 changes: 3 additions & 14 deletions examples/benchmark/bert.py
@@ -31,7 +31,6 @@
 from utils.logs import logger
 from utils.misc import keras_utils

-from utils import optimization
 from utils import bert_modeling as modeling
 from utils import bert_models
 from utils import common_flags
@@ -63,8 +62,6 @@
 flags.DEFINE_integer('chunk_size', 256, 'The chunk size for training.')
 flags.DEFINE_integer('num_steps_per_epoch', 1000,
                      'Total number of training steps to run per epoch.')
-flags.DEFINE_float('warmup_steps', 10,
-                   'Warmup steps for Adam weight decay optimizer.')
 flags.DEFINE_string(
     name='autodist_strategy',
     default='PS',
@@ -73,10 +70,7 @@
     name='autodist_patch_tf',
     default=True,
     help='AUTODIST_PATCH_TF')
-flags.DEFINE_string(
-    name='optimizer',
-    default='AdamDecay',
-    help='the optimizer to be chosen')
+

 flags.DEFINE_boolean(name='proxy', default=True, help='turn on off the proxy')

@@ -122,7 +116,6 @@ def run_customized_training(strategy,
                             steps_per_loop,
                             epochs,
                             initial_lr,
-                            warmup_steps,
                             input_files,
                             train_batch_size):
   """Run BERT pretrain model training using low-level API."""
@@ -140,11 +133,8 @@ def _get_pretrain_model():
     """Gets a pretraining model."""
     pretrain_model, core_model = bert_models.pretrain_model(
         bert_config, max_seq_length, max_predictions_per_seq)
-    if FLAGS.optimizer == 'AdamDecay':
-      pretrain_model.optimizer = optimization.create_optimizer(
-          initial_lr, steps_per_epoch * epochs, warmup_steps)
-    else:
-      pretrain_model.optimizer = tf.optimizers.Adam(lr=initial_lr)
+
+    pretrain_model.optimizer = tf.optimizers.Adam(lr=initial_lr)
     if FLAGS.fp16_implementation == 'graph_rewrite':
       pretrain_model.optimizer = tf.train.experimental.enable_mixed_precision_graph_rewrite(
           pretrain_model.optimizer)
@@ -190,7 +180,6 @@ def run_bert_pretrain(strategy, gpu_num=1, node_num=1):
       FLAGS.steps_per_loop,
       FLAGS.num_train_epochs,
       FLAGS.learning_rate,
-      FLAGS.warmup_steps,
      FLAGS.input_files,
      FLAGS.train_batch_size * gpu_num * node_num)
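Net effect of the bert.py changes: the configurable `AdamDecay` optimizer path (and its warmup-steps flag) is removed, so pretraining always uses stock Adam. A sketch of the simplified model setup after this commit, reconstructed from the hunks above; the trailing `return` is an assumption, since it is not shown in the diff.

```python
def _get_pretrain_model():
  """Gets a pretraining model (simplified path after this commit)."""
  pretrain_model, core_model = bert_models.pretrain_model(
      bert_config, max_seq_length, max_predictions_per_seq)
  # The AdamDecay/warmup branch is gone; stock Adam is always used.
  pretrain_model.optimizer = tf.optimizers.Adam(lr=initial_lr)
  if FLAGS.fp16_implementation == 'graph_rewrite':
    pretrain_model.optimizer = (
        tf.train.experimental.enable_mixed_precision_graph_rewrite(
            pretrain_model.optimizer))
  return pretrain_model, core_model  # assumed return, not shown in the diff
```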
34 changes: 34 additions & 0 deletions examples/benchmark/utils/logs/cloud_lib.py
@@ -0,0 +1,34 @@
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

"""Utilities that interact with cloud services."""

import requests

GCP_METADATA_URL = "http://metadata/computeMetadata/v1/instance/hostname"
GCP_METADATA_HEADER = {"Metadata-Flavor": "Google"}


def on_gcp():
  """Detect whether the current running environment is on GCP."""
  try:
    # Time out after 5 seconds, in case the test environment has
    # connectivity issues. There is no default timeout, so without one the
    # request might block forever.
    response = requests.get(
        GCP_METADATA_URL, headers=GCP_METADATA_HEADER, timeout=5)
    return response.status_code == 200
  except requests.exceptions.RequestException:
    return False
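A short usage sketch for `on_gcp()` (hypothetical caller, not part of this commit): the function just returns a bool, so it can tag benchmark runs with the environment they ran in.

```python
from utils.logs import cloud_lib

# Record the runtime environment in benchmark metadata, for example.
run_info = {"environment": "gcp" if cloud_lib.on_gcp() else "local"}
print(run_info)
```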
130 changes: 130 additions & 0 deletions examples/benchmark/utils/logs/hooks.py
@@ -0,0 +1,130 @@
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

"""Hook that counts examples per second every N steps or seconds."""


from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf  # pylint: disable=g-bad-import-order

from utils.logs import logger


class ExamplesPerSecondHook(tf.estimator.SessionRunHook):
  """Hook to print out examples per second.

  Total time is tracked and then divided by the total number of steps
  to get the average step time, and then batch_size is used to determine
  the running average of examples per second. The examples per second for
  the most recent interval is also logged.
  """

  def __init__(self,
               batch_size,
               every_n_steps=None,
               every_n_secs=None,
               warm_steps=0,
               metric_logger=None):
    """Initializer for ExamplesPerSecondHook.

    Args:
      batch_size: Total batch size across all workers used to calculate
        examples/second from global time.
      every_n_steps: Log stats every n steps.
      every_n_secs: Log stats every n seconds. Exactly one of
        `every_n_steps` or `every_n_secs` should be set.
      warm_steps: The number of steps to skip before logging and running
        average calculation. warm_steps refers to global steps across all
        workers, not steps on each worker.
      metric_logger: instance of `BenchmarkLogger`, the benchmark logger that
        the hook should use to write the log. If None, BaseBenchmarkLogger
        will be used.

    Raises:
      ValueError: if neither `every_n_steps` nor `every_n_secs` is set, or
        if both are set.
    """

    if (every_n_steps is None) == (every_n_secs is None):
      raise ValueError("exactly one of every_n_steps"
                       " and every_n_secs should be provided.")

    self._logger = metric_logger or logger.BaseBenchmarkLogger()

    self._timer = tf.estimator.SecondOrStepTimer(
        every_steps=every_n_steps, every_secs=every_n_secs)

    self._step_train_time = 0
    self._total_steps = 0
    self._batch_size = batch_size
    self._warm_steps = warm_steps
    # List of examples per second logged every_n_steps.
    self.current_examples_per_sec_list = []

  def begin(self):
    """Called once before using the session to check global step."""
    self._global_step_tensor = tf.compat.v1.train.get_global_step()
    if self._global_step_tensor is None:
      raise RuntimeError(
          "Global step should be created to use StepCounterHook.")

  def before_run(self, run_context):  # pylint: disable=unused-argument
    """Called before each call to run().

    Args:
      run_context: A SessionRunContext object.

    Returns:
      A SessionRunArgs object requesting the current global step.
    """
    return tf.estimator.SessionRunArgs(self._global_step_tensor)

  def after_run(self, run_context, run_values):  # pylint: disable=unused-argument
    """Called after each call to run().

    Args:
      run_context: A SessionRunContext object.
      run_values: A SessionRunValues object.
    """
    global_step = run_values.results

    if self._timer.should_trigger_for_step(
        global_step) and global_step > self._warm_steps:
      elapsed_time, elapsed_steps = self._timer.update_last_triggered_step(
          global_step)
      if elapsed_time is not None:
        self._step_train_time += elapsed_time
        self._total_steps += elapsed_steps

        # Average examples per second is based on the total (cumulative)
        # training steps and training time so far.
        average_examples_per_sec = self._batch_size * (
            self._total_steps / self._step_train_time)
        # Current examples per second is based on the elapsed training steps
        # and training time in the most recent interval.
        current_examples_per_sec = self._batch_size * (
            elapsed_steps / elapsed_time)
        # Log entries to be read from the hook during or after the run.
        self.current_examples_per_sec_list.append(current_examples_per_sec)
        self._logger.log_metric(
            "average_examples_per_sec", average_examples_per_sec,
            global_step=global_step)

        self._logger.log_metric(
            "current_examples_per_sec", current_examples_per_sec,
            global_step=global_step)
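For context, a sketch of how this hook is typically wired into a training loop. This is illustrative and not part of the commit: the batch size, logging interval, warm-up steps, and the stand-in train op are all assumptions. The hook requires a global step tensor, which `get_or_create_global_step` provides.

```python
import tensorflow as tf

from utils.logs.hooks import ExamplesPerSecondHook

# Illustrative values: global batch of 256, log every 100 steps,
# skip the first 10 warm-up steps.
hook = ExamplesPerSecondHook(batch_size=256, every_n_steps=100, warm_steps=10)

with tf.Graph().as_default():
  global_step = tf.compat.v1.train.get_or_create_global_step()
  train_op = tf.compat.v1.assign_add(global_step, 1)  # stand-in train op
  with tf.compat.v1.train.MonitoredTrainingSession(hooks=[hook]) as sess:
    for _ in range(1000):
      sess.run(train_op)

# Per-interval throughput readings collected by the hook:
print(hook.current_examples_per_sec_list)
```

Note the two metrics it logs: `average_examples_per_sec` uses cumulative steps over cumulative time (batch_size * total_steps / total_time), while `current_examples_per_sec` uses only the most recent interval, so it reacts faster to throughput changes.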