Commit: Update readme and example script (#19)

ZeyaWang authored Jul 16, 2020
1 parent 412b450 commit 9f489bf
Showing 12 changed files with 1,061 additions and 219 deletions.
1 change: 0 additions & 1 deletion .gitignore
@@ -87,7 +87,6 @@ tags

 # project-specifc
 **/data
-**/logs*
 docs/_build/
 docs/api/
 docs/symlink*
9 changes: 5 additions & 4 deletions README.md
@@ -1,8 +1,9 @@

 <p align="center"><img src="docs/_static/img/logo.png" width=400 /></p>

-[![pipeline status](https://img.shields.io/badge/dynamic/json?url=https://jenkins.petuum.io/job/AutoDist/job/master/lastCompletedBuild/api/json&label=build&query=$.result&color=informational)](https://github.com/petuum/autodist/commits/master)
-[![coverage report](https://img.shields.io/badge/dynamic/json?url=https://jenkins.petuum.io/job/AutoDist/job/master/lastSuccessfulBuild/artifact/coverage-report/jenkinscovdata.json&label=coverage&query=$.total_coverage_pct&color=9cf)](https://github.com/petuum/autodist/commits/master)
+[![pipeline status](https://img.shields.io/badge/dynamic/json?url=https://jenkins.petuum.io/job/AutoDist/job/master/lastCompletedBuild/api/json&label=build&query=$.result&color=informational)](https://jenkins.petuum.io/job/AutoDist/job/master/)
+[![coverage report](https://img.shields.io/badge/dynamic/json?url=https://jenkins.petuum.io/job/AutoDist/job/master/lastSuccessfulBuild/artifact/coverage-report/jenkinscovdata.json&label=coverage&query=$.total_coverage_pct&color=green)](https://jenkins.petuum.io/job/AutoDist/job/master/lastSuccessfulBuild/artifact/)
+[![pypi version](https://img.shields.io/pypi/v/autodist?color=9cf)](https://pypi.org/project/autodist/)

 [Documentation](https://petuum.github.io/autodist) |
 [Examples](https://github.com/petuum/autodist/tree/master/examples/benchmark)
@@ -11,8 +12,6 @@
 AutoDist provides a user-friendly interface to distribute the training of a wide variety of deep learning models
 across many GPUs with scalability and minimal code change.

-AutoDist has been tested with TensorFlow versions 1.15 through 2.1.
-
 ## Introduction
 Different from specialized distributed ML systems, AutoDist is created to speed up a broad range of DL models with excellent all-around performance.
 AutoDist achieves this goal by:
@@ -29,6 +28,8 @@ for all-level users.

 <p float="left"><img src="docs/_static/img/Figure1.png" width=400 /><img src="docs/_static/img/Figure2.png" width=400 /></p>

+For a closer look at the performance, please refer to our [doc](https://petuum.github.io/autodist/usage/performance.html).
+
 ## Using AutoDist

 Installation:
2 changes: 1 addition & 1 deletion docs/usage/tutorials/getting-started.md
@@ -7,7 +7,7 @@ We recommended reviewing the [TensorFlow Quickstart Guide](https://www.tensorflo
 particularly the difference between eager and graph mode.
 If you can run the [Quickstart](https://www.tensorflow.org/tutorials/quickstart/advanced) properly, you can use the same environment to follow this tutorial.

-AutoDist currently supports `Python>=3.6` with `tensorflow>=1.15, <=2.1`. Install the downloaded wheel file by
+AutoDist currently supports `Python>=3.6` with `tensorflow>=1.15, <=2.2`. Install the downloaded wheel file by

 ```bash
 pip install autodist
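Once the wheel is installed, a minimal sketch of an AutoDist training script looks like the following. This is illustrative and not part of the commit: it assumes a `resource_spec.yml` describing your machines and a toy model, and follows the `AutoDist` / `autodist.scope()` / `create_distributed_session()` pattern from the AutoDist getting-started documentation.

```python
import numpy as np
import tensorflow as tf
from autodist import AutoDist

# Assumes a resource_spec.yml that lists the nodes/GPUs to train on.
autodist = AutoDist(resource_spec_file='resource_spec.yml')

with tf.Graph().as_default(), autodist.scope():
    # Build an ordinary in-graph TF model; AutoDist rewrites it for
    # distributed execution when the session is created.
    x = tf.convert_to_tensor(np.random.rand(64, 10).astype(np.float32))
    y = tf.convert_to_tensor(np.random.rand(64, 1).astype(np.float32))
    w = tf.Variable(tf.zeros([10, 1]))
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))
    train_op = tf.compat.v1.train.GradientDescentOptimizer(0.1).minimize(loss)

    sess = autodist.create_distributed_session()
    for _ in range(10):
        _, loss_val = sess.run([train_op, loss])
        print(loss_val)
```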
8 changes: 4 additions & 4 deletions examples/benchmark/README.md
@@ -9,19 +9,19 @@ The instruction for generating the tfrecord data for ImageNet can be found follo
 ```
 # You can set cnn models from vgg16, resnet101, densenet121, inceptionv3
 export CNN_MODEL=resnet101
-python ${REAL_PATH}/imagenet.py --data_dir=${REAL_PATH}/train --train_epochs=10 --cnn_model=$CNN_MODEL --autodist_strategy=$AUTODIST_STRATEGY
-# ${REAL_PATH} is the real path you place the code and dataset
+python ${REAL_SCRIPT_PATH}/imagenet.py --data_dir=${REAL_DATA_PATH}/train --train_epochs=10 --cnn_model=$CNN_MODEL --autodist_strategy=$AUTODIST_STRATEGY
+# ${REAL_SCRIPT_PATH} and ${REAL_DATA_PATH} are the real paths you place the code and dataset
 ```

 #### Bidirectional Encoder Representations from Transformers (BERT)
 The instruction for generating the training data and setting up the pre-trained model with the config file can be found following [this link](https://github.com/tensorflow/models/tree/master/official/nlp/bert).
 ```
-python ${REAL_PATH}/bert.py -input_files=${REAL_PATH}/sample_data_tfrecord/*.tfrecord --bert_config_file=${REAL_PATH}/uncased_L-24_H-1024_A-16/bert_config --num_train_epochs=1 --learning_rate=5e-5 --steps_per_loop=20 --autodist_strategy=$AUTODIST_STRATEGY
+python ${REAL_SCRIPT_PATH}/bert.py -input_files=${REAL_DATA_PATH}/sample_data_tfrecord/*.tfrecord --bert_config_file=${REAL_DATA_PATH}/uncased_L-24_H-1024_A-16/bert_config --num_train_epochs=1 --learning_rate=5e-5 --steps_per_loop=20 --autodist_strategy=$AUTODIST_STRATEGY
 ```

 #### Neural Collaborative Filtering (NCF)
 The instruction for generating the training data can be found following [this link](https://github.com/tensorflow/models/tree/master/official/recommendation).
 ```
-python ${REAL_PATH}/ncf.py --default_data_dir=${REAL_PATH}/movielens --autodist_strategy=$AUTODIST_STRATEGY
+python ${REAL_SCRIPT_PATH}/ncf.py --default_data_dir=${REAL_DATA_PATH}/movielens --autodist_strategy=$AUTODIST_STRATEGY
 ```
17 changes: 3 additions & 14 deletions examples/benchmark/bert.py
@@ -31,7 +31,6 @@
 from utils.logs import logger
 from utils.misc import keras_utils

-from utils import optimization
 from utils import bert_modeling as modeling
 from utils import bert_models
 from utils import common_flags
@@ -63,8 +62,6 @@
 flags.DEFINE_integer('chunk_size', 256, 'The chunk size for training.')
 flags.DEFINE_integer('num_steps_per_epoch', 1000,
                      'Total number of training steps to run per epoch.')
-flags.DEFINE_float('warmup_steps', 10,
-                   'Warmup steps for Adam weight decay optimizer.')
 flags.DEFINE_string(
     name='autodist_strategy',
     default='PS',
@@ -73,10 +70,7 @@
     name='autodist_patch_tf',
     default=True,
     help='AUTODIST_PATCH_TF')
-flags.DEFINE_string(
-    name='optimizer',
-    default='AdamDecay',
-    help='the optimizer to be chosen')
+

 flags.DEFINE_boolean(name='proxy', default=True, help='turn on off the proxy')

@@ -122,7 +116,6 @@ def run_customized_training(strategy,
                             steps_per_loop,
                             epochs,
                             initial_lr,
-                            warmup_steps,
                             input_files,
                             train_batch_size):
   """Run BERT pretrain model training using low-level API."""
@@ -140,11 +133,8 @@ def _get_pretrain_model():
     """Gets a pretraining model."""
     pretrain_model, core_model = bert_models.pretrain_model(
         bert_config, max_seq_length, max_predictions_per_seq)
-    if FLAGS.optimizer == 'AdamDecay':
-      pretrain_model.optimizer = optimization.create_optimizer(
-          initial_lr, steps_per_epoch * epochs, warmup_steps)
-    else:
-      pretrain_model.optimizer = tf.optimizers.Adam(lr=initial_lr)
+
+    pretrain_model.optimizer = tf.optimizers.Adam(lr=initial_lr)
     if FLAGS.fp16_implementation == 'graph_rewrite':
       pretrain_model.optimizer = tf.train.experimental.enable_mixed_precision_graph_rewrite(
           pretrain_model.optimizer)
@@ -190,7 +180,6 @@ def run_bert_pretrain(strategy, gpu_num=1, node_num=1):
       FLAGS.steps_per_loop,
       FLAGS.num_train_epochs,
       FLAGS.learning_rate,
-      FLAGS.warmup_steps,
      FLAGS.input_files,
      FLAGS.train_batch_size * gpu_num * node_num)
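Net effect of the bert.py changes: the configurable `AdamDecay` optimizer path (and its warmup-steps flag) is removed, so pretraining always uses stock Adam. A sketch of the simplified model setup after this commit, reconstructed from the hunks above; the trailing `return` is an assumption, since it is not shown in the diff.

```python
def _get_pretrain_model():
  """Gets a pretraining model (simplified path after this commit)."""
  pretrain_model, core_model = bert_models.pretrain_model(
      bert_config, max_seq_length, max_predictions_per_seq)
  # The AdamDecay/warmup branch is gone; stock Adam is always used.
  pretrain_model.optimizer = tf.optimizers.Adam(lr=initial_lr)
  if FLAGS.fp16_implementation == 'graph_rewrite':
    pretrain_model.optimizer = (
        tf.train.experimental.enable_mixed_precision_graph_rewrite(
            pretrain_model.optimizer))
  return pretrain_model, core_model  # assumed return, not shown in the diff
```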
34 changes: 34 additions & 0 deletions examples/benchmark/utils/logs/cloud_lib.py
@@ -0,0 +1,34 @@
# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

"""Utilities that interact with cloud services."""

import requests

GCP_METADATA_URL = "http://metadata/computeMetadata/v1/instance/hostname"
GCP_METADATA_HEADER = {"Metadata-Flavor": "Google"}


def on_gcp():
  """Detect whether the current running environment is on GCP."""
  try:
    # Time out after 5 seconds, in case the test environment has
    # connectivity issues. There is no default timeout, so without one the
    # request might block forever.
    response = requests.get(
        GCP_METADATA_URL, headers=GCP_METADATA_HEADER, timeout=5)
    return response.status_code == 200
  except requests.exceptions.RequestException:
    return False
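A short usage sketch for `on_gcp()` (hypothetical caller, not part of this commit): the function just returns a bool, so it can tag benchmark runs with the environment they ran in.

```python
from utils.logs import cloud_lib

# Record the runtime environment in benchmark metadata, for example.
run_info = {"environment": "gcp" if cloud_lib.on_gcp() else "local"}
print(run_info)
```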
130 changes: 130 additions & 0 deletions examples/benchmark/utils/logs/hooks.py
@@ -0,0 +1,130 @@
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

"""Hook that counts examples per second every N steps or seconds."""


from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf  # pylint: disable=g-bad-import-order

from utils.logs import logger


class ExamplesPerSecondHook(tf.estimator.SessionRunHook):
  """Hook to print out examples per second.

  Total time is tracked and then divided by the total number of steps
  to get the average step time, and then batch_size is used to determine
  the running average of examples per second. The examples per second for
  the most recent interval is also logged.
  """

  def __init__(self,
               batch_size,
               every_n_steps=None,
               every_n_secs=None,
               warm_steps=0,
               metric_logger=None):
    """Initializer for ExamplesPerSecondHook.

    Args:
      batch_size: Total batch size across all workers used to calculate
        examples/second from global time.
      every_n_steps: Log stats every n steps.
      every_n_secs: Log stats every n seconds. Exactly one of
        `every_n_steps` or `every_n_secs` should be set.
      warm_steps: The number of steps to skip before logging and running
        average calculation. warm_steps refers to global steps across all
        workers, not steps on each worker.
      metric_logger: instance of `BenchmarkLogger`, the benchmark logger that
        the hook should use to write the log. If None, BaseBenchmarkLogger
        will be used.

    Raises:
      ValueError: if neither `every_n_steps` nor `every_n_secs` is set, or
        if both are set.
    """

    if (every_n_steps is None) == (every_n_secs is None):
      raise ValueError("exactly one of every_n_steps"
                       " and every_n_secs should be provided.")

    self._logger = metric_logger or logger.BaseBenchmarkLogger()

    self._timer = tf.estimator.SecondOrStepTimer(
        every_steps=every_n_steps, every_secs=every_n_secs)

    self._step_train_time = 0
    self._total_steps = 0
    self._batch_size = batch_size
    self._warm_steps = warm_steps
    # List of examples per second logged every_n_steps.
    self.current_examples_per_sec_list = []

  def begin(self):
    """Called once before using the session to check global step."""
    self._global_step_tensor = tf.compat.v1.train.get_global_step()
    if self._global_step_tensor is None:
      raise RuntimeError(
          "Global step should be created to use StepCounterHook.")

  def before_run(self, run_context):  # pylint: disable=unused-argument
    """Called before each call to run().

    Args:
      run_context: A SessionRunContext object.

    Returns:
      A SessionRunArgs object requesting the current global step.
    """
    return tf.estimator.SessionRunArgs(self._global_step_tensor)

  def after_run(self, run_context, run_values):  # pylint: disable=unused-argument
    """Called after each call to run().

    Args:
      run_context: A SessionRunContext object.
      run_values: A SessionRunValues object.
    """
    global_step = run_values.results

    if self._timer.should_trigger_for_step(
        global_step) and global_step > self._warm_steps:
      elapsed_time, elapsed_steps = self._timer.update_last_triggered_step(
          global_step)
      if elapsed_time is not None:
        self._step_train_time += elapsed_time
        self._total_steps += elapsed_steps

        # Average examples per second is based on the total (cumulative)
        # training steps and training time so far.
        average_examples_per_sec = self._batch_size * (
            self._total_steps / self._step_train_time)
        # Current examples per second is based on the elapsed training steps
        # and training time in the most recent interval.
        current_examples_per_sec = self._batch_size * (
            elapsed_steps / elapsed_time)
        # Log entries to be read from the hook during or after the run.
        self.current_examples_per_sec_list.append(current_examples_per_sec)
        self._logger.log_metric(
            "average_examples_per_sec", average_examples_per_sec,
            global_step=global_step)

        self._logger.log_metric(
            "current_examples_per_sec", current_examples_per_sec,
            global_step=global_step)
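For context, a sketch of how this hook is typically wired into a training loop. This is illustrative and not part of the commit: the batch size, logging interval, warm-up steps, and the stand-in train op are all assumptions. The hook requires a global step tensor, which `get_or_create_global_step` provides.

```python
import tensorflow as tf

from utils.logs.hooks import ExamplesPerSecondHook

# Illustrative values: global batch of 256, log every 100 steps,
# skip the first 10 warm-up steps.
hook = ExamplesPerSecondHook(batch_size=256, every_n_steps=100, warm_steps=10)

with tf.Graph().as_default():
  global_step = tf.compat.v1.train.get_or_create_global_step()
  train_op = tf.compat.v1.assign_add(global_step, 1)  # stand-in train op
  with tf.compat.v1.train.MonitoredTrainingSession(hooks=[hook]) as sess:
    for _ in range(1000):
      sess.run(train_op)

# Per-interval throughput readings collected by the hook:
print(hook.current_examples_per_sec_list)
```

Note the two metrics it logs: `average_examples_per_sec` uses cumulative steps over cumulative time (batch_size * total_steps / total_time), while `current_examples_per_sec` uses only the most recent interval, so it reacts faster to throughput changes.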