TensorFlow Profiling Examples

Before launching a training job, please copy the raw data and define the environment variables (the staging bucket and the bucket where you are going to store the data as well as the training job's outputs):

export BUCKET=YOUR_BUCKET
gsutil -m cp gs://cloud-training-demos/babyweight/preproc/* gs://$BUCKET/babyweight/preproc/
export BUCKET_STAGING=YOUR_STAGING_BUCKET

You also need to have bazel installed. The code below is based on this codelab (you can find more here).

Profiler hooks

You can dump profiles every n-th step. We are going to demonstrate how to collect dumps both in distributed mode and when training is done on a single machine (including training with a GPU accelerator). After you've trained a model (see the examples below), you need to copy the dumps locally in order to inspect them. You can do it as follows:

rm -rf /tmp/profiler
mkdir -p /tmp/profiler
gsutil -m cp -r $OUTDIR/timeline*.json /tmp/profiler

Now you can open the trace event profiling tool in your Chrome browser (chrome://tracing), load a specific timeline, and inspect it visually.
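
For reference, timelines of this kind are what tf.train.ProfilerHook produces when attached to an Estimator. Here is a minimal, self-contained sketch of the pattern; the toy model_fn and input_fn are placeholders for illustration, not the actual trainer code:

import tensorflow as tf

def model_fn(features, labels, mode):
    # Toy linear model, just enough to make the sketch runnable.
    predictions = tf.layers.dense(features['x'], 1)
    loss = tf.losses.mean_squared_error(labels, predictions)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

def input_fn():
    # A single repeated example keeps the sketch self-contained.
    return tf.data.Dataset.from_tensors(({'x': [[1.0]]}, [[2.0]])).repeat()

# Dump a Chrome-trace timeline (timeline-<step>.json) every 100th step.
profiler_hook = tf.train.ProfilerHook(
    save_steps=100,
    output_dir='/tmp/profiler_out',  # in the jobs below this is $OUTDIR
    show_dataflow=True,   # annotate tensor transfers between devices
    show_memory=False)    # skip per-op memory allocation details

estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir='/tmp/model')
estimator.train(input_fn=input_fn, steps=500, hooks=[profiler_hook])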

Training on a single CPU machine: BASIC CMLE tier

You can launch the job as follows:

OUTDIR=gs://$BUCKET/babyweight/hooks_basic
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
  --region=us-west1 \
  --module-name=trainer-hooks.task \
  --package-path=trainer-hooks \
  --job-dir=$OUTDIR \
  --staging-bucket=gs://$BUCKET_STAGING \
  --scale-tier=BASIC \
  --runtime-version="1.10" \
  -- \
  --bucket=$BUCKET/babyweight \
  --output_dir=${OUTDIR} \
  --eval_int=1200 \
  --train_steps=50000

Distributed training on CPUs: STANDARD tier

You can launch the job as follows:

OUTDIR=gs://$BUCKET/babyweight/hooks_standard
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
  --region=us-west1 \
  --module-name=trainer-hooks.task \
  --package-path=trainer-hooks \
  --job-dir=$OUTDIR \
  --staging-bucket=gs://$BUCKET_STAGING \
  --scale-tier=STANDARD_1 \
  --runtime-version="1.10" \
  -- \
  --bucket=$BUCKET/babyweight \
  --output_dir=${OUTDIR} \
  --train_steps=50000

Training on a GPU: BASIC_GPU tier

OUTDIR=gs://$BUCKET/babyweight/hooks_gpu
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
  --region=us-west1 \
  --module-name=trainer-hooks.task \
  --package-path=trainer-hooks \
  --job-dir=$OUTDIR \
  --staging-bucket=gs://$BUCKET_STAGING \
  --scale-tier=BASIC_GPU \
  --runtime-version="1.10" \
  -- \
  --bucket=$BUCKET/babyweight \
  --output_dir=${OUTDIR} \
  --batch_size=8192 \
  --train_steps=21000

Defining your own schedule

OUTDIR=gs://$BUCKET/babyweight/hooks_basic-ext
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
  --region=us-west1 \
  --module-name=trainer-hooks-ext.task \
  --package-path=trainer-hooks-ext \
  --job-dir=$OUTDIR \
  --staging-bucket=gs://$BUCKET_STAGING \
  --scale-tier=BASIC \
  --runtime-version="1.10" \
  -- \
  --bucket=$BUCKET/babyweight \
  --output_dir=${OUTDIR} \
  --eval_int=1200 \
  --train_steps=15000
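
The trainer-hooks-ext package wires in a custom profiling schedule. As a hedged sketch of one way such a schedule can be built on top of tf.train.SessionRunHook (the class name and the example schedule are illustrative, not the package's actual code):

import tensorflow as tf
from tensorflow.python.client import timeline

class ScheduledProfilerHook(tf.train.SessionRunHook):
    """Writes a Chrome trace only on steps selected by schedule_fn."""

    def __init__(self, output_dir, schedule_fn):
        self._output_dir = output_dir
        self._schedule_fn = schedule_fn  # step -> bool
        self._step = 0                   # local step counter

    def before_run(self, run_context):
        self._tracing = self._schedule_fn(self._step)
        options = (tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
                   if self._tracing else None)
        return tf.train.SessionRunArgs(fetches=None, options=options)

    def after_run(self, run_context, run_values):
        if self._tracing:
            # run_metadata is populated because we requested FULL_TRACE.
            trace = timeline.Timeline(run_values.run_metadata.step_stats)
            path = '%s/timeline-%d.json' % (self._output_dir, self._step)
            with tf.gfile.Open(path, 'w') as f:
                f.write(trace.generate_chrome_trace_format())
        self._step += 1

# Example schedule: trace densely during warm-up, sparsely afterwards.
hook = ScheduledProfilerHook('/tmp/profiler_out',
                             lambda step: step < 10 or step % 1000 == 0)

Passed via hooks=[hook] to estimator.train(), this produces the same timeline*.json files as above, but on an arbitrary schedule.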

Deep profiling

We can collect a deep-profiling dump that can later be analyzed with the profiling CLI tool or with profiler-ui, as described in a post (ADD LINK HERE). Launch the training job as follows:

OUTDIR=gs://$BUCKET/babyweight/profiler_standard
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
  --region=us-west1 \
  --module-name=trainer-deep-profiling.task \
  --package-path=trainer-deep-profiling \
  --job-dir=$OUTDIR \
  --staging-bucket=gs://$BUCKET_STAGING \
  --scale-tier=STANDARD_1 \
  --runtime-version="1.10" \
  -- \
  --bucket=$BUCKET/babyweight \
  --output_dir=${OUTDIR} \
  --train_steps=100000
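
Inside the trainer, a dump of this kind is typically collected with tf.contrib.tfprof.ProfileContext. A minimal sketch, assuming the Estimator setup from the earlier example (the step numbers are illustrative):

import tensorflow as tf

# Trace a few steps in detail and write the dump where the profiler CLI
# and profiler-ui can pick it up ($OUTDIR/profiler in the job above).
with tf.contrib.tfprof.ProfileContext('/tmp/profiler',
                                      trace_steps=range(1000, 1010),
                                      dump_steps=[1010]):
    estimator.train(input_fn=input_fn, steps=100000)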

Profiler CLI

  1. In order to use the profiler CLI, you need to build the profiler first:

     git clone https://github.com/tensorflow/tensorflow.git
     cd tensorflow
     bazel build -c opt tensorflow/core/profiler:profiler

  2. Copy the dumps locally:

     rm -rf /tmp/profiler
     mkdir -p /tmp/profiler
     gsutil -m cp -r $OUTDIR/profiler /tmp

  3. Launch the profiler:

     bazel-bin/tensorflow/core/profiler/profiler --profile_path=/tmp/profiler/$(ls /tmp/profiler/ | head -1)

Profiler UI

You can also use profiler-ui, a web interface for the TensorFlow profiler.

  1. If you'd like to install pprof, please follow the installation instructions.
  2. Clone the repository:

     git clone https://github.com/tensorflow/profiler-ui.git
     cd profiler-ui

  3. Copy the dumps locally:

     rm -rf /tmp/profiler
     mkdir -p /tmp/profiler
     gsutil -m cp -r $OUTDIR/profiler /tmp

  4. Launch the profiler:

     python ui.py --profile_context_path=/tmp/profiler/$(ls /tmp/profiler/ | head -1)