
Fixing out-of-date README.md for heterogeneous clusters feature #3617

Merged 23 commits on Oct 12, 2022

Changes from 22 commits

Commits
6b031e6
initial commit
gilinachum Sep 13, 2022
9d50385
notebook fix and misspelling
gilinachum Sep 13, 2022
06da2ce
add link from root readme.md
gilinachum Sep 13, 2022
bd479c6
switching cifar-10 to artificial dataset for TF
gilinachum Sep 24, 2022
fe7f8ce
adding retries to fit()
gilinachum Sep 24, 2022
9004b21
Merge branch 'main' into main
gilinachum Sep 24, 2022
790d99d
grammar fixes
gilinachum Sep 26, 2022
38ef3f5
Merge branch 'main' of https://github.com/gilinachum/amazon-sagemaker…
gilinachum Sep 26, 2022
d47542b
remove cifar references
gilinachum Sep 26, 2022
2da3cd9
Removing local tf and pt execution examples
gilinachum Oct 4, 2022
9bc6d34
Add security group info for private VPC use case
gilinachum Oct 4, 2022
3415bd9
Adding index.rst for heterogeneous clusters
gilinachum Oct 4, 2022
0490d31
Merge branch 'aws:main' into main
gilinachum Oct 4, 2022
280dbbd
fix PT notebook heading for rst
gilinachum Oct 4, 2022
f8ee5df
fix rst and notebook tables for rst
gilinachum Oct 4, 2022
203861a
Merge branch 'main' of https://github.com/gilinachum/amazon-sagemaker…
gilinachum Oct 4, 2022
307ae9d
Adding programmatic kernel restart
gilinachum Oct 4, 2022
c56262b
removing programmatic kernel restart - breaks CI
gilinachum Oct 4, 2022
6b687d3
Remove tables that don't render in RST
gilinachum Oct 4, 2022
bd63e35
Merge branch 'aws:main' into main
gilinachum Oct 6, 2022
8805853
updating outofdate readme.md
gilinachum Oct 6, 2022
271b0a9
Merge branch 'aws:main' into main
gilinachum Oct 9, 2022
3003979
Merge branch 'aws:main' into main
gilinachum Oct 12, 2022
50 changes: 21 additions & 29 deletions training/heterogeneous-clusters/README.md
@@ -1,38 +1,30 @@
-# Heterogeneous Clusters
-SageMaker Training Heterogeneous Clusters allows you to run one training job
-that includes instances of different types. For example a GPU instance like
-ml.p4d.24xlarge and a CPU instance like c5.18xlarge.
+# SageMaker Heterogeneous Clusters for Model Training
+In July 2022, we [launched](https://aws.amazon.com/about-aws/whats-new/2022/07/announcing-heterogeneous-clusters-amazon-sagemaker-model-training/) heterogeneous clusters for Amazon SageMaker
+model training, which enables you to launch training jobs that use different instance types and
+families in a single job. A primary use case is offloading data preprocessing to
+compute-optimized instance types, whereas the deep neural network (DNN) process continues to
+run on GPU or ML accelerated instance types.
 
-One primary use case is offloading CPU intensive tasks like image
-pre-processing (data augmentation) from the GPU instance to a dedicate
-CPU instance, so you can fully utilize the expensive GPUs, and arrive at
-an improved time and cost to train.
-
-You'll find TensorFlow (tf.data.service) and PyTorch (a customer gRPC based distributed data loading) examples on how to utilize Heterogeneous clusters in your training jobs. You can reuse these examples when enabling your own training workload to use heterogeneous clusters.
+In this repository, you'll find TensorFlow (tf.data.service) and PyTorch (a custom gRPC based distributed data loading) examples which demonstrates how to use heterogeneous clusters in your SageMaker training jobs. You can use these examples with minimal code changes in your existing training scripts.
 
 ![Hetero job diagram](tf.data.service.sagemaker/images/basic-heterogeneous-job.png)
 
 ## Examples:
 
-### TensorFlow examples
-- [**TensorFlow's tf.data.service running locally**](tf.data.service.local/README.md):
-This example runs the tf.data.service locally on your machine (not on SageMaker). It's helpful in order to get familiar with tf.data.service and to run small scale quick experimentation.
+### Hello world example
+- [**Heterogeneous Clusters - a hello world example**](hello.world.sagemaker/helloworld-example.ipynb):
+This basic example runs a heterogeneous training job consisting of two instance groups. Each group includes a different instance_type.
+Each instance prints its instance group information and exits.
+Note: This example only shows how to orchestrate the training job with instance type. For actual code to help with a distributed data loader, see the TensorFlow or PyTorch examples below.
 
-- [**TensorFlow's tf.data.service with Amazon SageMaker Training Heterogeneous Clusters**](tf.data.service.sagemaker/hetero-tensorflow-restnet50.ipynb):
-This TensorFlow example runs a Homogenous trainign job and compares its results with a Heterogeneous Clusters SageMaker training job that runs with two instance groups:
-  - `data_group` - this group has two ml.c5.18xlarge instances to which data augmentation is offloaded.
-  - `dnn_group` - Running one ml.p4d.24xlarge instance (8GPUs) in a horovod/MPI distribution.
+### TensorFlow examples
+- [**TensorFlow's tf.data.service based Amazon SageMaker Heterogeneous Clusters**](tf.data.service.sagemaker/hetero-tensorflow-restnet50.ipynb):
+This TensorFlow example runs both Homogeneous and Heterogeneous clusters SageMaker training job, and compares their results. The heterogeneous cluster training job runs with two instance groups:
+  - `data_group` - this group has two ml.c5.18xlarge instances to which data preprocessing/augmentation is offloaded.
+  - `dnn_group` - this group has one ml.p4d.24xlarge instance (8GPUs) in a horovod/MPI distribution.
 
 ### PyTorch examples
-- [**PyTorch with gRPC distributed dataloader running locally**](pt.grpc.local/README.md):
-This Pytorch example runs a training job split into two processes locally on your machine (not on SageMaker). It's helpful in order to get familiar with the GRPC distributed data loader and to run small scale quick experimentation.
-
-- [**PyTorch with gRPC distributed dataloader Heterogeneous Clusters training job example**](pt.grpc.sagemaker/hetero-pytorch-mnist.ipynb):
-This PyTorch example runs a Hetero SageMaker training job that uses gRPC to offload data augmentation to a CPU based server.
-
-
-### Hello world example
-- [**Hetero Training Job - Hello world**](hello.world.sagemaker/README.md):
-This basic example run a heterogeneous training job consisting of two instance groups. Each group includes a different instance_type.
-Each instance prints its instance group information and exits.
-Note: This example only shows how to orchastrate the training job with instance type, for actual code to help with a distributed data loader, see the TF or PT examples below.
+- [**PyTorch and gRPC distributed dataloader based Amazon SageMaker Heterogeneous Clusters**](pt.grpc.sagemaker/hetero-pytorch-mnist.ipynb):
+This PyTorch example enables you to run both Homogeneous and Heterogeneous clusters SageMaker training job. We then compare their results, and understand price performance benefits.
+  - `data_group` - this group has one ml.c5.9xlarge instance for offloading data preprocessing job.
+  - `dnn_group` - this group has one ml.p3.2xlarge instance
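
For orientation, below is a minimal, hypothetical sketch (not taken from this repository) of how such a heterogeneous cluster training job is typically launched with the SageMaker Python SDK: two `InstanceGroup` definitions are passed to the estimator in place of a single `instance_type`/`instance_count`. The entry point script, role ARN, and framework versions are placeholders, and a reasonably recent SDK (roughly v2.98.0 or later) is assumed for `instance_groups` support.

```python
# Hypothetical launcher sketch - script name, role ARN, and versions are placeholders.
from sagemaker.tensorflow import TensorFlow
from sagemaker.instance_group import InstanceGroup

data_group = InstanceGroup(
    instance_group_name="data_group",  # CPU instances that run data preprocessing
    instance_type="ml.c5.18xlarge",
    instance_count=2,
)
dnn_group = InstanceGroup(
    instance_group_name="dnn_group",   # GPU instance that runs the DNN training loop
    instance_type="ml.p4d.24xlarge",
    instance_count=1,
)

estimator = TensorFlow(
    entry_point="train.py",            # placeholder training script
    role="arn:aws:iam::111122223333:role/MySageMakerRole",  # placeholder role
    framework_version="2.9.1",
    py_version="py39",
    instance_groups=[data_group, dnn_group],  # instead of instance_type/instance_count
)
estimator.fit()
```

The tf.data.service pattern referenced by the TensorFlow example can be sketched as follows; the dispatcher host and port are assumptions for illustration, and in a real job they would come from the data_group hosts exposed in the training environment.

```python
# Rough sketch of consuming a tf.data.service: heavy preprocessing is executed by
# the service (running on the data_group), while dnn_group workers only consume results.
import tensorflow as tf

def make_dataset(dispatcher_host: str) -> tf.data.Dataset:
    ds = tf.data.Dataset.range(100_000)
    ds = ds.map(lambda x: x * 2, num_parallel_calls=tf.data.AUTOTUNE)  # stand-in for augmentation
    ds = ds.apply(
        tf.data.experimental.service.distribute(
            processing_mode="distributed_epoch",
            service=f"grpc://{dispatcher_host}:6000",
        )
    )
    return ds.prefetch(tf.data.AUTOTUNE)
```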