Adding Heterogeneous Clusters example for TensorFlow and PyTorch (aws…
…#3599)

* initial commit

* notebook fix and misspelling

* add link from root readme.md

* switching cifar-10 to artificial dataset for TF

* adding retries to fit()

* grammer fixes

* remove cifar references

* Removing local tf and pt execution exmaples

* Add security group info for private VPC use case

* Adding index.rst for heterogeneous clusters

* fix PT notebook heading for rst

* fix rst and notebook tables for rst

* Adding programmatic kernel restart

* removing programmatic kernel restart - breaks CI

* Remove tables that don't render in RST
gilinachum authored and atqy committed Oct 28, 2022
1 parent e117b35 commit f1729fe
Showing 39 changed files with 3,483 additions and 1 deletion.
1 change: 1 addition & 0 deletions README.md
@@ -187,6 +187,7 @@ These examples showcase unique functionality available in Amazon SageMaker. They
- [Host Multiple Models with SKLearn](advanced_functionality/multi_model_sklearn_home_value) shows how to deploy multiple models to a realtime hosted endpoint using a multi-model enabled SKLearn container.
- [SageMaker Training and Inference with Script Mode](sagemaker-script-mode) shows how to use custom training and inference scripts, similar to those you would use outside of SageMaker, with SageMaker's prebuilt containers for various frameworks like Scikit-learn, PyTorch, and XGBoost.
- [Host Models with NVidia Triton Server](sagemaker-triton) shows how to deploy models to a realtime hosted endpoint using [Triton](https://developer.nvidia.com/nvidia-triton-inference-server) as the model inference server.
- [Heterogeneous Clusters Training in TensorFlow or PyTorch](training/heterogeneous-clusters/README.md) shows how to train using TensorFlow tf.data.service (distributed data pipeline) or PyTorch (with gRPC) on top of Amazon SageMaker heterogeneous clusters to overcome CPU bottlenecks by including different instance types (GPU/CPU) in the same training job.

### Amazon SageMaker Neo Compilation Jobs

2 changes: 1 addition & 1 deletion index.rst
@@ -185,7 +185,7 @@ More examples
sagemaker-script-mode/index
training/bring_your_own_container
training/management

training/heterogeneous-clusters/index

.. toctree::
:maxdepth: 1
40 changes: 40 additions & 0 deletions sagemaker-datawrangler/readme.md
@@ -1 +1,41 @@
![Amazon SageMaker Data Wrangler](https://github.com/aws/amazon-sagemaker-examples/raw/main/_static/sagemaker-banner.png)

# Amazon SageMaker Data Wrangler Examples

Example flows that demonstrate how to aggregate and prepare data for Machine Learning using Amazon SageMaker Data Wrangler.

## :books: Background

[Amazon SageMaker Data Wrangler](https://aws.amazon.com/sagemaker/data-wrangler/) reduces the time it takes to aggregate and prepare data for ML. From a single interface in SageMaker Studio, you can import data from Amazon S3, Amazon Athena, Amazon Redshift, AWS Lake Formation, and Amazon SageMaker Feature Store, and in just a few clicks SageMaker Data Wrangler will automatically load, aggregate, and display the raw data. It will then make conversion recommendations based on the source data, transform the data into new features, validate the features, and provide visualizations with recommendations on how to remove common sources of error such as incorrect labels. Once your data is prepared, you can build fully automated ML workflows with Amazon SageMaker Pipelines or import that data into Amazon SageMaker Feature Store.



The [SageMaker example notebooks](https://sagemaker-examples.readthedocs.io/en/latest/) are Jupyter notebooks that demonstrate the usage of Amazon SageMaker.

## :hammer_and_wrench: Setup

Amazon SageMaker Data Wrangler is a feature in Amazon SageMaker Studio. Use this section to learn how to access and get started using Data Wrangler. Do the following:

* Complete each step in [Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-getting-started.html#data-wrangler-getting-started-prerequisite).

* Follow the procedure in [Access Data Wrangler](https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-getting-started.html#data-wrangler-getting-started-access) to start using Data Wrangler.




## :notebook: Examples

### **[Tabular DataFlow](tabular-dataflow/README.md)**

This example provides a quick walkthrough of how to aggregate and prepare data for machine learning with Amazon SageMaker Data Wrangler, using a tabular dataset.

### **[Timeseries DataFlow](timeseries-dataflow/readme.md)**

This example provides a quick walkthrough of how to aggregate and prepare data for machine learning with Amazon SageMaker Data Wrangler, using a time series dataset.

### **[Joined DataFlow](joined-dataflow/readme.md)**

This example provides a quick walkthrough of how to aggregate and prepare data for machine learning with Amazon SageMaker Data Wrangler, using a joined dataset.



11 changes: 11 additions & 0 deletions training/heterogeneous-clusters/.gitignore
@@ -0,0 +1,11 @@
.venv/
.DS_Store
data/MyMNIST
pt.grpc.local/data/*
pt.grpc.local/__pycache__
pt.grpc.local/profile
tf.data.service.sagemaker/data
tf.data.service.sagemaker/code/__pycache__
tf.data.service.local/data
pt.grpc.sagemaker/data
tf.data.service.sagemaker/__pycache__
38 changes: 38 additions & 0 deletions training/heterogeneous-clusters/README.md
@@ -0,0 +1,38 @@
# Heterogeneous Clusters
SageMaker Training Heterogeneous Clusters allows you to run one training job
that includes instances of different types, for example a GPU instance like
ml.p4d.24xlarge together with a CPU instance like ml.c5.18xlarge.

One primary use case is offloading CPU-intensive tasks like image
pre-processing (data augmentation) from the GPU instance to a dedicated
CPU instance, so you can fully utilize the expensive GPUs and reduce
both the time and the cost of training.

You'll find TensorFlow (tf.data.service) and PyTorch (a custom gRPC-based distributed data loader) examples that show how to use heterogeneous clusters in your training jobs. You can reuse these examples when enabling your own training workload to use heterogeneous clusters.

![Hetero job diagram](tf.data.service.sagemaker/images/basic-heterogeneous-job.png)

## Examples:

### TensorFlow examples
- [**TensorFlow's tf.data.service running locally**](tf.data.service.local/README.md):
This example runs tf.data.service locally on your machine (not on SageMaker). It helps you get familiar with tf.data.service and run quick, small-scale experiments.

- [**TensorFlow's tf.data.service with Amazon SageMaker Training Heterogeneous Clusters**](tf.data.service.sagemaker/hetero-tensorflow-restnet50.ipynb):
This TensorFlow example runs a homogeneous training job and compares its results with a Heterogeneous Clusters SageMaker training job that runs with two instance groups:
- `data_group` - two ml.c5.18xlarge instances to which data augmentation is offloaded.
- `dnn_group` - one ml.p4d.24xlarge instance (8 GPUs) running in a Horovod/MPI distribution.
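
As a rough illustration of the two-group layout described above, the sketch below models the instance groups as plain Python data and checks that the job mixes instance types. `InstanceGroupSpec` and `describe_cluster` are hypothetical helpers written for this illustration only; they are not part of the SageMaker SDK or of the example notebook, where the groups are declared through the SageMaker Python SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceGroupSpec:
    """Hypothetical stand-in for one instance group of a training job."""
    name: str
    instance_type: str
    instance_count: int


def describe_cluster(groups):
    """Return (total instance count, sorted list of distinct instance types)."""
    names = [g.name for g in groups]
    if len(names) != len(set(names)):
        raise ValueError("instance group names must be unique")
    total = sum(g.instance_count for g in groups)
    types = sorted({g.instance_type for g in groups})
    return total, types


# Group names, instance types, and counts are taken from the description above.
groups = [
    InstanceGroupSpec("data_group", "ml.c5.18xlarge", 2),  # CPU: data augmentation
    InstanceGroupSpec("dnn_group", "ml.p4d.24xlarge", 1),  # GPU: model training
]

total_instances, instance_types = describe_cluster(groups)
# A job is heterogeneous when it uses more than one instance type.
is_heterogeneous = len(instance_types) > 1
print(total_instances, instance_types, is_heterogeneous)
```

Keeping the layout in one small data structure makes it easy to compare against a homogeneous baseline, which would contain a single group.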

### PyTorch examples
- [**PyTorch with gRPC distributed dataloader running locally**](pt.grpc.local/README.md):
This PyTorch example runs a training job split into two processes locally on your machine (not on SageMaker). It helps you get familiar with the gRPC distributed data loader and run quick, small-scale experiments.

- [**PyTorch with gRPC distributed dataloader Heterogeneous Clusters training job example**](pt.grpc.sagemaker/hetero-pytorch-mnist.ipynb):
This PyTorch example runs a heterogeneous SageMaker training job that uses gRPC to offload data augmentation to a CPU-based server.


### Hello world example
- [**Hetero Training Job - Hello world**](hello.world.sagemaker/README.md):
This basic example runs a heterogeneous training job consisting of two instance groups. Each group includes a different instance_type.
Each instance prints its instance group information and exits.
Note: This example only shows how to orchestrate the training job with different instance types. For actual code that implements a distributed data loader, see the TF or PT examples above.
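
A minimal sketch of what "prints its instance group information" could look like. It assumes each instance can read a JSON resource configuration with the field names used below (`current_host`, `current_instance_type`, `current_group_name`, `instance_groups`). Those field names, the sample values, and the `summarize_instance` helper are assumptions made for illustration, so check them against the hello world example before relying on them.

```python
import json

# Hypothetical sample of a per-instance resource configuration (assumed layout).
SAMPLE_RESOURCE_CONFIG = """
{
  "current_host": "algo-1",
  "current_instance_type": "ml.c5.xlarge",
  "current_group_name": "group_a",
  "instance_groups": [
    {"instance_group_name": "group_a", "instance_type": "ml.c5.xlarge",
     "hosts": ["algo-1", "algo-2"]},
    {"instance_group_name": "group_b", "instance_type": "ml.m5.xlarge",
     "hosts": ["algo-3"]}
  ]
}
"""


def summarize_instance(resource_config):
    """Build a one-line description of the current instance and its group."""
    group_name = resource_config["current_group_name"]
    # Find the entry for the group this instance belongs to.
    group = next(
        g for g in resource_config["instance_groups"]
        if g["instance_group_name"] == group_name
    )
    return (
        f"{resource_config['current_host']}: group={group_name} "
        f"type={resource_config['current_instance_type']} "
        f"hosts_in_group={len(group['hosts'])}"
    )


summary = summarize_instance(json.loads(SAMPLE_RESOURCE_CONFIG))
print(summary)
```

In a real job, a training script would typically branch on the group name at this point, for example starting a data server in one group and the training loop in the other.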