Adding Heterogeneous Clusters example for TensorFlow and PyTorch #3599
Merged
19 commits
- 6b031e6 initial commit (gilinachum)
- 9d50385 notebook fix and misspelling (gilinachum)
- 06da2ce add link from root readme.md (gilinachum)
- bd479c6 switching cifar-10 to artificial dataset for TF (gilinachum)
- fe7f8ce adding retries to fit() (gilinachum)
- 9004b21 Merge branch 'main' into main (gilinachum)
- 790d99d grammer fixes (gilinachum)
- 38ef3f5 Merge branch 'main' of https://github.com/gilinachum/amazon-sagemaker… (gilinachum)
- d47542b remove cifar references (gilinachum)
- 2da3cd9 Removing local tf and pt execution exmaples (gilinachum)
- 9bc6d34 Add security group info for private VPC use case (gilinachum)
- 3415bd9 Adding index.rst for heterogeneous clusters (gilinachum)
- 0490d31 Merge branch 'aws:main' into main (gilinachum)
- 280dbbd fix PT notebook heading for rst (gilinachum)
- f8ee5df fix rst and notebook tables for rst (gilinachum)
- 203861a Merge branch 'main' of https://github.com/gilinachum/amazon-sagemaker… (gilinachum)
- 307ae9d Adding programmatic kernel restart (gilinachum)
- c56262b removing programmatic kernel restart - breaks CI (gilinachum)
- 6b687d3 Remove tables that don't render in RST (gilinachum)
@@ -1 +1,41 @@
![Amazon SageMaker Data Wrangler](https://github.com/aws/amazon-sagemaker-examples/raw/main/_static/sagemaker-banner.png)

# Amazon SageMaker Data Wrangler Examples

Example flows that demonstrate how to aggregate and prepare data for Machine Learning using Amazon SageMaker Data Wrangler.

## :books: Background

[Amazon SageMaker Data Wrangler](https://aws.amazon.com/sagemaker/data-wrangler/) reduces the time it takes to aggregate and prepare data for ML. From a single interface in SageMaker Studio, you can import data from Amazon S3, Amazon Athena, Amazon Redshift, AWS Lake Formation, and Amazon SageMaker Feature Store, and in just a few clicks SageMaker Data Wrangler will automatically load, aggregate, and display the raw data. It will then make conversion recommendations based on the source data, transform the data into new features, validate the features, and provide visualizations with recommendations on how to remove common sources of error such as incorrect labels. Once your data is prepared, you can build fully automated ML workflows with Amazon SageMaker Pipelines or import that data into Amazon SageMaker Feature Store.

The [SageMaker example notebooks](https://sagemaker-examples.readthedocs.io/en/latest/) are Jupyter notebooks that demonstrate the usage of Amazon SageMaker.

## :hammer_and_wrench: Setup

Amazon SageMaker Data Wrangler is a feature in Amazon SageMaker Studio. Use this section to learn how to access and get started using Data Wrangler. Do the following:

* Complete each step in [Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-getting-started.html#data-wrangler-getting-started-prerequisite).

* Follow the procedure in [Access Data Wrangler](https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-getting-started.html#data-wrangler-getting-started-access) to start using Data Wrangler.

## :notebook: Examples

### **[Tabular DataFlow](tabular-dataflow/README.md)**

This example provides a quick walkthrough of how to aggregate and prepare data for Machine Learning using Amazon SageMaker Data Wrangler with a tabular dataset.

### **[Timeseries DataFlow](timeseries-dataflow/readme.md)**

This example provides a quick walkthrough of how to aggregate and prepare data for Machine Learning using Amazon SageMaker Data Wrangler with a time series dataset.

### **[Joined DataFlow](joined-dataflow/readme.md)**

This example provides a quick walkthrough of how to aggregate and prepare data for Machine Learning using Amazon SageMaker Data Wrangler with a joined dataset.

@@ -0,0 +1,11 @@
.venv/
.DS_Store
data/MyMNIST
pt.grpc.local/data/*
pt.grpc.local/__pycache__
pt.grpc.local/profile
tf.data.service.sagemaker/data
tf.data.service.sagemaker/code/__pycache__
tf.data.service.local/data
pt.grpc.sagemaker/data
tf.data.service.sagemaker/__pycache__

@@ -0,0 +1,38 @@
# Heterogeneous Clusters
SageMaker Training Heterogeneous Clusters allows you to run one training job
that includes instances of different types, for example a GPU instance like
ml.p4d.24xlarge and a CPU instance like ml.c5.18xlarge.

One primary use case is offloading CPU-intensive tasks like image
pre-processing (data augmentation) from the GPU instance to a dedicated
CPU instance, so you can fully utilize the expensive GPUs and arrive at
an improved time and cost to train.

You'll find TensorFlow (tf.data.service) and PyTorch (a custom gRPC-based distributed data loader) examples of how to use heterogeneous clusters in your training jobs. You can reuse these examples when enabling your own training workload to use heterogeneous clusters.

![Hetero job diagram](tf.data.service.sagemaker/images/basic-heterogeneous-job.png)

## Examples:

### TensorFlow examples
- [**TensorFlow's tf.data.service running locally**](tf.data.service.local/README.md):
This example runs the tf.data.service locally on your machine (not on SageMaker). It helps you get familiar with tf.data.service and run quick, small-scale experiments.

- [**TensorFlow's tf.data.service with Amazon SageMaker Training Heterogeneous Clusters**](tf.data.service.sagemaker/hetero-tensorflow-restnet50.ipynb):
This TensorFlow example runs a homogeneous training job and compares its results with a Heterogeneous Clusters SageMaker training job that runs with two instance groups:
  - `data_group` - two ml.c5.18xlarge instances to which data augmentation is offloaded.
  - `dnn_group` - one ml.p4d.24xlarge instance (8 GPUs) running in a Horovod/MPI distribution.

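The two instance groups described above could be declared roughly as follows. This is a minimal sketch, assuming the `ResourceConfig.InstanceGroups` request shape of the SageMaker `CreateTrainingJob` API. The helper `build_resource_config` and the 30 GB volume size are hypothetical and are not taken from the notebook, which may configure the groups through the SageMaker Python SDK instead.

```python
from typing import Dict, List, Tuple


def build_resource_config(
    groups: List[Tuple[str, str, int]], volume_size_gb: int = 30
) -> Dict:
    """Build a ResourceConfig-style dict with one entry per instance group.

    Each tuple is (group_name, instance_type, instance_count).
    The helper name and default volume size are illustrative assumptions.
    """
    names = [name for name, _, _ in groups]
    if len(set(names)) != len(names):
        raise ValueError("instance group names must be unique")
    return {
        "VolumeSizeInGB": volume_size_gb,
        "InstanceGroups": [
            {
                "InstanceGroupName": name,
                "InstanceType": instance_type,
                "InstanceCount": count,
            }
            for name, instance_type, count in groups
        ],
    }


# Group names and instance types mirror the README text above.
resource_config = build_resource_config(
    [
        ("data_group", "ml.c5.18xlarge", 2),
        ("dnn_group", "ml.p4d.24xlarge", 1),
    ]
)
```

The sketch only builds the request fragment and does not call AWS. Keeping the CPU and GPU groups as separate named entries is what lets the training code later decide which role each instance plays.
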
### PyTorch examples
- [**PyTorch with gRPC distributed dataloader running locally**](pt.grpc.local/README.md):
This PyTorch example runs a training job split into two processes locally on your machine (not on SageMaker). It helps you get familiar with the gRPC distributed data loader and run quick, small-scale experiments.

- [**PyTorch with gRPC distributed dataloader Heterogeneous Clusters training job example**](pt.grpc.sagemaker/hetero-pytorch-mnist.ipynb):
This PyTorch example runs a heterogeneous SageMaker training job that uses gRPC to offload data augmentation to a CPU-based server.

### Hello world example
- [**Hetero Training Job - Hello world**](hello.world.sagemaker/README.md):
This basic example runs a heterogeneous training job consisting of two instance groups. Each group uses a different instance_type.
Each instance prints its instance group information and exits.
Note: This example only shows how to orchestrate the training job with instance types. For actual code to help with a distributed data loader, see the TF or PT examples above.

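A script in the spirit of the hello world example could report its instance group like this. This is a sketch under assumptions: it assumes SageMaker exposes the group layout to the container in `/opt/ml/input/config/resourceconfig.json` with fields such as `current_group_name` and `instance_groups`. The function names and sample values are hypothetical and are not taken from the example's code.

```python
import json
from typing import Dict

# Assumed location of the resource configuration inside a training container.
RESOURCE_CONFIG_PATH = "/opt/ml/input/config/resourceconfig.json"


def load_resource_config(path: str = RESOURCE_CONFIG_PATH) -> Dict:
    """Read the resource configuration file (only exists inside a training job)."""
    with open(path) as f:
        return json.load(f)


def describe_instance(config: Dict) -> str:
    """Summarize which instance group the current host belongs to."""
    group_name = config["current_group_name"]
    # Find the entry for this host's group among all declared groups.
    group = next(
        g for g in config["instance_groups"]
        if g["instance_group_name"] == group_name
    )
    host = config["current_host"]
    instance_type = config["current_instance_type"]
    n_hosts = len(group["hosts"])
    return f"{host}: group={group_name}, type={instance_type}, hosts_in_group={n_hosts}"


# Hypothetical sample mirroring the assumed file layout, so the sketch
# can run outside SageMaker.
sample_config = {
    "current_host": "algo-1",
    "current_instance_type": "ml.c5.18xlarge",
    "current_group_name": "data_group",
    "hosts": ["algo-1", "algo-2", "algo-3"],
    "instance_groups": [
        {
            "instance_group_name": "data_group",
            "instance_type": "ml.c5.18xlarge",
            "hosts": ["algo-1", "algo-2"],
        },
        {
            "instance_group_name": "dnn_group",
            "instance_type": "ml.p4d.24xlarge",
            "hosts": ["algo-3"],
        },
    ],
}

summary = describe_instance(sample_config)
print(summary)  # algo-1: group=data_group, type=ml.c5.18xlarge, hosts_in_group=2
```

Inside a real training job, `describe_instance(load_resource_config())` would replace the sample dictionary, provided the file layout matches the assumption above.
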
Not sure why this change is here... seems to be the same file as what's in the repo now.
I'm not sure. Some auto merging issue from incoming changes to the fork.