Adding Heterogeneous Clusters example for TensorFlow and PyTorch #3599
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository
def start_child_process(name : str, additional_args=[]) -> int:
    params = ["python", f"./{name}"] + sys.argv[1:] + additional_args
    print(f'Opening process: {params}')
    p = subprocess.run(params)
Recommendation generated by Amazon CodeGuru Reviewer. Leave feedback on this recommendation by replying to the comment or by reacting to the comment using emoji.
Passing user-provided input to eval and exec functions without sanitization makes your code vulnerable to code injection. Make sure you implement input validation or use secure functions.
This warning is not relevant in the current context: the code executes the original command line that the SageMaker training toolkit asks for, just with an additional parameter.
How do I add a relevant ignore?
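For readers following the exchange, here is a minimal sketch of why a list-form `subprocess.run` call is not parsed by a shell. It is not the PR's code: the `Path`-based signature, the file-existence check, and the use of `sys.executable` are assumptions added for illustration.

```python
import subprocess
import sys
from pathlib import Path
from typing import Optional, Sequence


def start_child_process(script: Path, extra_args: Optional[Sequence[str]] = None) -> int:
    """Run `script` with the current interpreter and return its exit code."""
    # Hypothetical guard: refuse anything that is not an existing file.
    if not script.is_file():
        raise FileNotFoundError(script)
    # With a list argv and the default shell=False, each element reaches the
    # child process verbatim; characters such as ';' or '|' are not interpreted.
    argv = [sys.executable, str(script), *(extra_args or [])]
    completed = subprocess.run(argv, check=False)
    return completed.returncode
```

An argument such as `"a; echo b"` arrives in the child as a single string, which is the property the reply above relies on.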
queuing_process.terminate()
logger.info('Shutdown done.')
import os, time
os.system('kill -9 %d' % os.getpid())
Recommendation generated by Amazon CodeGuru Reviewer. Leave feedback on this recommendation by replying to the comment or by reacting to the comment using emoji.
Starting a process with a shell, possible injection detected, security issue. https://bandit.readthedocs.io/en/latest/plugins/b605_start_process_with_a_shell.html
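One way to avoid the shell entirely is to send signals through the standard library. The sketch below assumes a POSIX host and a child started with `subprocess.Popen`; `stop_process` and `grace_seconds` are hypothetical names, not part of the PR.

```python
import signal
import subprocess


def stop_process(proc: subprocess.Popen, grace_seconds: float = 2.0) -> int:
    """Send SIGTERM, then escalate to SIGKILL if the process is still alive."""
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        return proc.wait()


# For the current process, the shell-free equivalent of
# os.system('kill -9 %d' % os.getpid()) is:
#     os.kill(os.getpid(), signal.SIGKILL)
```

On POSIX, `Popen.wait()` reports death by signal as the negated signal number.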
_sym_db.RegisterMessage(Example)

_DATASETFEED = DESCRIPTOR.services_by_name['DatasetFeed']
if _descriptor._USE_C_DESCRIPTORS == False:
Recommendation generated by Amazon CodeGuru Reviewer. Leave feedback on this recommendation by replying to the comment or by reacting to the comment using emoji.
The == and != operators use the compared objects' __eq__ method to test if they are equal. To check if an object is a singleton, such as None, we recommend that you use the is identity comparison operator.
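The difference only shows up when a class overrides `__eq__`. The following sketch uses a hypothetical `AlwaysEqual` class to make that visible; the flagged line looks like protobuf-generated code, where the comparison is against a plain boolean and is harmless in practice.

```python
class AlwaysEqual:
    """Hypothetical object whose __eq__ claims equality with anything."""

    def __eq__(self, other: object) -> bool:
        return True


def equals_none(value: object) -> bool:
    # Delegates to value.__eq__, so a custom class can answer True.
    return value == None  # noqa: E711


def is_none(value: object) -> bool:
    # Identity comparison cannot be overridden by the object.
    return value is None
```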
    return size

def get_stub(self):
    channel = grpc.insecure_channel(f'{self.data_host}:6000',
Recommendation generated by Amazon CodeGuru Reviewer. Leave feedback on this recommendation by replying to the comment or by reacting to the comment using emoji.
Problem
This line of code might contain a resource leak. Resource leaks can cause your system to slow down or crash.
Fix
Consider closing the following resource: channel. The resource is allocated by call grpc.insecure_channel. Currently, there are execution paths that do not contain closure statements, for example, when grpc.channel_ready_future() throws an exception. Close channel in a try-finally block to prevent the resource leak.
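A possible shape for that fix is sketched below. Because the caller still needs the channel on success, this version closes it only on the failure path rather than in an unconditional `finally`. `grpc.insecure_channel`, `grpc.channel_ready_future`, and `Channel.close()` are real gRPC Python APIs; `open_ready_channel` and its injectable `channel_factory` / `ready_future` parameters are hypothetical and exist only so the logic can be exercised without a server.

```python
from typing import Any, Callable, Optional


def open_ready_channel(
    target: str,
    timeout: float = 5.0,
    channel_factory: Optional[Callable[[str], Any]] = None,
    ready_future: Optional[Callable[[Any], Any]] = None,
) -> Any:
    """Return a connected channel, closing it if readiness fails."""
    if channel_factory is None or ready_future is None:
        import grpc  # imported lazily; only needed for the real defaults

        channel_factory = channel_factory or grpc.insecure_channel
        ready_future = ready_future or grpc.channel_ready_future
    channel = channel_factory(target)
    try:
        # Blocks until the channel is ready or the timeout elapses.
        ready_future(channel).result(timeout=timeout)
    except BaseException:
        channel.close()  # do not leak the channel on timeout or error
        raise
    return channel
```

The caller remains responsible for closing the returned channel once the stub is no longer needed.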
        timeout=1)
    #print(f'DEBUG: Added example to queue')
    added = True
except:
Recommendation generated by Amazon CodeGuru Reviewer. Leave feedback on this recommendation by replying to the comment or by reacting to the comment using emoji.
Try, Except, Continue detected. https://bandit.readthedocs.io/en/latest/plugins/b112_try_except_continue.html
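A narrower alternative to a bare `except:` is to catch only the expected exception and log it. The sketch assumes the queue raises `queue.Full` on a timed-out `put`, which holds for `queue.Queue` and `multiprocessing.Queue`; `put_with_retry` and `max_attempts` are hypothetical names.

```python
import logging
import queue

logger = logging.getLogger(__name__)


def put_with_retry(q: "queue.Queue", item: object,
                   timeout: float = 1.0, max_attempts: int = 5) -> bool:
    """Try to enqueue `item`, retrying only when the queue is full."""
    for attempt in range(1, max_attempts + 1):
        try:
            q.put(item, timeout=timeout)
            return True
        except queue.Full:
            # Expected back-pressure; anything else propagates to the caller.
            logger.warning("Queue full (attempt %d/%d)", attempt, max_attempts)
    return False
```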
def install_dependencies():
    from subprocess import call
Recommendation generated by Amazon CodeGuru Reviewer. Leave feedback on this recommendation by replying to the comment or by reacting to the comment using emoji.
Consider possible security implications associated with call module. https://bandit.readthedocs.io/en/latest/blacklists/blacklist_imports.html#b404-import-subprocess
conn, addr = s.accept()
print('Received shutdown signal from: ', addr)
try:
    conn.close()
Recommendation generated by Amazon CodeGuru Reviewer. Leave feedback on this recommendation by replying to the comment or by reacting to the comment using emoji.
To close a socket and immediately release its associated resources, call socket.shutdown() before you call socket.close().
Similar issue at line number 14.
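The recommended order can be wrapped in a small helper. This is a sketch with a hypothetical name, `close_gracefully`; it tolerates the case where the peer has already disconnected.

```python
import logging
import socket

logger = logging.getLogger(__name__)


def close_gracefully(sock: socket.socket) -> None:
    """Shut down both directions of `sock`, then release the descriptor."""
    try:
        sock.shutdown(socket.SHUT_RDWR)
    except OSError as exc:
        # Raised when the socket is not connected or the peer is already gone.
        logger.debug("shutdown() skipped: %s", exc)
    finally:
        sock.close()
```

After `shutdown(SHUT_RDWR)` the peer sees end-of-stream on its next read, instead of waiting for the descriptor to be garbage-collected.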
print('Shutting down data service via port {}'.format(SHUTDOWN_PORT))
import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('localhost', SHUTDOWN_PORT))
Recommendation generated by Amazon CodeGuru Reviewer. Leave feedback on this recommendation by replying to the comment or by reacting to the comment using emoji.
To ensure your Python socket has a timeout, call socket.socket.settimeout() before you call socket.socket.connect(). A new Python socket by default doesn't have a timeout (its timeout defaults to None).
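A sketch of the shutdown client with a timeout applied before connecting follows. The function name `send_shutdown_signal` and the 5-second default are assumptions; the PR's code may structure this differently.

```python
import socket


def send_shutdown_signal(host: str, port: int, timeout: float = 5.0) -> bool:
    """Connect to the shutdown port; return False if unreachable in time."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)  # applies to connect() and later I/O
        try:
            s.connect((host, port))
        except OSError:
            # Covers timeouts and refused connections alike.
            return False
    return True
```

Without the `settimeout()` call, `connect()` to an unresponsive host can block for as long as the operating system's own connect timeout.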
import os
import time
from typing import Optional
import subprocess
Recommendation generated by Amazon CodeGuru Reviewer. Leave feedback on this recommendation by replying to the comment or by reacting to the comment using emoji.
Consider possible security implications associated with subprocess module. https://bandit.readthedocs.io/en/latest/blacklists/blacklist_imports.html#b404-import-subprocess
import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('localhost', SHUTDOWN_PORT))
s.close()
Recommendation generated by Amazon CodeGuru Reviewer. Leave feedback on this recommendation by replying to the comment or by reacting to the comment using emoji.
To close a socket and immediately release its associated resources, call socket.shutdown() before you call socket.close().
conn, addr = s.accept()
print('Received shutdown signal from: ', addr)
try:
    conn.close()
Recommendation generated by Amazon CodeGuru Reviewer. Leave feedback on this recommendation by replying to the comment or by reacting to the comment using emoji.
To close a socket and immediately release its associated resources, call socket.shutdown() before you call socket.close().
Similar issue at line number 137.
import os, time
os.system('kill %d' % os.getpid())
time.sleep(2)
os.system('kill -9 %d' % os.getpid())
Recommendation generated by Amazon CodeGuru Reviewer. Leave feedback on this recommendation by replying to the comment or by reacting to the comment using emoji.
Starting a process with a shell, possible injection detected, security issue. https://bandit.readthedocs.io/en/latest/plugins/b605_start_process_with_a_shell.html
def install_dependencies():
    from subprocess import call
Recommendation generated by Amazon CodeGuru Reviewer. Leave feedback on this recommendation by replying to the comment or by reacting to the comment using emoji.
Consider possible security implications associated with call module. https://bandit.readthedocs.io/en/latest/blacklists/blacklist_imports.html#b404-import-subprocess
print(f'Shutting down tf.data.service dispatcher via: [{dispatcher_host}:{SHUTDOWN_PORT}]')
import socket
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.connect((dispatcher_host, SHUTDOWN_PORT))
Recommendation generated by Amazon CodeGuru Reviewer. Leave feedback on this recommendation by replying to the comment or by reacting to the comment using emoji.
To ensure your Python socket has a timeout, call socket.socket.settimeout() before you call socket.socket.connect(). A new Python socket by default doesn't have a timeout (its timeout defaults to None).
print(f'Shutting down data service dispatcher via: [{dispatcher_host}:{SHUTDOWN_PORT}]')
import socket
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.connect((dispatcher_host, SHUTDOWN_PORT))
Recommendation generated by Amazon CodeGuru Reviewer. Leave feedback on this recommendation by replying to the comment or by reacting to the comment using emoji.
To ensure your Python socket has a timeout, call socket.socket.settimeout() before you call socket.socket.connect(). A new Python socket by default doesn't have a timeout (its timeout defaults to None).
def get_stub(self):
    host = 'localhost'
    channel = grpc.insecure_channel(f'{host}:6000',
Recommendation generated by Amazon CodeGuru Reviewer. Leave feedback on this recommendation by replying to the comment or by reacting to the comment using emoji.
Problem
This line of code might contain a resource leak. Resource leaks can cause your system to slow down or crash.
Fix
Consider closing the following resource: channel. The resource is allocated by call grpc.insecure_channel. Currently, there are execution paths that do not contain closure statements, for example, when grpc.channel_ready_future() throws an exception. Close channel in a try-finally block to prevent the resource leak.
queuing_process.terminate()
logger.info('Shutdown done.')
import os, time
os.system('kill -9 %d' % os.getpid())
Recommendation generated by Amazon CodeGuru Reviewer. Leave feedback on this recommendation by replying to the comment or by reacting to the comment using emoji.
This line of code makes an outdated API call to start and communicate with processes. We recommend that you use the subprocess module to start new processes, connect with their pipes, and get their return codes.
SHUTDOWN_PORT = 16000
import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('', SHUTDOWN_PORT))
Recommendation generated by Amazon CodeGuru Reviewer. Leave feedback on this recommendation by replying to the comment or by reacting to the comment using emoji.
Binding a socket with an empty IP address causes the bound address to default to 0.0.0.0. This might bind a socket to all interfaces, which opens the socket to traffic from any IPv4 address and creates security risks.
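Binding to an explicit address narrows the exposure. The sketch below is an assumption about how the listener could be written, with hypothetical names `open_shutdown_listener` and `bind_host`. In a multi-instance job the shutdown signal comes from another host, so the right value is probably the instance's private address rather than loopback; that choice is left to the caller.

```python
import socket


def open_shutdown_listener(bind_host: str, port: int) -> socket.socket:
    """Listen on one explicit interface instead of all interfaces."""
    if not bind_host:
        # An empty host would silently fall back to 0.0.0.0.
        raise ValueError("bind_host must be an explicit address")
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind((bind_host, port))
        s.listen(1)
    except OSError:
        s.close()  # do not leak the descriptor if bind/listen fails
        raise
    return s
```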
queuing_process.terminate()
print('Shutdown done.')
import os, time
os.system('kill %d' % os.getpid())
Recommendation generated by Amazon CodeGuru Reviewer. Leave feedback on this recommendation by replying to the comment or by reacting to the comment using emoji.
This line of code makes an outdated API call to start and communicate with processes. We recommend that you use the subprocess module to start new processes, connect with their pipes, and get their return codes.
Similar issue at line number 125.
import subprocess
params = ["python", f"./{name}"] + sys.argv[1:] + additional_args
print(f'Opening process: {params}')
p = subprocess.run(params)
Recommendation generated by Amazon CodeGuru Reviewer. Leave feedback on this recommendation by replying to the comment or by reacting to the comment using emoji.
Passing user-provided input to eval and exec functions without sanitization makes your code vulnerable to code injection. Make sure you implement input validation or use secure functions.
conn, addr = s.accept()
print('Received shutdown signal from: ', addr)
try:
    conn.close()
Recommendation generated by Amazon CodeGuru Reviewer. Leave feedback on this recommendation by replying to the comment or by reacting to the comment using emoji.
To close a socket and immediately release its associated resources, call socket.shutdown() before you call socket.close().
Similar issue at line number 139.
* Initial files to show Triton fil example with Training using RAPIDS a… (aws#3524) * Initial files to show Triton fil example with Training using RAPIDS and deploying ensemble for inference time using Conda * Applied review suggestions and corrected spelling, grammar, link references, and code to call proper wait method instead of creating our own * Fixed URL for when this will be posted to proper repo * Refined endpoint waiting logic * Changed wording of informational paragraphs * Update wait=True to ensure training job completes before tuning job is launched (aws#3538) * Deep ar forecast comparison notebooks (aws#3533) * Initial Draft of Forecasting Service Comparison Notebook * added DeepAR example * Cleaned up Example * DeepAR and Forecast Examples * Added util in response to comments * Added Notebook Series and Markdown * Edited Example Files * Changed README due to comments, modified util files by removing unnecessary functions and commented util files Co-authored-by: Jiang <[email protected]> * Added Model Registry Code (aws#3534) * added model registry code Added model registry code and updated the model deployment from model registry. * Black formatting completed * Black formatting completed. 
Resolved the comments Co-authored-by: Mani Khanuja <[email protected]> * Fix scikit_learn_data_processing_and_model_evaluation.ipynb (aws#3539) * enable optional steps to avoid error being raised in scikit_learn_data_processing_and_model_evaluation.ipynb * edit markdown * reformat * fix working-with-tfrecords.ipynb (aws#3542) * fix advanced_functionality/causal-inference/causal-inference-container.ipynb (aws#3544) * fix advanced_functionality/causal-inference/causal-inference-container.ipynb * fix login command * fix login * fix login * fix login Co-authored-by: EC2 Default User <[email protected]> * fix pipe_bring_your_own.ipynb (aws#3547) * fix pipe_bring_your_own.ipynb * login before pushing to docker * login before pushing to docker * fix login issues * fix login issues * revert login fix code Co-authored-by: EC2 Default User <[email protected]> * fix sagemaker-pipelines/time_series_forecasting/amazon_forecast_pipeline/sm_pipeline_with_amazon_forecast.ipynb (aws#3548) Co-authored-by: EC2 Default User <[email protected]> * rename FastAPI Example.ipynb (aws#3550) Co-authored-by: EC2 Default User <[email protected]> * fix RestRServe Example (aws#3553) * rename Plumber Example.ipynb (aws#3551) Co-authored-by: EC2 Default User <[email protected]> * change: Update callback step notebook as per recent sdk changes and fix existing issues (aws#3516) Co-authored-by: Dewen Qi <[email protected]> Co-authored-by: Julia Kroll <[email protected]> * Implement Kendra search in RTD website (aws#3537) * implement unified search in RTD website * add sagemaker-debugger rtd to unified search * add licensing information * add licensing information * add licensing information * add licensing information * Added local mode notebook (aws#3549) * Added local mode notebook * Updated local mode notebook * Updated sklearn version. 
Added conclusion * Fixed whitespace issue Co-authored-by: Julia Kroll <[email protected]> * Fix 'JSONLines' -> 'JSON Lines' (aws#3554) Co-authored-by: atqy <[email protected]> * fix multi_model_catboost.ipynb (aws#3561) Co-authored-by: EC2 Default User <[email protected]> * fix scikit_bring_your_own.ipynb (aws#3552) * fix scikit_bring_your_own.ipynb * debug * debug * debug * debug * cleanup * cleanup * cleanup Co-authored-by: EC2 Default User <[email protected]> * fix tune_r_bring_your_own.ipynb (aws#3562) * delete r_examples/r_api_serving_examples (aws#3564) * delete paddlepaddle_sentiment_analysis_byo_mms (aws#3565) * Fix 'JSONLines' -> 'JSON Lines' (aws#3558) Co-authored-by: atqy <[email protected]> * Fix 'JSONLines' -> 'JSON Lines' (aws#3555) Co-authored-by: atqy <[email protected]> * Fix 'JSONLines' -> 'JSON Lines' (aws#3556) Co-authored-by: atqy <[email protected]> * Update the studio kernal notebook to TF 2.6 (aws#3568) Changed the studio notebook TF 2.6 Verified the changes by local testing * update pytorch DLC version to 1.11 in pytorch mnist sample (aws#3574) * update pytorch DLC version to 1.11 The notebook fails with current 1.8 pytorch. I think its a problem with the torchvision installed in the container. ``` AlgorithmError: ExecuteUserScriptError: Command "/opt/conda/bin/python3.6 mnist.py --backend gloo --epochs 1" INFO:__main__:Initialized the distributed environment: 'gloo' backend on 2 nodes. Current host rank is 0. 
Number of gpus: 0 INFO:__main__:Get train data loader Traceback (most recent call last): File "mnist.py", line 257, in <module> train(parser.parse_args()) File "mnist.py", line 114, in train train_loader = _get_train_data_loader(args.batch_size, args.data_dir, is_distributed, **kwargs) File "mnist.py", line 48, in _get_train_data_loader [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))] File "/opt/conda/lib/python3.6/site-packages/torchvision/datasets/mnist.py", line 83, in __init__ ' You can use download=True to download it') RuntimeError: Dataset not found. You can use download=True to download it, exit code: 1 ``` * formatting * l = 100 * fix rapids_sagemaker_hpo.ipynb (aws#3545) * fix batch_transform_pca_dbscan_movie_clusters_notebook.ipynb (aws#3566) * fix batch_transform_pca_dbscan_movie_clusters.ipynb * lower test sample * cleanup * lower test percentage * lower test percentage * lower test percentage Co-authored-by: EC2 Default User <[email protected]> * add new example notebook to compare sagemaker lightgbm catboost autogluon and tabtransformer with AMT on customer churn dataset (aws#3573) * add new example notebook to compare sagemaker lightgbm catboost autogluon and tabtransformer with AMT on customer churn dataset * add new example notebook to compare sagemaker lightgbm catboost autogluon and tabtransformer with AMT on customer churn dataset * Add SageMaker Autopilot and Neo4j portfolio churn notebook. (aws#3505) * Add SageMaker Autopilot and Neo4j portfolio churn notebook. 
* update table of contents for graph embedding notebook * correct link * newline * note on edgar, s3 * notes on ASG * url anonymized * spelling * use s3 * spelling * name for link * comment drop * formatting * 20 minutes * more descriptive va name * branding issues * remove extra comment * note on validation * conclusion * no more ' * brackets on URL * black-nb -l 100 sagemaker_autopilot_neo4j_portfolio_churn.ipynb * incorporate Julia changes to downloadNotebook function * performance issue * working with large notebook * clear outputs. run linter one more time * typo * render link * format * remove link * insert link * no dash * fiddling w link * maybe it's a bad character escape? * AutoPilot caps * camel case SageMaker * bucket specfics * Bump version to 4.4.9 from 4.4.8 * add stack name, disk size * add note per Aramide on stack delete. * note * typos Co-authored-by: Julia Kroll <[email protected]> * Updated the serialisation function for CSV (aws#3580) Fixed string formatting issue for inference * Built-in Algorithm: TensorFlow Image Classification (aws#3579) * TF IC notebook * TF IC notebook * TF IC notebook Co-authored-by: username <[email protected]> Co-authored-by: atqy <[email protected]> * Add RTD Search Filters (aws#3581) * add filters * correct search url * change search textbox * change search box text * remove AWS in AWS Dev Guide * cleanup * more cleanup * built-in algorithm - tensorflow image classification: Pull Cloudwatch logs (aws#3590) Co-authored-by: Vivek Madan <[email protected]> * Pipeline local mode (aws#3587) * Add notebook that transitions back to SageMaker managed pipeline after valid local mode pipeline. * Added comments about how to locate CloudWatch logs for Training step output. * Added optional lookup of SageMaker Execution Role for local laptop runs. * Renamed new notebook to name of pre-existing local-mode notebook. * Re-formatted code cells with black-nb; removed cell output. 
* Changed SKLearnProcessor framework version back to 1.0-1 * reformat Co-authored-by: atqy <[email protected]> Co-authored-by: atqy <[email protected]> * Add GPT large inference notebook (aws#3594) * CLI upgrade * reformat * grammatical changes Co-authored-by: Qingwei Li <[email protected]> Co-authored-by: atqy <[email protected]> * Updating Training Compiler Single Node Multi GPU notebook to use HF-PT 1.11 (aws#3593) * Adding new CV notebook for distributed training with PT 1.11 * Upgrading notebook to demonstrate PT 1.11 capabilities * Removing stale files * Renaming notebook * Retry tests * Upgrading numpy and pandas installation * Minor correction in wording * Boto3 version notebook (aws#3597) * CLI upgrade * reformat * grammatical changes * boto3 version * boto3 version-with minor change * serving.perperties remove empty line * set env variable for tensor_parallel_degree * grammatic fix * black-nb * grammatical change * endpoint_name fix * "By" cap * minor change Co-authored-by: Qingwei Li <[email protected]> Co-authored-by: atqy <[email protected]> Co-authored-by: atqy <[email protected]> * Add TensorFlow Triton example (aws#3543) * Add CatBoost MME BYOC example * formatted * Resolving comment # 1 and 2 * Resolving comment # 1 and 2 * Resolving comment # 4 * Resolving clean up comment * Added comments about CatBoost and usage for MME * Reformatted the jupyter file * Added the container with the relevant py files * Added formatting using Black. Also fixed the comments from the Jupyter file * Added formatting using Black. Also fixed the comments from the Jupyter file * Added formatting using Black. 
Also fixed the comments from the Jupyter file * Add TensorFlow Triton example * format TensorFlow Triton example * Action feedback * Fix link(s) to be descriptive * Formatted * Update delete cell Co-authored-by: rsgrewal <[email protected]> Co-authored-by: atqy <[email protected]> * SageMaker-Debugger PT zcc deprecation (aws#3591) * Updated CNN class activation example for PT 1.12 ZCC deprecation * Updated PyTorch MNIST script change example * updated iterative model pruning examples to PT 1.12 * Updated profiler examples to be nonzcc * Changed nll_loss to NLLLoss * Fixed build issues * Removed vscode metadata from notebooks * renamed experiments to be model specific * Add standalone visual object detection notebook. (aws#3586) * Add standalone visual object detection notebook. * Debug the upload issue - previously the CI test failed at uplaading .rec to s3. - use absolute path instead * Debug code change * Debug * Use aws s3 cp to upload data to s3 * Use aws s3 cp to upload data to s3 * Test will small number of training epochs. * Try to fix the opencv issue by using python3.8 * Try to fix the opencv issue - remove the 'opencv-python-headless<4.3' restriction * Downgrade opencv try to resolve the opencv issue. - ref: https://stackoverflow.com/a/72812857 * Update opencv version trying to resolve the AttributeError issue. * opendv-python 4.6.0.66 not working, change to 4.5.5.64 * Change to pytorch 1.8 python 3.6 kernel * Address all comments from the reviewer - move all behind-the-scene package installation to the beginning of the notebook - polish the README file and address all concerns from the reviewer * Change to pytorch 1.8 and python 3.6 kernel * Remove most outputs in the notebook. 
Co-authored-by: Tao Sun <[email protected]> * Add visual object detection notebook to README (aws#3605) Co-authored-by: atqy <[email protected]> * Sagemaker DataWrangler Samples addition (aws#3510) * Create readme.md * Add files via upload Joined flow added * Add files via upload * Add files via upload * Add files via upload * Delete TS-Workshop-Advanced.ipynb * Delete TS-Workshop-Cleanup.ipynb * Delete TS-Workshop.ipynb * Add files via upload Updated after the CI errors * Create test.txt * Add files via upload * Delete sagemaker-datawrangler/timeseries-dataflow/pictures directory * Delete timeseries.flow * Add files via upload * Add files via upload * Add files via upload * Update index.rst * Add files via upload Added rst file for joined * Add files via upload added tabular index.rst file * Add files via upload Uploaded index.rst for time series data * Delete sagemaker-datawrangler/tabular-dataflow/img directory Images are now in S3 bucket so deleting this * Update README.md updating image links with s3 links * Update and rename sagemaker-datawrangler/tabular-dataflow/Data-Exploration.md to sagemaker-datawrangler/tabular-dataflow/data-exploration/Data-Exploration.md updating image link and folder * Add files via upload uploading index.rst * Update and rename sagemaker-datawrangler/tabular-dataflow/Data-Import.md to sagemaker-datawrangler/tabular-dataflow/data-import/Data-Import.md updated image links * Add files via upload index.rst for data import * Update Data-Transformations.md * Rename sagemaker-datawrangler/tabular-dataflow/Data-Transformations.md to sagemaker-datawrangler/tabular-dataflow/data-transformations/Data-Transformations.md * Add files via upload * Update readme.md * Delete sagemaker-datawrangler/joined-dataflow/img directory * Update readme.md * Delete sagemaker-datawrangler/timeseries-dataflow/img directory * Update index.rst * Update index.rst Updated index.rst to link to other files * Update index.rst * Update index.rst * Update index.rst * 
Update index.rst * Update index.rst * Update index.rst * Update index.rst * Update index.rst * Update index.rst * Update README.md referring to /readme.md * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * Update index.rst * Update index.rst * Update index.rst * Update index.rst * Add files via upload * Add files via upload * Update index.rst * Create index.rst * Update index.rst * Update index.rst * Add files via upload * Update index.rst * Update index.rst * Update index.rst * Update index.rst * Update index.rst * Update index.rst * Update index.rst * Update index.rst * Update index.rst * Update index.rst * Update index.rst * Update index.rst * Update index.rst * Update index.rst * Update index.rst * Update index.rst * Delete sagemaker-datawrangler/import-flow directory * Update index.rst * Update index.rst * Update index.rst * Update index.rst * Update index.rst * Update index.rst * Update index.rst added data wrangler to the prep section * Update index.rst * Update index.rst * Add files via upload Updated per comments from aqyt * Update explore_data.ipynb Updated per Amelia comment - present tense * Update index.rst Grammer * Update index.rst Grammer * Update index.rst * Update import-flow.md Co-authored-by: atqy <[email protected]> Co-authored-by: Aaron Markham <[email protected]> * Updated instructions to mention streamings jobs are not supported on GT Console (aws#3608) Co-authored-by: atqy <[email protected]> * "docker tag" call improvement (aws#3604) * CLI upgrade * reformat * grammatical changes * boto3 version * boto3 version-with minor change * serving.perperties remove empty line * set env variable for tensor_parallel_degree * grammatic fix * black-nb * grammatical change * endpoint_name fix * "By" cap * minor change * docker tag call improvement Co-authored-by: Qingwei Li <[email protected]> Co-authored-by: atqy <[email protected]> Co-authored-by: atqy <[email protected]> Co-authored-by: 
Aaron Markham <[email protected]> * Update SageMaker Training Compiler Example Notebooks for PT1.11 (aws#3592) * update pytorch_single_gpu_single_node example notebooks * edit estimator from PyTorch to HuggingFace * update parameters and fix grammar for roberta-base and bert-base-cased notebook * update parameters for albert-base-v2 notebook and reformat it * fix grammar mistake * fix syntax errors and update albert-base-v2 analysis part * fix panda and numpy version * rerun tests * edit code format Co-authored-by: Bruce Zhang <[email protected]> Co-authored-by: Aaron Markham <[email protected]> Co-authored-by: atqy <[email protected]> * Add ContainerConfig example comment to ir notebooks (aws#3600) * Add ContainerConfig example comment to ir notebooks * adding containerConfig md to rest of the notebooks * add containerConfig md and handle missing variantName * rerun pr tests * rerun pr tests * rerun pr tests * rerun pr tests Co-authored-by: Gary Wang <[email protected]> * Added Structure for Inferencing examples (aws#3602) * Inference recommender fix typos (aws#3226) * Changed FailedReason to FailureReason in JSON query * Fixed inference typo in failure print statements * replaced client with inference_client Co-authored-by: Aaron Markham <[email protected]> * Adding Heterogeneous Clusters example for TensorFlow and PyTorch (aws#3599) * initial commit * notebook fix and misspelling * add link from root readme.md * switching cifar-10 to artificial dataset for TF * adding retries to fit() * grammer fixes * remove cifar references * Removing local tf and pt execution exmaples * Add security group info for private VPC use case * Adding index.rst for heterogeneous clusters * fix PT notebook heading for rst * fix rst and notebook tables for rst * Adding programmatic kernel restart * removing programmatic kernel restart - breaks CI * Remove tables that don't render in RST * [Feature]Add Online Explainability notebooks for SageMaker Clarify (aws#3613) * Add Online 
Explainability notebooks for SageMaker Clarify * Correcting text in clean-up sections of online explainability example notebooks * Updating install commands for captum and sagemaker pypy packages * debug captum installation * change instance type Co-authored-by: Aaron Markham <[email protected]> Co-authored-by: atqy <[email protected]> Co-authored-by: atqy <[email protected]> * updating rst files (aws#3619) * Added sentence transformers example with TensorRT and Triton Ensemble (aws#3615) * Added sentence transformers example with TensorRT and Triton Ensemble * Notebook changes to pass CI build * Grammar fixes and installing torch for CI build * Installing torch to pass CI build Co-authored-by: atqy <[email protected]> * Bump protobuf (aws#3616) Bumps [protobuf](https://github.com/protocolbuffers/protobuf) from 3.20.1 to 3.20.2. - [Release notes](https://github.com/protocolbuffers/protobuf/releases) - [Changelog](https://github.com/protocolbuffers/protobuf/blob/main/generate_changelog.py) - [Commits](protocolbuffers/protobuf@v3.20.1...v3.20.2) --- updated-dependencies: - dependency-name: protobuf dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Aaron Markham <[email protected]> * Fixing outofdate readme.md for heterogeneous clusters feature (aws#3617) * initial commit * notebook fix and misspelling * add link from root readme.md * switching cifar-10 to artificial dataset for TF * adding retries to fit() * grammer fixes * remove cifar references * Removing local tf and pt execution exmaples * Add security group info for private VPC use case * Adding index.rst for heterogeneous clusters * fix PT notebook heading for rst * fix rst and notebook tables for rst * Adding programmatic kernel restart * removing programmatic kernel restart - breaks CI * Remove tables that don't render in RST * updating outofdate readme.md * Fix 'JSONLines' -> 'JSON Lines' (aws#3557) * Fix 'JSONLines' -> 'JSON Lines' * Open a subset of ~10k S3 files to reduce runtime Co-authored-by: Aaron Markham <[email protected]> * Update SMMP GPT sample (aws#3433) * update smp * update smp * fp16 change * minor fix * minor fix * pin transformer version * Update SMMP notebooks * update gpt2 script * update notebook * minor fix * minor fix * minor fix * minor fix * fix * update gptj script and noteboook * update memory tracker * minor fix * fix * fix gptj notebook * Update training/distributed_training/pytorch/model_parallel/gpt-j/11_train_gptj_smp_tensor_parallel_notebook.ipynb Co-authored-by: Miyoung <[email protected]> * Fix typos&expressions * reformat Co-authored-by: Miyoung <[email protected]> Co-authored-by: Aaron Markham <[email protected]> * Add Sharded Data Parallel notebook (aws#3622) * add sdp notebook * minor fix Co-authored-by: Miyoung <[email protected]> * minor fix Co-authored-by: Miyoung <[email protected]> * minor fix Co-authored-by: Miyoung <[email protected]> * minor fix Co-authored-by: Miyoung <[email protected]> * review & add 
additional references * revert the title fix * Update README.md * run black-nb formatting * incorporate feedback * Update training/distributed_training/pytorch/model_parallel/gpt2/smp-train-gpt-simple-sharded-data-parallel.ipynb Co-authored-by: erinho <[email protected]> Co-authored-by: Miyoung <[email protected]> Co-authored-by: Miyoung Choi <[email protected]> * JumpStart Tensorflow Object Detection algorithm notebook (aws#3624) * JumpStart Tensorflow Object Detection algorithm notebook * JumpStart Amazon Tensorflow notebook * typo fix * Update SageMaker Training Compiler MNMG Example Notebook for PT1.11 (aws#3611) * update mnmg notebook and test file * edit parameters for estimators * fix format * edit by comments and update learning rate * turn off amp * change dataset from sst2 to wikitext * edit package install and add comments for ptxla * fix comments * fix grammar Co-authored-by: BruceZhang@eitug <[email protected]> * Creating SageMaker Autopilot/Pipelines example. (aws#3627) * Creating SageMaker Autopilot/Pipelines example. * Applying black code formatter to notebook. Co-authored-by: atqy <[email protected]> * Integrate SageMaker Automatic Model Tuning (HPO) with one XGBoost Abalone notebook. (aws#3623) * Integrate SageMaker Automatic Model Tuning (HPO) with one XGBoost Abalone notebook. * Addressed comments for HPO integration. 
Co-authored-by: Aaron Markham <[email protected]> * Launch Feature - SageMaker Multi-model endpoints on GPU (aws#3625) * added MME with GPU code * added mme on gpu code * removed mme on gpu code * removed outputs from the notebook * added notebook metadata with gpu instance type * test * test * test * test * test * correct folder spelling Co-authored-by: atqy <[email protected]> Co-authored-by: atqy <[email protected]> * updated autoscaling metrics (aws#3633) * change the job names to be unified with all the other jobs in JumpStart (aws#3631) Co-authored-by: atqy <[email protected]> * [FEATURE] Add SageMaker Pipeline local mode example with BYOC and FrameworkProcessor (aws#3614) * added framework-processor-local-pipelines * black-np on notebook * updated README.md * solving problems for commit id fc80e0d * solved formatting problem in notebook * reviewed notebook content, added dataset description, download dataset ffrom public sagemaker s3 bucket * grammar check * changed dataset to synthetic transactions dataset * removed reference to dataset origin * updated to main branch * fixing grammar spell Co-authored-by: Aaron Markham <[email protected]> * updated sagemaker triton to v22.09 (aws#3634) * updated sagemaker triton to v22.09 * black nb format notebook Co-authored-by: atqy <[email protected]> * Reverting to v22.07 (aws#3637) * reverting to v22.07 * fixed formating issue * added images to fix format issue * Pipeline Step Caching Example Notebook (aws#3638) * feature: pipeline caching notebook example * change: initialize notebook * feature: pipeline caching notebook example and tuning notebook adjustment * fix: example notebook * change: README * fix: notebook code * fix: grammar * fix: more grammar * fix: pr syntax and remove dataset * fix: updated paths * fix: tuning notebook formatting * fix: more path corrections Co-authored-by: Brock Wade <[email protected]> * change: Pipeline Caching Example Notebook Improvements (aws#3640) * feature: pipeline caching 
notebook example * change: initialize notebook * feature: pipeline caching notebook example and tuning notebook adjustment * fix: example notebook * change: README * fix: notebook code * fix: grammar * fix: more grammar * fix: pr syntax and remove dataset * fix: updated paths * fix: tuning notebook formatting * fix: more path corrections * feature: more commentary, notebook improvements * fix: grammar * fix: use present tense Co-authored-by: Brock Wade <[email protected]> Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: James Park <[email protected]> Co-authored-by: Shreya Pandit <[email protected]> Co-authored-by: byj-aws <[email protected]> Co-authored-by: Jiang <[email protected]> Co-authored-by: rsgrewal-aws <[email protected]> Co-authored-by: Mani Khanuja <[email protected]> Co-authored-by: EC2 Default User <[email protected]> Co-authored-by: EC2 Default User <[email protected]> Co-authored-by: qidewenwhen <[email protected]> Co-authored-by: Dewen Qi <[email protected]> Co-authored-by: Julia Kroll <[email protected]> Co-authored-by: Kirit Thadaka <[email protected]> Co-authored-by: Mohan Gandhi <[email protected]> Co-authored-by: Suraj Kota <[email protected]> Co-authored-by: Xin Huang <[email protected]> Co-authored-by: Ben Lackey <[email protected]> Co-authored-by: duk-amz <[email protected]> Co-authored-by: khetan2 <[email protected]> Co-authored-by: username <[email protected]> Co-authored-by: vivekmadan2 <[email protected]> Co-authored-by: Vivek Madan <[email protected]> Co-authored-by: Paul Hargis <[email protected]> Co-authored-by: Qingwei Li <[email protected]> Co-authored-by: Qingwei Li <[email protected]> Co-authored-by: Loki <[email protected]> Co-authored-by: Marc Karp <[email protected]> Co-authored-by: rsgrewal <[email protected]> Co-authored-by: Jihyeong Lee <[email protected]> Co-authored-by: Tao Sun <[email protected]> Co-authored-by: Tao Sun <[email protected]> Co-authored-by: neelamkoshiya <[email protected]> 
Co-authored-by: Aaron Markham <[email protected]> Co-authored-by: Parth Brahmbhatt <[email protected]> Co-authored-by: Dingheng (Bruce) Zhang <[email protected]> Co-authored-by: Bruce Zhang <[email protected]> Co-authored-by: Gary Wang <[email protected]> Co-authored-by: Gary Wang <[email protected]> Co-authored-by: Noah Luna <[email protected]> Co-authored-by: Gili Nachum <[email protected]> Co-authored-by: Aman Malhotra <[email protected]> Co-authored-by: AnushaVelumani <[email protected]> Co-authored-by: João Moura <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Gili Nachum <[email protected]> Co-authored-by: haohanchen-yagao <[email protected]> Co-authored-by: Miyoung <[email protected]> Co-authored-by: Erin <[email protected]> Co-authored-by: erinho <[email protected]> Co-authored-by: Miyoung Choi <[email protected]> Co-authored-by: Marcelo Aberle <[email protected]> Co-authored-by: Choucri Bechir <[email protected]> Co-authored-by: evikram <[email protected]> Co-authored-by: Bruno Pistone <[email protected]> Co-authored-by: Brock Wade <[email protected]> Co-authored-by: Brock Wade <[email protected]>
…#3599)
* initial commit
* notebook fix and misspelling
* add link from root readme.md
* switching cifar-10 to artificial dataset for TF
* adding retries to fit()
* grammer fixes
* remove cifar references
* Removing local tf and pt execution exmaples
* Add security group info for private VPC use case
* Adding index.rst for heterogeneous clusters
* fix PT notebook heading for rst
* fix rst and notebook tables for rst
* Adding programmatic kernel restart
* removing programmatic kernel restart - breaks CI
* Remove tables that don't render in RST
Description of changes: Heterogeneous Clusters for Amazon SageMaker model training was announced in July 2022. This PR adds extensive code examples for using the feature with TensorFlow and PyTorch.
Testing done: Ran the TensorFlow and PyTorch notebooks (2 in total).
Merge Checklist
Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.
black-nb -l 100 {path}/{notebook-name}.ipynb
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.