Integrate SageMaker Automatic Model Tuning (HPO) with one XGBoost Abalone notebook. #3623
@@ -2,7 +2,9 @@ | |||
"cells": [ |
True, fixed.
@@ -2,7 +2,9 @@ | |||
"cells": [ |
Line #58. training_of_model_to_be_hosted = training_job_name
I believe this is redundant, since you set this variable in the hosting cell.
Good catch, removing. This is a leftover from a refactoring I did while cleaning up my initial implementation.
@@ -2,7 +2,9 @@ | |||
"cells": [ |
Line #62. status = client.describe_training_job(TrainingJobName=job_name)["TrainingJobStatus"]
Shouldn't the parameter passed to the describe_training_job() call be training_job_name? Did you have any issues during notebook execution?
You're right. I don't recall having any issues, but I do hit one when I run it again now. The last thing I did before creating the PR was to rename a couple of variables (splitting job_name into tuning_job_name and training_job_name), and it seems I forgot to rerun the entire notebook after that.
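For reference, a minimal sketch of the status check under discussion, using the renamed variable. It assumes a boto3 SageMaker client is passed in; the helper name, the `poll_seconds` parameter, and the injectable `sleep` are illustrative and not taken from the notebook.

```python
import time

# States after which a training job no longer changes.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(client, training_job_name, poll_seconds=60, sleep=time.sleep):
    """Poll describe_training_job until the job reaches a terminal state.

    `client` is expected to behave like boto3.client("sagemaker").
    The helper itself is hypothetical; only the API call comes from the notebook.
    """
    while True:
        status = client.describe_training_job(TrainingJobName=training_job_name)[
            "TrainingJobStatus"
        ]
        print(status)
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)


# Usage (needs AWS credentials, so it is left commented out):
#   import boto3
#   client = boto3.client("sagemaker")
#   wait_for_training_job(client, training_job_name)
```

Passing the job name explicitly means a stale variable such as the old job_name fails immediately with a NameError at the call site instead of silently polling the wrong job.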
@@ -2,7 +2,9 @@ | |||
"cells": [ |
Line #61. "InputDataConfig": [
To reduce verbosity and line count, can you move the fields that training and tuning share in training_job_definition (such as InputDataConfig, OutputDataConfig, ResourceConfig, and RoleArn) into a common object and reuse it for both?
I thought about that, but then decided that a bit of redundancy is acceptable if it helps the customer. In my mind, it is easier for the customer to have all training-related or tuning-related config in a single section that they can quickly copy and paste, instead of having to piece things together from different cells.
I understand your point, but I also think that reusing fields makes it more evident to the user that much of the TuningJob configuration is actually shared with the TrainingJob, so they can focus on the "actual" tuning configs. But I trust your instinct on that.
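To make the trade-off concrete, here is a rough sketch of what the shared-object variant could look like. Every value below (bucket, role ARN, image URI, instance type, hyperparameters) is a hypothetical placeholder rather than the notebook's actual configuration; only the field names come from the SageMaker API.

```python
import copy

# Hypothetical placeholders; substitute your own bucket, prefix, role, and image.
bucket = "my-bucket"
prefix = "xgboost-abalone"
training_image = "<xgboost-training-image-uri>"

# Fields that the training job and the tuning job's TrainingJobDefinition share.
shared_job_fields = {
    "RoleArn": "arn:aws:iam::123456789012:role/MySageMakerRole",
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "ContentType": "libsvm",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": f"s3://{bucket}/{prefix}/train",
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
        }
    ],
    "OutputDataConfig": {"S3OutputPath": f"s3://{bucket}/{prefix}/output"},
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.m5.2xlarge",
        "VolumeSizeInGB": 5,
    },
}

algorithm = {"TrainingImage": training_image, "TrainingInputMode": "File"}
stopping = {"MaxRuntimeInSeconds": 3600}

# Keyword arguments for create_training_job: all hyperparameters are fixed.
training_job_params = {
    "TrainingJobName": "my-training-job",
    "AlgorithmSpecification": algorithm,
    "HyperParameters": {"objective": "reg:squarederror", "num_round": "50"},
    "StoppingCondition": stopping,
    **copy.deepcopy(shared_job_fields),
}

# TrainingJobDefinition for create_hyper_parameter_tuning_job: only the
# hyperparameters that are not being tuned go in StaticHyperParameters.
training_job_definition = {
    "AlgorithmSpecification": algorithm,
    "StaticHyperParameters": {"objective": "reg:squarederror", "num_round": "50"},
    "StoppingCondition": stopping,
    **copy.deepcopy(shared_job_fields),
}
```

The deep copies keep the two requests independent, so editing one cell's config does not silently change the other. The cost is the one raised above: a reader has to look in two places to see the full training request.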
@@ -2,7 +2,9 @@ | |||
"cells": [ |
Fixed.
@@ -2,7 +2,9 @@ | |||
"cells": [ |
Line #5. tuning_job_config = {
Since this is meant to be an introductory notebook, it might be beneficial to add some basic exposition on how the tuning config is defined here, or some links to the HPO documentation so readers can learn more outside the notebook. Someone with no experience might not even know where gamma, eta, etc. come from.
Good point. I'll add a comment about that at the top of the section and link to this page, which explains all the points you raised: https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-ex-tuning-job.html
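As an illustration of the kind of exposition being requested, a tuning config in the shape the SageMaker API expects might look like the sketch below. The ranges, the job limits, and the choice of objective metric are assumptions made for the example, not the values used in the notebook; gamma, eta, and max_depth are XGBoost hyperparameters.

```python
# Illustrative values only; the notebook's actual ranges and limits may differ.
tuning_job_config = {
    # Search strategy used to pick the next hyperparameter combinations.
    "Strategy": "Bayesian",
    # Metric that the tuning job tries to optimize.
    "HyperParameterTuningJobObjective": {
        "Type": "Minimize",
        "MetricName": "validation:rmse",
    },
    # How many training jobs to run in total and at the same time.
    "ResourceLimits": {
        "MaxNumberOfTrainingJobs": 12,
        "MaxParallelTrainingJobs": 3,
    },
    # Search space; the API takes range bounds as strings.
    "ParameterRanges": {
        "ContinuousParameterRanges": [
            {"Name": "eta", "MinValue": "0.1", "MaxValue": "0.5"},  # learning rate
            {"Name": "gamma", "MinValue": "0", "MaxValue": "5"},  # min split loss
        ],
        "IntegerParameterRanges": [
            {"Name": "max_depth", "MinValue": "1", "MaxValue": "10"},
        ],
        "CategoricalParameterRanges": [],
    },
}


def tuned_parameter_names(config):
    """Return the names of all hyperparameters the tuning job will search over."""
    ranges = config["ParameterRanges"]
    return sorted(entry["Name"] for group in ranges.values() for entry in group)


print(tuned_parameter_names(tuning_job_config))
```

Any hyperparameter that does not appear in ParameterRanges has to be supplied as a fixed value in the StaticHyperParameters of the training job definition.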
@@ -2,7 +2,9 @@ | |||
"cells": [ |
Line #51. # Change parallel training jobs run by AMT to reduce total training time, constrained by your account limits.
Do we know what the default is? We don't have to note it here necessarily, but maybe it's worth calling out in case users are tempted to try a ridiculous number.
Yes, they're explained here: https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-limits.html. I will add this link right before the ResourceLimits key.
@@ -2,7 +2,9 @@ | |||
"cells": [ |
Line #52. # if max_jobs=max_parallel_jobs then Bayesian search turns to Random.
If there's a concise way to explain the practical effect this might have on this model that might be helpful and/or interesting to readers - not a big deal, though.
The explanation is as follows:
For Random Search there's no cost. It's simply better to use a higher MaxParallelTrainingJobs (up to the limits) for maximum speed.
For Bayesian search, the cost is that the higher MaxParallelTrainingJobs is, the more Bayesian search behaves like Random Search. The reason is that Bayesian search won't have all the sequential information it needs to make the best decision about which hyperparameters to try next. This means it may need more training jobs overall to find the optimum when the parallel factor is high.
It's a bit hard to summarise the above concisely, but part of it is already mentioned in the comment.
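One way to make the point above tangible is to count how many times the search can learn from finished jobs. The sketch below is a deliberate simplification: it assumes jobs run in lock-step batches, whereas the service can start a new job as soon as another finishes. The function name is hypothetical.

```python
import math


def sequential_rounds(max_jobs, max_parallel_jobs):
    """Number of lock-step batches needed to run max_jobs training jobs.

    Each batch after the first can use the results of earlier batches, so
    fewer batches means less feedback for a Bayesian strategy.
    """
    if max_jobs < 1 or max_parallel_jobs < 1:
        raise ValueError("both arguments must be at least 1")
    return math.ceil(max_jobs / max_parallel_jobs)


for max_parallel in (1, 3, 12):
    rounds = sequential_rounds(12, max_parallel)
    print(f"MaxParallelTrainingJobs={max_parallel}: {rounds} round(s)")
```

With 12 jobs in total, a parallelism of 1 gives 12 rounds of feedback, a parallelism of 3 gives 4, and a parallelism of 12 gives a single round, at which point every configuration is chosen with no prior results, which is what Random Search does anyway.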
@@ -2,7 +2,9 @@ | |||
"cells": [ |
Line #108. print(status)
If this is going to run significantly longer than the regular training job it might be good just to note that in a comment here so users know what to expect. See below where it's noted "This takes 9-11 minutes to complete." for endpoint deployment.
Added.
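A hedged sketch of what waiting on the tuning job and picking the model to host could look like. It assumes a boto3 SageMaker client is passed in; the helper name and its parameters are illustrative, and the notebook's own loop may differ.

```python
import time

# States after which a tuning job no longer makes progress.
TUNING_TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_best_training_job(client, tuning_job_name, poll_seconds=60, sleep=time.sleep):
    """Poll the tuning job, then return the name of its best training job.

    A tuning job launches many training jobs, so expect this to take
    noticeably longer than waiting on a single training job.
    `client` is expected to behave like boto3.client("sagemaker").
    """
    while True:
        description = client.describe_hyper_parameter_tuning_job(
            HyperParameterTuningJobName=tuning_job_name
        )
        status = description["HyperParameterTuningJobStatus"]
        print(status)
        if status in TUNING_TERMINAL_STATES:
            break
        sleep(poll_seconds)

    if status != "Completed":
        raise RuntimeError(f"Tuning job {tuning_job_name} ended with status {status}")
    return description["BestTrainingJob"]["TrainingJobName"]


# Usage (needs AWS credentials, so it is left commented out):
#   import boto3
#   client = boto3.client("sagemaker")
#   training_of_model_to_be_hosted = wait_for_best_training_job(client, tuning_job_name)
```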
@@ -2,7 +2,9 @@ | |||
"cells": [ |
Is there an opportunity to compare HPO tuned models and regularly trained models here in terms of prediction outputs?
That's a good point, and it would help customers quantify the impact of using HPO vs Training, but this will require creating a separate endpoint for the model trained with Training (not a problem), and in general will increase the scope of the notebook. We could instead decide to have a dedicated "HPO vs Training" notebook whose purpose is to show that better performance could be achieved with a tuned model. I'd refrain though from adding this comparison in all the notebooks that I'm modifying (50+). What do you think?
What you said makes sense. Maybe after we've created this versus notebook we can come back and add a reference to it in a comment or something.
@@ -2,7 +2,9 @@ | |||
"cells": [ |
In "To create a tuning job using the AWS SageMaker Automatic Model Tuning API, you need to define 3 attributes.": use "three parameters" instead of "3 attributes".
In "(to specify settings...)": use "(which specifies settings...)".
In "(to configure the...": use "(which configures the...".
In "To learn more about that,": use "these" instead of "that".
These are nits, feel free to ignore.
Integrate SageMaker Automatic Model Tuning (HPO) with one XGBoost Abalone notebook. (aws#3623)
* Integrate SageMaker Automatic Model Tuning (HPO) with one XGBoost Abalone notebook.
* Addressed comments for HPO integration.
Co-authored-by: Aaron Markham <[email protected]> Co-authored-by: Parth Brahmbhatt <[email protected]> Co-authored-by: Dingheng (Bruce) Zhang <[email protected]> Co-authored-by: Bruce Zhang <[email protected]> Co-authored-by: Gary Wang <[email protected]> Co-authored-by: Gary Wang <[email protected]> Co-authored-by: Noah Luna <[email protected]> Co-authored-by: Gili Nachum <[email protected]> Co-authored-by: Aman Malhotra <[email protected]> Co-authored-by: AnushaVelumani <[email protected]> Co-authored-by: João Moura <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Gili Nachum <[email protected]> Co-authored-by: haohanchen-yagao <[email protected]> Co-authored-by: Miyoung <[email protected]> Co-authored-by: Erin <[email protected]> Co-authored-by: erinho <[email protected]> Co-authored-by: Miyoung Choi <[email protected]> Co-authored-by: Marcelo Aberle <[email protected]> Co-authored-by: Choucri Bechir <[email protected]> Co-authored-by: evikram <[email protected]> Co-authored-by: Bruno Pistone <[email protected]> Co-authored-by: Brock Wade <[email protected]> Co-authored-by: Brock Wade <[email protected]>
…lone notebook. (aws#3623) * Integrate SageMaker Automatic Model Tuning (HPO) with one XGBoost Abalone notebook. * Addressed comments for HPO integration. Co-authored-by: Aaron Markham <[email protected]>
Issue #, if available:
This is not in response to an issue.
Description of changes:
HPO (SageMaker Automatic Model Tuning) is increasing its integration with the first-party (1P) algorithms. One area identified for integration is the 1P notebooks. This is the first PR in a series that will integrate HPO with the 1P algorithm notebooks.
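For readers unfamiliar with what such an integration involves, here is a minimal sketch of the tuning configuration an XGBoost regression notebook might build before starting a tuning job. It is not taken from the modified notebook: the helper name `build_tuning_job_config`, the chosen hyperparameters (`eta`, `max_depth`), their ranges, and the job limits are all assumptions for illustration. Only the dictionary layout follows the boto3 SageMaker `create_hyper_parameter_tuning_job` request shape.

```python
def build_tuning_job_config(max_jobs=4, max_parallel=2):
    """Return a HyperParameterTuningJobConfig dict (illustrative values only).

    Hypothetical helper: the ranges and limits below are assumptions,
    not the values used in the notebook changed by this PR.
    """
    if max_parallel > max_jobs:
        raise ValueError("max_parallel must not exceed max_jobs")
    return {
        "Strategy": "Bayesian",
        # Minimize the RMSE that the built-in XGBoost algorithm reports
        # on the validation channel.
        "HyperParameterTuningJobObjective": {
            "Type": "Minimize",
            "MetricName": "validation:rmse",
        },
        "ResourceLimits": {
            "MaxNumberOfTrainingJobs": max_jobs,
            "MaxParallelTrainingJobs": max_parallel,
        },
        # The API expects range bounds as strings.
        "ParameterRanges": {
            "ContinuousParameterRanges": [
                {"Name": "eta", "MinValue": "0.1", "MaxValue": "0.5"},
            ],
            "IntegerParameterRanges": [
                {"Name": "max_depth", "MinValue": "3", "MaxValue": "9"},
            ],
            "CategoricalParameterRanges": [],
        },
    }


config = build_tuning_job_config()

# With AWS credentials and a training job definition in hand, the config
# would be passed along these lines (not executed here):
#   client = boto3.client("sagemaker")
#   client.create_hyper_parameter_tuning_job(
#       HyperParameterTuningJobName=tuning_job_name,
#       HyperParameterTuningJobConfig=config,
#       TrainingJobDefinition=training_job_definition,
#   )
```

Building the configuration in a plain function keeps it inspectable without any AWS calls; `tuning_job_name` and `training_job_definition` in the comment are placeholders the notebook would have to supply.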
Testing done:
I have tested that the modified notebook runs successfully end-to-end.
Merge Checklist
Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.
black-nb -l 100 {path}/{notebook-name}.ipynb
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.