Moved: R BYO example to under_development after unexpected fetch erro… #91

Merged
merged 2 commits into from
Nov 27, 2017

Conversation

djarpin
Contributor

@djarpin djarpin commented Nov 27, 2017

…r in container build

@djarpin djarpin merged commit 2f941d8 into master Nov 27, 2017
@djarpin djarpin deleted the arpin_r_byo_move branch November 30, 2017 16:51
ajaykarpur pushed a commit that referenced this pull request Dec 1, 2020
ajaykarpur added a commit that referenced this pull request Dec 1, 2020
* GluonCV YoloV3 Darknet53 example training and inference with Neo (#1266)

* upgrade MNIST experiment notebook to SDK v2 (#1576)

* GluonCV YoloV3 Darknet53 example minor fixes (#1582)

* Code cell type corrected. Removed empty cell

* Unzip datasets if not available in the notebook's folder

* fix invalid json in MNIST notebook (#1594)

* Kkoppolu inference examples (#1587)

* Compilation examples changes for new inference containers

Update examples for PyTorch
 - to use the new inference containers
 - Use SageMaker 2.x

* Clear outputs

Clear outputs in the notebook

* Fix typo

Fix typo in text box

* Undo change to iterations in old way

Undo change to iterations in old way

* Code Review feedback

Organize imports

Code Review feedback

* CR

Use new inference containers for both uncompiled and compiled flows.

* CR

Remove incorrect code comments

* Update versions of torch and torchvision

Co-authored-by: EC2 Default User <[email protected]>

* add template notebook (#1570)

* add template notebook

* resolve comments

* Bump tensorflow (#1574)

Bumps [tensorflow](https://github.com/tensorflow/tensorflow) from 1.13.1 to 1.15.4.
- [Release notes](https://github.com/tensorflow/tensorflow/releases)
- [Changelog](https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md)
- [Commits](tensorflow/tensorflow@v1.13.1...v1.15.4)

Signed-off-by: dependabot[bot] <[email protected]>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* mxnet_mnist.ipynb fix (#1597)

* Update mxnet_mnist.ipynb

Set notebook to default to CPU training

* Update mxnet_mnist.ipynb

* updated birds dataset download source (#1593)

* fix pandas errors in notebooks (#1490)

* Refactor the Debugger detect_stalled_training_job_and_stop.ipynb notebook (#1592)

* publish BYOC with Debugger notebook

* some test change

* revert the kernel names in the metadata

* fix typos

* incorporate feedback

* incorporate comments

* pin to pysdk v1

* remove installation output logs

* refactor the stalled training job notebook

* remove unnecessary module imports / minor fix

* incorporate feedback

* minor fix

* fix typo

* minor fix

* fix unfinished sentence

* incorporate feedback

* minor fix

Co-authored-by: Miyoung Choi <[email protected]>

* Make RL training compatible with PyTorch (#1520)

* Make RLEstimator() PyTorch compatible & modify cartpole notebook

* set use_pytorch to False by default

* minor refactor; check in first unit test

* indent correction

* Verify sagemaker SDK version (#1606)

* updating mxnet_mnist notebook (#1588)

* updating mxnet_mnist notebook

* typo fix

* refactoring

* refactored mnist.py

* updated bucket paths in the notebook for better organization

* notebook updated to handle sdk upgrade

Co-authored-by: EC2 Default User <[email protected]>
Co-authored-by: EC2 Default User <[email protected]>

* fixing Model Package ARNs and removing region specific dependency (#1611)

* fixing Model Package ARNs and removing region specific dependency

* Adding a disclaimer on reference notebooks

Co-authored-by: kwwaikar <[email protected]>

* Fix: add 'import tensorflow as tf' required by _save_tf_model (#1560)

Co-authored-by: Felipe Antunes <[email protected]>

* Update xgboost churn neo example for sagemaker v2 (#1591)

* Update xgboost churn neo example for sagemaker v2

* Remove use of latest version

* Add sagemaker installation command and remove duplicate import

* Use sagemaker pysdk v2

* Add setup and cleanup steps

* clear output

* Revert kernel metadata

Co-authored-by: Nikhil Kulkarni <[email protected]>

* Add integration tests using Papermill library for RL notebooks. List of notebooks covered in the tests: (#1580)

1. rl_cartpole_coach/rl_cartpole_coach_gymEnv.ipynb
2. rl_cartpole_ray/rl_cartpole_ray_gymEnv.ipynb

Co-authored-by: Akash Goel <[email protected]>

* Delete KernelExplainerWrapper and remove importing LogitLink and IdentityLink (#1603)

* update-neo-mxnet-notebooks (#1625)

* update-neo-mxnet-notebooks

* refactoring and typo fixes

* Add Ground Truth Streaming notebooks (#1617)

* Add Ground Truth Streaming notebooks

* Made below changes

* Replace .format with f-strings
* Added pip sagemaker install
* Download image from public url
* Minor comments

* Minor f-string updates to chained notebook

Co-authored-by: Gopalakrishna, Priyanka <[email protected]>

* Added downgrade to SDK 1.72 and edited the text. Verified notebook runs through with no errors. (#1633)

* Add SDK version rollback code. (#1634)

* Running tests in parallel for RL notebooks. (#1624)

Co-authored-by: Akash Goel <[email protected]>

* fix: resolve breaking changes of neo container, adding `softmax_label` to `compile_model` (#1635)

* Fixes #902 (#1632)

* fix probability out of bound

* fixed probability out of bound

* cleared the notebook output

* fix of probabilities out of bound

* adding an example for Linear Learner regression use case with abalone dataset and input csv format (#1622)

* infra: add PR buildspec (#1642)

* add notebook instance buildspec

* Update HPO_Analyze_TuningJob_Results.ipynb on where to retrieve a HP job (#1637)

* Update HPO_Analyze_TuningJob_Results.ipynb

Adding instructions on where to find the hyperparameter jobs needed as input.

* Update hyperparameter_tuning/analyze_results/HPO_Analyze_TuningJob_Results.ipynb

Co-authored-by: Aaron Markham <[email protected]>

* infra: update buildspec (#1649)

* update buildspec

* terminate early if no notebooks in PR

* reformat command

* move conditional to build phase as one command

* removing object2vec_multilabel_genre_classification.ipynb (#1648)

* adding preprocessing tabular data notebooks

* incorporating changes

* incorporating changes

* incorporating changes

* incorporating few changes

* minor fix to persist sagemaker version

* minor fix to persist sagemaker version

* removing notebook

Co-authored-by: Ajay Karpur <[email protected]>

* fix: move the Tensorflow import in coach_launcher.py inside the _save_tf_model fn (#1652)

Co-authored-by: Akash Goel <[email protected]>
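
Moving an import inside the function that needs it means the module is loaded only when that function is called, so code paths that never call it do not require the dependency. A minimal sketch of the pattern, assuming nothing about the real `coach_launcher.py`: `save_report` is a hypothetical function and the standard-library `json` module stands in for a heavy dependency such as TensorFlow.

```python
def save_report(data: dict) -> str:
    """Serialize `data`, loading the serializer only when this function runs.

    Hypothetical example: `json` stands in for a heavy, optional dependency.
    """
    # Deferred import: executed on the first call, not when the module is loaded.
    import json

    return json.dumps(data, sort_keys=True)


report = save_report({"b": 1, "a": 2})
```

The trade-off is that a missing dependency surfaces at call time rather than at import time.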

* delete extra common folder inside rl_game_server_autopilot/sagemaker directory (#1653)

Co-authored-by: Akash Goel <[email protected]>

* Removed pip install, edited for clarity, tested on JupyterLab (#1660)

* doc: fix typos in PyTorch CIFAR-10 notebook (#1650)

* fix typos in PyTorch CIFAR-10 notebook

* deliberately raise error to test PR build

* Revert "deliberately raise error to test PR build"

This reverts commit 7c2bac3.

* Update mm byo (#1663)

* Added note that nb won't run in studio, add note about kernel and sdk version testing details

* changed kernel metadata back to conda_mxnet_p36

* Removed conda command to install s3fs. (#1659)

* change: updated for sagemaker python sdk 2.x (#1667)

* min_df was larger than max_df and outside of the acceptable range of 0.0-1.0 (#1601)

* min_df was larger than max_df and outside of the acceptable range of 0.0 to 1.0. This gave me an error but changing the min_df to 0.2 or 0.02 resolved the error. It is unclear if the author intended min_df to be 0.2 or 0.02.

* Update ntm_20newsgroups_topic_model.ipynb

remove output and changed min_df to a likely better default of 0.2

Co-authored-by: Aaron Markham <[email protected]>
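
The error described above is consistent with how proportional document-frequency thresholds are usually validated: a float threshold is read as a fraction of documents, so it has to lie between 0.0 and 1.0, and the minimum cannot exceed the maximum. A minimal sketch of such a check, assuming float thresholds only; `check_df_bounds` and `is_valid` are hypothetical helpers, not part of the notebook or of any library, and the values used are illustrative.

```python
def check_df_bounds(min_df: float, max_df: float) -> None:
    """Raise ValueError if the proportional thresholds are inconsistent.

    Hypothetical helper for illustration only.
    """
    for name, value in (("min_df", min_df), ("max_df", max_df)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name}={value} is outside the range 0.0 to 1.0")
    if min_df > max_df:
        raise ValueError(f"min_df={min_df} is larger than max_df={max_df}")


def is_valid(min_df: float, max_df: float) -> bool:
    """Return True when the pair of thresholds passes the check."""
    try:
        check_df_bounds(min_df, max_df)
    except ValueError:
        return False
    return True


# Illustrative values: max_df=0.8 is an assumption, not taken from the notebook.
ok_after_fix = is_valid(0.2, 0.8)
ok_out_of_range = is_valid(20.0, 0.8)
```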

* Neo pytorch inf1 notebook (#1583)

* Add Neo notebook for PT model on Inf1

* Change target to inf1

* resolve comments

* Add revert sm version

* Add multiple cores instruction and fix revert sagemaker version

* polish instructions

* one more polish

* make sm version at least 2.11.0

* change to upgrade only

* remove fixed pytorch version

Co-authored-by: EC2 Default User <[email protected]>
Co-authored-by: EC2 Default User <[email protected]>
Co-authored-by: Aaron Markham <[email protected]>

* Update generate_example_data.py (#1077)

Added code solution for Bug in the Multinomial.
Lines:
    theta = np.asarray(theta).astype('float64')
    theta = theta / np.sum(theta)
and lines:
    topic_word_distribution = np.asarray(topic_word_distribution).astype('float64')
    topic_word_distribution = topic_word_distribution / np.sum(topic_word_distribution)

Co-authored-by: Aaron Markham <[email protected]>
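
Both quoted snippets cast a probability vector to double precision and then divide it by its own sum, presumably so that the probabilities add up to 1 closely enough for the sampler to accept them. The same renormalization idea can be sketched in plain Python without NumPy; `normalize` is a hypothetical helper and the weights are illustrative, so this is not the code from `generate_example_data.py`.

```python
import math


def normalize(weights):
    """Rescale `weights` so they sum to 1, working in double precision.

    Hypothetical helper for illustration only.
    """
    values = [float(w) for w in weights]  # Python floats are double precision
    total = math.fsum(values)  # accurate floating-point summation
    if total <= 0.0:
        raise ValueError("weights must have a positive sum")
    return [v / total for v in values]


# Illustrative weights that do not sum to 1 before normalization.
theta = normalize([0.2, 0.3, 0.1])
```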

* Fix boolean argument parsing (#1681)
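
The commit title does not say what was wrong. One common cause of this kind of bug in Python scripts is passing `type=bool` to `argparse`, which calls `bool()` on the raw string, so any non-empty value, including `"False"`, parses as `True`. A minimal sketch of that pitfall and one way around it; the flag names `--naive` and `--fixed` and the helper `str2bool` are hypothetical.

```python
import argparse


def str2bool(value: str) -> bool:
    """Convert common textual spellings of a boolean into a real bool."""
    lowered = value.strip().lower()
    if lowered in {"true", "t", "yes", "y", "1"}:
        return True
    if lowered in {"false", "f", "no", "n", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def parse(argv):
    """Parse two hypothetical flags to contrast the two behaviours."""
    parser = argparse.ArgumentParser()
    # bool("False") is True, so this flag is True for any non-empty string.
    parser.add_argument("--naive", type=bool, default=False)
    # The explicit converter maps the text to the intended value.
    parser.add_argument("--fixed", type=str2bool, default=False)
    return parser.parse_args(argv)


args = parse(["--naive", "False", "--fixed", "False"])
```

Here `args.naive` ends up `True` while `args.fixed` is `False`, which is why an explicit converter is often preferred for boolean arguments.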

* Fixed predictions showing as array of False instead of a single True or False value (#1679)

* Fixed predictions matched showing as array of False instead of showing whether prediction is correct (True or False).

* Fixed predictions matched showing as array of False

* Fixed predictions showing as array of False instead of a single True or False

* Dev branch (#1688)

* Adding new project gpt-2

* Reviewed. Reset Kernel.

* made fix to reflect region names in model_package_arns

* Minor notebook content rearrangement

* fixed region-specific arns

* Update README.md

Added description for new project 'creative-writing-using-gpt-2-text-generation' under 'using_model_packages'

* Update README.md

added description for new project 'creative-writing-using-gpt-2-text-generation' under 'aws_marketplace/using_model_packages'

Co-authored-by: Alex Ignatov <[email protected]>

* fix: use image_uris module for retrieval (#1698)

* added autogluon v0.0.14 support, changed the build method (#1640)

* added autogluon v0.0.14 support, changed the build method

* changed the bash execution

Co-authored-by: Eric Johnson <[email protected]>

* added data ingestion notebooks (#1602)

* added data ingestion notebooks

data ingestion notebooks v1

* Added image for Athena and Redshift notebook

Added images displayed in two data ingestion notebooks -- Athena and Redshift

* Text Data Pre-processing Notebook

New notebook added for text data pre-processing, feedback incorporated

* Include Data Aggregation to text data ingestion (S3)

include the text data aggregation content to the text data ingestion notebook

* Modified Data Ingestion Notebooks and Text preprocessing Notebooks

Modified all seven (7) data ingestion and text preprocessing notebooks to incorporate feedback

* Modified the image data ingestion notebook

Added some note to downloading COCO dataset from online resources

* updated all the links in the notebooks

links to notebooks are changed to relative links; links to videos are removed for now and can be added later. Citations to data sources and existing aws notebooks are added.

* modified some links that were not working

modified links that were not working (they referred to another folder)

* Modified 012 for running error

Removed a typo in 012

* updated SageMaker SDK, clear output, added data downloading

added data downloading to the beginning of each notebook; update SageMaker SDK at the beginning of each notebook; output cleared.

* Modified packages used in notebooks

modified packages used in 011, 012, 02, 04 and text data pre-processing.

Co-authored-by: ZoeMa <[email protected]>
Co-authored-by: Talia <[email protected]>
Co-authored-by: Aaron Markham <[email protected]>
Co-authored-by: Ajay Karpur <[email protected]>

* Add framework_version to SKLearn estimator (#1716)

Co-authored-by: Sean Morgan <[email protected]>

* Fix autopilot_customer_churn.ipynb notebook for Sagemaker V2 SDK (#1699)

* Fix notebook for Sagemaker V2 SDK

* revert account change

Co-authored-by: Michele Ricciardi <[email protected]>

* Notebook fixed and cleaned (#1726)

* Notebook fixed and cleaned

* Comment reformatted

* Fixed notebooks for errors due to syntax change and cleaned notebooks (#1723)

* Revert "Fixed notebooks for errors due to syntax change and cleaned notebooks (#1723)" (#1730)

This reverts commit e691349.

* Revert "Notebook fixed and cleaned (#1726)" (#1732)

This reverts commit b68acb4.

* Sample notebook fix 2 (#1675)

* Reducing the random hpo resource values 

We've specified the total number of training jobs to be only 20 and the maximum number of parallel jobs to be 2.

* Edited the text to be consistent with the new parameter values.

With the new parameter values, this notebook now runs without error.

* fixed typo

fixed a typo

* Updated Neo compilation notebook for GluonCV Yolo example (#1638)

* Updated Neo compilation notebook for GluonCV Yolo example

* Minor fixes to comments and logging

Co-authored-by: Eric Johnson <[email protected]>
Co-authored-by: Ajay Karpur <[email protected]>

* Fixed malformed TensorFlow estimator declaration. (#1628)

* Fixed malformed TensorFlow estimator declaration.

* Removed extraneous output.

Co-authored-by: Eric Johnson <[email protected]>

* logx=False plots data as User_Score is <=10 (#1265)

logx=True doesn't seem appropriate: since User_Score is <=10, the plot shows nothing.

Co-authored-by: Aaron Markham <[email protected]>
Co-authored-by: Ajay Karpur <[email protected]>

* Update detect_stalled_training_job_and_stop.ipynb (#1735)

* Updated sagemaker attribute configurations for V2 SDK support (#1636)

Co-authored-by: Aaron Markham <[email protected]>

* Update Batch Transform - breast cancer prediction with high level SDK.ipynb (#1138)

Fix a small bug.
Before I specified content_type='text/csv' in sm_transformer.transform, I got the error "Loading libsvm data failed with Exception, please ensure data is in libsvm format: <class 'ValueError'>"

Co-authored-by: Aaron Markham <[email protected]>

* Edit xgboost_customer_churn_studio.ipynb (#1060)

Co-authored-by: Aaron Markham <[email protected]>

* added a feature selection notebook (#1664)

* added a feature selection notebook

* addressed comments and renamed files for CI

* used model.model_data to index last trained model in s3

* added pip sagemaker>=2.15.0

* add lineage example notebooks (#90)

* add example notebook skeleton for fairness and explainability (#91)

Co-authored-by: Xinyu Liu <[email protected]>

Co-authored-by: Bartek Pawlik <[email protected]>
Co-authored-by: Dana Benson <[email protected]>
Co-authored-by: Krishna Chaitanya Koppolu <[email protected]>
Co-authored-by: EC2 Default User <[email protected]>
Co-authored-by: Aaron Markham <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: IvyBazan <[email protected]>
Co-authored-by: chenonit <[email protected]>
Co-authored-by: Valentin Flunkert <[email protected]>
Co-authored-by: Miyoung <[email protected]>
Co-authored-by: Miyoung Choi <[email protected]>
Co-authored-by: Anna Luo <[email protected]>
Co-authored-by: Pratyush Bagaria <[email protected]>
Co-authored-by: EC2 Default User <[email protected]>
Co-authored-by: EC2 Default User <[email protected]>
Co-authored-by: Kanchan Waikar <[email protected]>
Co-authored-by: kwwaikar <[email protected]>
Co-authored-by: Felipe Antunes <[email protected]>
Co-authored-by: Felipe Antunes <[email protected]>
Co-authored-by: Nikhil Kulkarni <[email protected]>
Co-authored-by: Nikhil Kulkarni <[email protected]>
Co-authored-by: Akash Goel <[email protected]>
Co-authored-by: Akash Goel <[email protected]>
Co-authored-by: Somnath Sarkar <[email protected]>
Co-authored-by: gopalakp <[email protected]>
Co-authored-by: Gopalakrishna, Priyanka <[email protected]>
Co-authored-by: Laren-AWS <[email protected]>
Co-authored-by: Chuyang <[email protected]>
Co-authored-by: Hongshan Li <[email protected]>
Co-authored-by: moagaber <[email protected]>
Co-authored-by: Roald Bradley Severtson <[email protected]>
Co-authored-by: Paul B <[email protected]>
Co-authored-by: Eric Slesar <[email protected]>
Co-authored-by: PaulC-AWS <[email protected]>
Co-authored-by: Corvus LEE <[email protected]>
Co-authored-by: aserfass <[email protected]>
Co-authored-by: minlu1021 <[email protected]>
Co-authored-by: EC2 Default User <[email protected]>
Co-authored-by: EC2 Default User <[email protected]>
Co-authored-by: hbono2019 <[email protected]>
Co-authored-by: H. Furkan Bozkurt <[email protected]>
Co-authored-by: Eitan Sela <[email protected]>
Co-authored-by: awsmrud <[email protected]>
Co-authored-by: Alex Ignatov <[email protected]>
Co-authored-by: Eric Johnson <[email protected]>
Co-authored-by: Yohei Nakayama <[email protected]>
Co-authored-by: ZoeMa <[email protected]>
Co-authored-by: ZoeMa <[email protected]>
Co-authored-by: Talia <[email protected]>
Co-authored-by: Sean Morgan <[email protected]>
Co-authored-by: Sean Morgan <[email protected]>
Co-authored-by: Michele Ricciardi <[email protected]>
Co-authored-by: Michele Ricciardi <[email protected]>
Co-authored-by: vivekmadan2 <[email protected]>
Co-authored-by: playphil <[email protected]>
Co-authored-by: Gili Nachum <[email protected]>
Co-authored-by: sdoyle <[email protected]>
Co-authored-by: fyang1234 <[email protected]>
Co-authored-by: annbech <[email protected]>
Co-authored-by: Xinyu <[email protected]>
Co-authored-by: Xinyu Liu <[email protected]>
ajaykarpur pushed a commit that referenced this pull request Dec 1, 2020
ajaykarpur pushed a commit that referenced this pull request Dec 1, 2020
ajaykarpur pushed a commit that referenced this pull request Dec 1, 2020
* Revert "add lineage example notebooks (#90)" (#94)

This reverts commit 1556e09.

* Revert "add example notebook skeleton for fairness and explainability (#91)" (#93)

This reverts commit d3a6c89.

* add lineage example notebook

Co-authored-by: David Nigenda <[email protected]>
Co-authored-by: hbono2019 <[email protected]>
Co-authored-by: H. Furkan Bozkurt <[email protected]>
Co-authored-by: Eitan Sela <[email protected]>
Co-authored-by: awsmrud <[email protected]>
Co-authored-by: Alex Ignatov <[email protected]>
Co-authored-by: Eric Johnson <[email protected]>
Co-authored-by: Yohei Nakayama <[email protected]>
Co-authored-by: ZoeMa <[email protected]>
Co-authored-by: ZoeMa <[email protected]>
Co-authored-by: Talia <[email protected]>
Co-authored-by: Sean Morgan <[email protected]>
Co-authored-by: Sean Morgan <[email protected]>
Co-authored-by: Michele Ricciardi <[email protected]>
Co-authored-by: Michele Ricciardi <[email protected]>
Co-authored-by: vivekmadan2 <[email protected]>
Co-authored-by: playphil <[email protected]>
Co-authored-by: Gili Nachum <[email protected]>
Co-authored-by: sdoyle <[email protected]>
Co-authored-by: fyang1234 <[email protected]>
Co-authored-by: annbech <[email protected]>
Co-authored-by: Xinyu <[email protected]>
Co-authored-by: Xinyu Liu <[email protected]>
atqy pushed a commit to atqy/amazon-sagemaker-examples that referenced this pull request Aug 16, 2022
* Add cpu build scripts

uncomment script

* Add tf scripts testing

* Add gpu tests as well

* fix cpu build and us-west-2

* remove sagemaker container
atqy pushed a commit to atqy/amazon-sagemaker-examples that referenced this pull request Aug 16, 2022
* rename env variable
atqy pushed a commit to atqy/amazon-sagemaker-examples that referenced this pull request Aug 16, 2022
* rotation policy

* fix tests

* fix write event call

* add comments in code

* add a test through hook

* fix rotation

* some fixes

* delete file if empty

* enable multi-process test

* fix multi-process test

* add pt distrib test

* Revert "add pt distrib test"

This reverts commit a8fc661a02ba29e6fdc49019006b2dafc3cbd67d.

* enable write to s3

* address some review comments

* address some more review comments

* cleanup

* some fixes

* make timestamp mandatory
* filename timestamp matches 1st event

* more cleanup and fixes

* consolidate classes
* timestamp in UTC

* address review comments

* edit base_start_time

* remove delete if empty

* default queue size and flush secs

* Add timestamp test

* add abs and rel timestamp in record

* save default values to constants file

* Cached the names of parsed files to avoid parsing them every time.

* address review comments

* lazy file creation
* drop events if file creation fails
* rename file to event end ts

* correct s3 bucket name

* test timestamp with file rotation

Check that the timestamps of all events in a file are less than the timestamp in the file name.
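
The check described above can be sketched roughly as follows. This is only an illustration: the file name layout (an integer microsecond timestamp, then an underscore, then the rest of the name) and the function name `events_precede_file_timestamp` are assumptions, not the repository's actual test. The comparison is inclusive because the file is assumed to be named after the end timestamp of its last event.

```python
import os


def events_precede_file_timestamp(file_path, event_timestamps_us):
    """Return True if no event timestamp exceeds the timestamp in the file name.

    Assumes (hypothetically) that the file name starts with an integer
    timestamp in microseconds, e.g. "1596668400000000_node0_timeline.json".
    """
    file_name = os.path.basename(file_path)
    # Everything before the first underscore is treated as the file timestamp.
    file_timestamp_us = int(file_name.split("_", 1)[0])
    return all(ts <= file_timestamp_us for ts in event_timestamps_us)
```

A test with file rotation could call this helper once per rotated file and assert that it returns True for each.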

* remove ref to s3

* remove changes to s3.py

* add checks for healthy writer

* test file open failure

* Cleanup hook

* Added the buffer for looking up trace file, removed the get_events_at_time function, updated the implementation of get_events to return the active events

* make timestamp mandatory everywhere

* fix mxnet test

* Corrected the multiplier for microseconds

* remove flush_secs

* Updating the tests directory with new file format.

* Simplify class structure

* save base_start_time in record

* Updated the test directories to the updated YYYYMMDDHR format
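
As a rough sketch of the directory naming mentioned above, the snippet below formats a microsecond epoch timestamp as a YYYYMMDDHR string. It assumes that "HR" means a two-digit hour and that the timestamp is interpreted in UTC; the function name `hour_directory_name` is hypothetical.

```python
from datetime import datetime, timezone


def hour_directory_name(timestamp_us):
    """Format a microsecond epoch timestamp as a YYYYMMDDHH string in UTC."""
    moment = datetime.fromtimestamp(timestamp_us / 1_000_000, tz=timezone.utc)
    # %H is the zero-padded hour on a 24-hour clock (assumed meaning of "HR").
    return moment.strftime("%Y%m%d%H")
```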

* init env variables once

* Renamed the function and added function comments

* address some review comments

* cleanup

* Fixed the trace file look for start and end time events

* Truncating the trace files and updating the test file.

* fix pt test

* fallback node ID

* Removed the functionality to cap the upper_bound_timestamp

* Optimize refreshing the file list based on the last available timestamp in the data source, viz. local or S3

* Correctly named the file suffix. Truncated the horovod timeline file

* Added the functionality to download the S3 files in parallel

* Addressed the review comments

* address review comments

* Trace events writer - part 2 (#6)

* ensure there's a dir for the new file
* add .tmp
* handle the case when events are far apart
* fix a mistake in cur_hour
* updated last_file_close_time to now

Co-authored-by: Vikas-kum <[email protected]>

* Record step duration in keras hook (#8)

* add step duration to keras hook

Co-authored-by: Vikas-kum <[email protected]>

* test TF step time with timeline writer (#9)

* Read node ID from Resource config (#10)

* read host ID from resource config

* use timeline writer directly (#11)

* Added functionality to record node_id in the events (#7)

* Added functionality to record node_id in the events

* Added the test to verify node id from file

* Moved the functions to extract node id and timestamp to utils directory.

* Add profiler config parser (#12)

* Timeline file name timestamp in us (#15)

* file timestamp in us

* Add comprehensive tests for detailed profiler config (#18)

* adding comprehensive tests

* refactoring fixtures

* renaming vars

* remove imports

* remove extraneous fixture

* PR changes

* documenting test cases

* documenting test cases

* refactoring fixtures

* Supporting efficiently downloading s3 files for distributed training (#14)

* Supporting efficiently downloading s3 files for distributed training

* updated op_name and args when recording step duration (#17)

* fixes for right directory name (#20)

* Fix folder name (#21)

* fixes
* change all variables to microsecs

* Updating the files to fix the pre-commit failures (#23)

* Change invalid file path (#25)

* change invalid file path

* fix other precommit errors

* Add error handling for parsing profiler config (#27)

* Fixing the tests for CI (#28)

* Fixing the tests for CI

* fix out_dir bug

Co-authored-by: Neelesh Dodda <[email protected]>

* Default path for profiler has changed (#29)

* Update and correct some documentation (#30)

* Enabling TF profiler in smdebug (#5)

* Enabling TF profiler in smdebug
Co-authored-by: Neelesh Dodda <[email protected]>

* change variable name and folder path (#35)

* change variable name and folder path

* add tests to check rotation policy

* Add ProfilerSystemMetricFileParser and basic tests (#16)

* Add ProfilerSystemMetricFileParser and basic tests
* Refactor MetricsReaderBase class
* Fix timestamp to event files mapping for both MetricsReader and SystemMetricsReader
* rename MetricsReader to AlgorithmMetricsReader

* refactoring. Providing a way to avoid cache and hence going OOM (#38)

* refactoring. Providing a way to avoid cache and hence going OOM
* modifying test cases to have use_in_memory_cache param

* Time annotations in PyTorch hook (#13)

* modified pytorch hook to record time annotations
Co-authored-by: Vikas Kumar <[email protected]>

* Pulling in changes from smdebug repo to private (#39)

* latest commit from smdebug repo master is 
* Disable TB Testing  (aws#275) with commit id b8661de
Co-authored-by: Nihal Harish <[email protected]>
Co-authored-by: Vikas-kum <[email protected]>

* Reorganizing the profiler tests for PR CI build (#41)

* Organized the profiler tests.

* Updated the tests.sh for PR CI build

* Updated the tests.sh for PR CI build

* profiler dashboards (#4)

* add files for profiler dashboards
* updated dashboards to use timeline reader
* fixed bug 2,5,6,7,9,10 from bugbash
* fixed bug 1,3,4,8,16,17,19 from bugbash
* linked x-axis of timeline charts

* Creating a generic profiler dashboard & report (#42)

* Creating a generic profiler dashboard which can take a training job name and region
and execute the notebook.

* review comments

* Updated notebooks and added Pandas functionalities (aws#43) (aws#44)

* updated notebook and added Pandas functionalities
* minor fixes in profiler_generic_dashboard.ipynb

Co-authored-by: Nathalie Rauschmayr <[email protected]>

* Enable file rotation for Horovod trace file (#33)

* Hvd file reader and rotation of files

Co-authored-by: Anirudh <[email protected]>

* Pytorch profiler new (#40)

* adding profiling info to pytorch hook

* more changes

* capturing forward and backward time from within pytorch hook
Note that hook provides backward end time, so backward start time
is approximated to end of last forward/backward or now
So, forward times and backward end times should be accurate while
backward start time is approximated.
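
The approximation described above might look roughly like this. It is a minimal sketch, not the hook's actual code: the function name `approximate_backward_start` and its parameters are hypothetical, and timestamps are assumed to be plain floats in seconds.

```python
import time


def approximate_backward_start(last_event_end, now=None):
    """Approximate when a backward pass started.

    The hook only reports when backward ends, so the start is taken to be
    the end of the most recent forward/backward event, or the current time
    if no earlier event has been recorded.
    """
    if last_event_end is not None:
        return last_event_end
    # No earlier event is known, so fall back to "now".
    return time.time() if now is None else now
```

Under this scheme the recorded backward duration is an upper bound, since any idle time between the previous event and the real start of the backward pass is counted as backward time.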

* removed print statements

* ran pre-commit and removed some log statements

* pre commit run

* Fixed the assert

* Temporarily skipping the test on codebuild projects where pytorch is not installed.

* Temporarily skipping the test on codebuild projects where pytorch is not installed.

* Temporarily skipping the test on codebuild projects where pytorch is not installed.

* Temporarily skipping the test on codebuild projects where pytorch is not installed.

* Temporarily skipping the test on codebuild projects where pytorch is not installed.

* reverted the temporary changes

* Fixed the assert

* Fixing the CI test failure

* Fixed the code to include the last layer

* Updated the tests and refactored the TraceEvent class.

* Converted the rnn test to pytest variant

* Fixed the assert for passing CI

Co-authored-by: Vikas-Kum <[email protected]>
Co-authored-by: Vikas Kumar <[email protected]>

* Python profiler (#36)


Co-authored-by: Neelesh Dodda <[email protected]>

* Changes to horovod file parser (aws#46)

* TF2 profiler tests (aws#48)

* test detailed step/time based profiling

* Bug fixes for autograd profiler in Pytorch hook. (aws#50)

* fixed pytorch hook

* fixed merge conflict

* fixed bug in hook

* Adding action class (aws#285) (aws#54)

* Adding action class
Actions added: stop training job, email, SMS

Co-authored-by: Vikas-kum <[email protected]>

* Pull in changes from the sagemaker-debugger repository (aws#55)

* Pull in changes from the sagemaker-debugger repository

* Typecasting profiling parameters to int (aws#52)

* Refactor analysis utils (aws#57)

* Integration tests for profiler on sagemaker (#19)

scripts and infrastructure code

* Typecasting str profiling parameters to bool (aws#58)

* Typecasting str profiling parameters to bool

* Add pyinstrument for python profiling (aws#56)

* Make DetailedProfilingConfig a string in profiler config (aws#67)

* detailed profiling config now is string

* install tf_datasets (aws#66)

* Convert profiler data to pandas frame (aws#47)

* add class to convert profiler data to pandas frame

* fixed local reader

* add notebook for pandas queries

* added code to find workload balancing issues in multi GPU training

* Adding more checks to integration tests (aws#73)

* pytorch Added step event, mode and more details to detailed profiling (aws#78)

* Added step event, mode and more details to detailed profiling
* Changing op name string
* Making op_name equivalent to TF
* changing step num to mode_step
* Adding phase to autograd events

* Change timeline node_id for distributed workers (aws#80)

* change timeline node_id for distributed workers

* Add integration tests for detailed profiling and python profiling (aws#71)

* Fixing a bug where step num was not correctly used when enabling
detailed profiling
Dumping the torch autograd profiler every step. If there are multiple steps
then data builds up and can cause gpu memory build up.

* Feature to profile for different step phases
2. Capturing profiling step phases for pytorch
3. Fix bug with path string, which always had cprofile in the path
even if the pyinstrument profiler is used

* Fix pre-commit

* Fix call to stats_filename

* Fixing PythonStepStats

* auto commit

* fix

* Fix

* fix

* pre commit fix

* fix bug

* removed code

* make profiling parameters case insensitive

* docstring for case insensitive config

* precommit

* push profiler images to alpha and get tag from environment variable

* push profiler images to alpha and get tag from environment variable

* Add height param to HeatMap

* specify registry ID as env variable, alpha by default

* Some cleanup, adding total time in cprofile

* Refactored metricsHistogram and stepHistogram and made them more modular

* separate usepyinstrument

* Fixes for metrics histogram

* Fixing StepHistogram

* replacing print with logger

* refactoring

* changes in detailed profiling

* remove imports

* notebook fixes and histogram class fixes

* Adding wheel file

* running pre-commit

* fix tests

* Adding unique thread id , pid, for trace event parser
In every event added event_phase, node_id

* pre-commit

* fixing notebook and other changes

* fix check for event_Args None

* Changing notebook

* upload files to s3 during test

* minor fix

* create new s3 folder for stats

* fix syntax errors

* Some cleanup

* Fix int typecast for rotatemaxfilesizebytes (#19)

Co-authored-by: Vikas-kum <[email protected]>

* Pull in smdebug 145d43b (#38)

* Pull in latest smdebug (0.9.1) (upto commit 145d43b)
* Reverting the change to GET_OBJECTS_MULTIPROCESSING_THRESHOLD in #14.

* Adding metadata file for TF Profiler parser to include start time (#4)

* TF profiler event parser
* fix can_start_prof bug
* populate start time
* handle tf trace json in reader
* separate file for metadata

* Reorder the writing of events so that events get correctly written according to their end timestamp. (#39)

Co-authored-by: Vikas-kum <[email protected]>

* Enable profiling between steps for tensorflow (#2)

* Dump HTML for each pyinstrument stats file (#16)

* output html in python profiler
* dump output html for pyinstrument

* Add higher level analysis functions for cProfile python profiling (#6)

* Updated preview notebooks  (#8)

* Valid trace file check (#41)

* fix valid trace file check
* change log level

* Adding analysis utils and updating the analysis notebook (#9)

* add pandas analysis utils
* update profiler analysis notebook (#32)
* Updated analysis utils (#34)
* add python profiling to notebook (untested)

Co-authored-by: NRauschmayr <[email protected]>
Co-authored-by: Neelesh Dodda <[email protected]>

* check record end time similar to c++ writer (aws#45)

* remove flakiness offset from sm tests (aws#43)

* Add example notebook fixes for python profiling (aws#46)

* Refactored profiler dashboards  (#42)

* refactored dashboards to plot new system metrics

* updated step timeline chart to plot train/eval/global step

* bugfixes for analysis notebook (aws#44)

* Bugfixes in analysis and notebooks (aws#49)

* Followup to the PR on analysis utils (aws#50)

* Prevent metrics reader from reading invalid files (aws#52)

* Modify horovod tests to generate check for horovod timeline (aws#51)

* Bugfixes  (aws#57)

* fix for dashboards

* Add timeline image for bottlenecks notebook (aws#59)

* Error handling for pyinstrument (aws#58)

* Enable/disable python profiling after forward pass of pytorch hook instead of backward pass (aws#56)

* Pytorch integration tests (#33)

* Enabling integration tests for pytorch

* Fixed the job index for codebuild project.

* Fixed the job index for codebuild project.

* Fixing the codebuild project to install smdebugger in docker

* Fixing codebuild project

* Adding cpu jobs

* Adjusted the parameters for cpu jobs

* PyTorch detailed profiler traces are not present in detailed_profiling directory.

* Fixing the test yml file.

* Fixing the test yml file.

* Removed commented code.

* Added test configuration for absent profiler.

* Preloading the cifar10 dataset into source directory.

* Enabled the assert for checking the timestamp

* adjusted the tracefile counts

* Fixed the job names, added tests for cprofile

* Updated the job configs

* Adjusted the expected trace file count.

* Changed the order in which the trace events are written

* Reduced the batch size for cpu tests.

* Reduced the batch size for cpu tests.

* Fixed the imports

* Added capability to handle html file.

* Adding horovod tests for integration

* Adding horovod tests for integration

* Fixed the assert for horovod trace file count

* Valid trace file check (#41)

* fix valid trace file check
* change log level

* Fixed the expected count of stats and trace files.

* Fixed the profiler config name UsePyinstrument

* Preloading mnist dataset to avoid downloading it from internet during training.

* Bugfixes in analysis and notebooks (aws#49)

* Added test scenario to test the file rotations.

* Adding more test scenarios

* Adding integration test for distributed training using distributed api

* Adding horovod training with resnet50 and cifar10

* Fixing the launcher script for resnet50 with horovod.

* Increased the batch size

* Supporting res50 and cifar with horovod.

* Fixed the validation for horovod tracefiles.

* Update tests/sagemaker/test_profiler_pytorch.py

Co-authored-by: Anirudh <[email protected]>

* Scheduling sagemaker jobs in parallel.

* Fixed the config file path.

Co-authored-by: Vandana Kannan <[email protected]>
Co-authored-by: Nathalie Rauschmayr <[email protected]>
Co-authored-by: Anirudh <[email protected]>

* Fix buildspec yaml file for TF integration tests (aws#66)

* Merge latest changes from smdebug to smprofiler (aws#68)

* Updating analysis utils (aws#63)

* Modify step stats util to compute stats for multiproc data
* Modify utils to handle multi-node data
* Modify notebook utils to handle multi-node data

Co-authored-by: Neelesh Dodda <[email protected]>

* Merge timeline for framework events (#5)

* Fixing the CI failure caused by awscli (aws#72)

* Add metrics config (aws#67)

* Add API functions to python profiling analysis for correlation with framework metrics (aws#53)

* Dataloader analysis for PyTorch (aws#64)

* Adding the functions to get the dataloader events for pytorch

* Adding the training script and notebook for dataloader analysis

* Fixed the time conversion from timestamp to UTC and fixed the local reader for system tracefiles.

* Updating the dataloader analysis notebook

* Updated the notebook with analysis for batch processing.

* Updated notebook to display python profiler stats.

* Updated the notebook with documentation and layout

* Updated the notebook to have static contents

* Updating the notebook to handle absence of traceevents

* Fixed the trace events as per the current format and added notebook for triggering the pytorch training jobs

* Moved the analysis functions from notebook to a class

* Updated the utility functions to retrieve the dataloader events

* Added the test scripts for horovod and distributed training

* Adding a script that uses dummy custom dataloader

* Addressed the review comments

* Updated the utility code and added a training script that uses custom datasets

* Added hyperparameters for custom dataset training.

* Fix TF event file decompression issue (aws#73)

* Fix bugs in keras hook (aws#75)

* Reorder events in pytorch hook (aws#60)

* Refactor metrics config (aws#76)

* Perf benchmark (#31)

* Fix for hvd reader issue and one more change (aws#74)

* Fixing the batch time analysis in interactive notebook to not generate incorrect plot (aws#81)

* Fixing the computation of batch time

* Fixing the computation of batch time

* retrigger CI

* Attempting to fix PR CI

* Attempting to fix PR CI for PyTorch

* Attempting to fix PR CI for PyTorch

* Merge timeline fixes (aws#82)

* Merge timeline fixes
1) putting the node_ids as threads.
2) Providing right sort order for processes and threads
3) Fixing bugs

* add check if gpu is available (aws#62)

Co-authored-by: Vikas-kum <[email protected]>

* Performance benchmarking for PyTorch (aws#78)

* Pytorch performance tests

* Fixed the estimator

* Fixed the training script for correct metrics generation

* Added train duration metrics in the training script

* Adjusted the alarm values

* Adjusted the alarm values

* Fixed the job name for no smdebug and no profiler

* Optimized the training script and added comments in the driver script.

* Updated the scripts for framework only training job

* Removed the unnecessary code.

* Updating the instance types.

* Notebook for interactive analysis (aws#69)

* Notebook for interactive analysis

* add python profiling to interactive analysis notebook

* Updated the interactive notebook with dataloader analysis for pytorch

* updated the utility functions to retrieve the dataloader events

* some changes to the nb

* some fixes to the nb

* fixes

* reset index

* editing nb content

* fixes

* nit fix

* fixes after metricsconfig

* update notebooks

* add updated job notebooks

* updated notebooks for bug bash

* update TF notebook

* rename notebooks

* rename notebooks

* updating notebooks with feedback

* Renamed Profiler to EagleEye

* minor edits

* scripts

* fix

* Updated the interactive analysis notebook with minor fix.

* Updated the instance type for rules to ml.m5.8xlarge

* Updated the rules instances to ml.r5.4xlarge

* miyoung's changes

Co-authored-by: Neelesh Dodda <[email protected]>
Co-authored-by: Amol Lele <[email protected]>
Co-authored-by: Anirudh <[email protected]>

* Fixed the metrics names to have correct instance names. (aws#88)

* Added empty name in an event during merge_timeline if it is missing (aws#87)

* Add an empty name only for Horovod and Herring events if name is missing for E events.

* Add ProfilerTrial class and profiler builtin rules  (aws#54)

* add files for gpu usage rule
* adding rule to detect cpu bottlenecks
* add rule to detect outliers in step duration
* added node id to rule analysis
* add rule for checking gpu memory increase
* added rules for batch size and max initialization time
* add rule to detect load balancing issues in multi GPU training
* add dockerfiles to build rule container
* applying changes from https://github.com/awslabs/sagemaker-profiler/commit/57dfe2bd960ae798610b6ff52f661a4f5475eded fixed output directory and label legends

Co-authored-by: Vandana Kannan <[email protected]>
Co-authored-by: Vikas Kumar <[email protected]>

* Fixing the writing of first event in the tracefile that stores the start time from epoch (aws#85)

* Fixing the writing of first event in the tracefile.

* Added the master table to ensure that we always write the metaevent in the new traceevent file.

* Fixing bugs in KerasHook and profiler utils (aws#89)

* Change smdebug version in notebooks (aws#90)

* change smdebug version
* rename tf_python_stats_dir to python_stats_dir

Co-authored-by: Neelesh Dodda <[email protected]>

* Dynamic ON/OFF Herring timeline for PyTorch framework (aws#80)

* Fix pytest version (aws#91)

* support mixed precision training (aws#96)

* merging sys metrics and bottlenecks in the timeline (aws#93)

* merging sys metrics and bottlenecks in the timeline

* Fix hvd failures and add native TF training in TF integration tests (aws#97)

* Reading rule stop signal file and stopping the rule if gracetime has … (aws#98)

* Reading rule stop signal file and stopping the rule if gracetime(60s) has passed

* [Sync] Sync smdebug with sagemaker-debugger master branch (aws#95)

Co-authored-by: Vikas-kum <[email protected]>
Co-authored-by: Vandana Kannan <[email protected]>
Co-authored-by: Anirudh <[email protected]>
Co-authored-by: Miyoung <[email protected]>
Co-authored-by: Miyoung Choi <[email protected]>
Co-authored-by: Rahul Huilgol <[email protected]>
Co-authored-by: Amol Lele <[email protected]>

* add rule for framework metrics  (aws#100)

* add rule for framework metrics overview

* update report

* replaced matplotlib figures with bokeh charts

* fix pre-commit error

* minor fixes in report notebook

Co-authored-by: Connor Goggins <[email protected]>

* Update Profiler Trial and Rules to Generate Report on Every Invoke (aws#102)

* [TRSL-1037] Emit RuleEvaluationConditionMet from ProfilerReport Rule (aws#105)

* [TRSL-1037] Emit RuleEvaluationConditionMet from ProfilerReport Rule

Update ProfilerReport rule to emit RuleEvaluationConditionMet if any subrule
has its rule evaluation condition met.

* Update to emit RuleEvaluationConditionMet at the end of job

* Fix comment

* add unit test for ProfilerReport

* remove scanel_interval passed in

* Update unit tests

* Fix incorrect comment on last step.

* Update log message.

* Sync with sagemaker-debugger master branch and fix issue with tensorflow_datasets version (aws#114)

* Update sagemaker.md (aws#250)

* Bumping version to 0.9.0 (aws#251)

* Skip using standalone keras Py3.7+ (aws#253)

* Gradtape zcc (aws#252)

* Fix Incorrect Log Statement (aws#256)

* Incorrect number of tensors saved with MirroredStrategy (aws#257)

* Change Version to 0.8.1 (aws#258)

* Save Scalars With Mirrored Strategy (aws#259)

* skip flaky test (aws#262)

* Don't export to collections for all workers with unsupported distrib training (aws#263)

* version bump (aws#265)

* Avoiding Basehook object pickling (aws#266)

* handle eager tensors (aws#271)

* TF 2.x: Support for keras to estimator (aws#268)

* Revert "TF 2.x: Support for keras to estimator (aws#268)" (aws#273)

This reverts commit 749bded.

* Disable TB Testing  (aws#275)

* Support for TF 2 estimator (aws#274)

* Adding a TF2 Hvd example and test (aws#279)

* Moved end of training log from info to debug (aws#281)

awslabs/sagemaker-debugger#280

* Adding action class (aws#285)

* Adding action class
Actions added: stop training job, email, SMS

* Fix buildspec used for PR CI (aws#287)

* Adding a test to check that PT model is saved without issues (aws#283)

* test that model can be pickled without issues

* Save Model Inputs, Model Outputs, Gradients, Custom Tensors, Layer Inputs, Layer Outputs (aws#282)

* Pin pytest version (aws#293)

* Load IRIS Dataset from S3 (aws#298)

* Load dataset from s3 (aws#299)

* remove problematic log (aws#300)

* Change Enum (aws#301)

* Doc update (aws#292)

* rename enum (aws#305)

* version bump to 0.9.1 (aws#304)

* modify asserts (aws#307)

* version compare (aws#306)

* Support TF 2.3 Tests (aws#312)

* Disable TB in ZCC for AWS TF 2.3.0 (aws#316)

* Update Assert Statements For New TF 2.2.0 DLC (aws#320)

* Version Bump (aws#319)

* add a note for TF 2.2 limited support (aws#303)


Co-authored-by: Miyoung Choi <[email protected]>
Co-authored-by: Nihal Harish <[email protected]>

* TF 2.2 documentation update  (aws#322)

* update TF 2.2 smdebug features
* Update code samples/notes for new pySDK and smdebug/add and fix links
* add 'New features' note
Co-authored-by: Miyoung Choi <[email protected]>

* Adding pagination in list_training_jobs (aws#323)

* Adding pagination in list_Training_jobs

* Test Custom Step Usecase (aws#331)

* save tf2 model (aws#333)

* Add ability to only save shapes of tensors (aws#328)

* Revert "Add ability to only save shapes of tensors (aws#328)" (aws#337)

This reverts commit c9eb769.

* Function to Test If the hook has been configured with the Default hook config (aws#332)

* Default hook config (aws#338)

* version bump (aws#339)

* TF ZCC limitation footnote (aws#342)

* Ability to save shapes (aws#341)

* WIP saveshape

* Add shape writer

* Add pytorch test

* Add untested keras test

* fix syntax

* fix syntax

* Import

* Import

* Add tests for TF

* Simplify read code

* Add read API and tests

* Add mxnet test

* Add s3 and json tests

* lint

* Fix payload

* fix import

* Handle different num tensors for losses

* Fix exact equal condition

* Fix mode bug

* trigger CI

* Add support for distributed training with writer map

* Check that value throws exception

* Fix tests to make them more resilient

* Fix mxnet and pytorch tests

* Remove tensor names

* pre-commit

* Fix get_mode

* Fix bug with old index files

* Fix keras test with names of tensors

* Set original name to None if tf_obj is None

* Fix mirrored test for cpu

* Add docs

* trigger CI

* Fix shape writer get

* Simplify by removing shape writer

* Cleanup

* Fix name of writer

* Addressed review comments

* trigger ci

* retrigger CI

Co-authored-by: NihalHarish <[email protected]>

* Support Inputs and Labels in the dict format (aws#345)

* 0.9.4 (aws#347)

* Refactor Make Numpy Array (aws#329)

* warn gradtape users  about tf.function support (aws#348)

* Support all tf types (aws#346)

* Model Subclassing Test (aws#351)

* Modify Should Save Tensor Test To Work on Any Version of TF (aws#352)

* framework version updates (aws#360)

* list training jobs improvements (aws#349)

* Earlier, list training jobs would make 50 attempts regardless of outcome, which could generate unnecessary traffic.
* if training jobs are found with the prefix, we break
* if exceptions are caught more than 5 times, we break.
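
The early-exit behaviour listed above can be sketched as a small polling loop. This is an illustration under stated assumptions, not the library's implementation: `find_jobs_with_prefix` and the `fetch_page` callable are hypothetical stand-ins for whatever issues the real list-training-jobs request.

```python
def find_jobs_with_prefix(fetch_page, max_attempts=50, max_exceptions=5):
    """Poll for training jobs, stopping early on success or repeated failure.

    fetch_page is a hypothetical zero-argument callable that returns a list
    of job names matching the prefix (possibly empty) or raises on error.
    """
    exceptions_seen = 0
    for _ in range(max_attempts):
        try:
            jobs = fetch_page()
        except Exception:
            exceptions_seen += 1
            # Give up once exceptions have been caught more than the limit.
            if exceptions_seen > max_exceptions:
                break
            continue
        if jobs:
            # Jobs were found, so no further attempts are needed.
            return list(jobs)
    return []
```

Returning an empty list rather than None on failure keeps the return type uniform for callers.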

* Handle Deprecation Of experimental_ref api (aws#356)

* check file exist before moving (aws#364)

* check file exist before moving when closing the file.

* Support Saving Tensors in Graph Mode with add_for_mode (aws#353)

* Change layer name logic (aws#357)

* Pass Variable Length Argument To Old Function Call (aws#366)

* test concat layers (aws#367)

* Update README.md (aws#371)

* Pinning the version of tensorflow_datasets package so that it does not require updating TF (aws#373)

Co-authored-by: NihalHarish <[email protected]>

* Bugfix: Debugger breaks if should_save_tensor is called before collections are prepared (aws#372)

* Fixing the nightly build pipelines. Avoid force reinstall of rules package when not necessary (aws#374)

* returning list instead of dict keys (aws#376)

Fix in the return of _get_sm_tj_jobs_with_prefix. This function should always return a list.

* Add support for mixed precision training (aws#378)

* Modify Asserts to Work with TF 2.1.0 and TF 2.0.0 (aws#380)

* pytorch tmp (aws#382)

* extend zcc to 2.1.2 (aws#384)

* disable pytorch (aws#386)

* Removed the redundant installation of smdebug and smdebug-rules (aws#391)

* Incrementing the version to 0.9.5 (aws#396)

* pin tensorflow dataset in test config (aws#399)

* add back test

* revert some changes

* unpin pytest version

Co-authored-by: Nihal Harish <[email protected]>
Co-authored-by: Vikas-kum <[email protected]>
Co-authored-by: Vandana Kannan <[email protected]>
Co-authored-by: Anirudh <[email protected]>
Co-authored-by: Miyoung <[email protected]>
Co-authored-by: Miyoung Choi <[email protected]>
Co-authored-by: Rahul Huilgol <[email protected]>
Co-authored-by: Amol Lele <[email protected]>

* Changing the Herring user-facing API (aws#110)

* [TRSL-998] Update Rule Test with Result Checking (aws#106)

* [TRSL-998] Update Rule Test with Result Checking

Update existing rule testing to assert against rule output. This ensures that
rules are tested against their report results, which should be deterministic in CI.

* Generate HTML Report at every ProfilerReport invoke (aws#112)

This change adds HTML report generation at the end of every invocation of the ProfilerReport rule.

* Update RuleEvaluationConditionMet to indicate end of the rule (aws#115)

* fix: Remove the hard-coded notebook file path (aws#117)

* Run rules tests in CI (aws#116)

* Log fix memory issue fix (aws#113)

* Changed the Herring API and variable names (aws#118)

* Removing the functionality to attach the backward hook to the module (aws#125)

* Removing the functionality to attach the backward hook to the module

* Updated the number of traceevents as the backward hook is no longer registered.

* Herring TF2 Native Gradient Tape SMDebugger support (aws#122)

* Fix bug in base hook (aws#127)

* Minor bugfixes/changes in rules (aws#126)

* minor bugfixes for rules

* Updating batch size rule (aws#123)

* fix for batch size rule

* Dataloader rule (aws#108)

* added dataloader rule and updated profiler report

* Redesign TF dataloader metrics collection (aws#92)

* Update profiler config parser to match latest SDK changes (aws#120)

* Replaced herringsinglenode command with smddpsinglenode (aws#129)

* Updating the version for profiler GA release (aws#124)

* Updating the version for profiler GA release

* Trigger Build

* Trigger Build

* Trigger Build

* Fix paths in profiler report (aws#131)

* changed path in profiler report

* fixed env variable (aws#132)

* Changing the trace event parser's info logs to debug level, as they are very verbose (aws#134)

* Only do detailed profiling for supported TF versions. (aws#135)

* Update PT tests (aws#136)

* Fix bug in parser (aws#137)

* smdistributed.dataparallel should be invoked from mpi command (aws#138)

* smdistributed.dataparallel should be invoked from mpi command

* Added comments

* Bugfix: Invalid Worker (aws#139)

* smdistributed.dataparallel environment check (aws#140)

* smdistributed.dataparallel environment check

* addressed comments

* Modified check_smdataparallel_env logic

* Install rules packages in PR CI (aws#143)

* Removed the files and folders that are not required in the public repository

* Removed the integration tests.

* Fixed the pre-commit checks

Co-authored-by: Vandana Kannan <[email protected]>
Co-authored-by: Vikas-kum <[email protected]>
Co-authored-by: Vandana Kannan <[email protected]>
Co-authored-by: Nathalie Rauschmayr <[email protected]>
Co-authored-by: Neelesh Dodda <[email protected]>
Co-authored-by: Rajan Singh <[email protected]>
Co-authored-by: sife <[email protected]>
Co-authored-by: Anirudh <[email protected]>
Co-authored-by: Vikas Kumar <[email protected]>
Co-authored-by: Anirudh <[email protected]>
Co-authored-by: Karan Jariwala <[email protected]>
Co-authored-by: Nihal Harish <[email protected]>
Co-authored-by: Miyoung <[email protected]>
Co-authored-by: Miyoung Choi <[email protected]>
Co-authored-by: Rahul Huilgol <[email protected]>
Co-authored-by: Connor Goggins <[email protected]>
Co-authored-by: JC-Gu <[email protected]>