forked from aws/amazon-sagemaker-examples
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* WIP, added tf and core
* WIP
* add code from all repos, and fix imports
* fix more imports, add tests
* add docs, examples
* fix imports in examples
* fix setup.py and CI
* fix test invoker
* Reload a step directory when it was last seen as empty (aws#117)
* fix imports
* fix new imports
* unskip test
* Add setup.py
* undo end of training merge
* remove import
* Add training end code
* add frameworks
* fix function used
* update setup to use append
* fixing small errors (aws#74)
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* trigger ci
* trigger ci
* trigger ci
* trigger ci
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* uploading test reports to s3
* uploading test reports to s3
* uploading test reports to s3
* uploading test reports to s3
* changes
* changes
* docs
* Add subpackages in core
* docs and examples
* provides trials and rules as part of main namescope
* move rules and trials outside
* fix training end tests, and update setup.py
* new readme for whole repo
* fix setup.py
* update packages
* make the mxnet tests faster
* reduce lenght of integration tests
* add script to build binaries
* update argument
* change num steps and frequency
* delete path
* add boto3
* fix training end tests
* changes
* move exceptions to its own module
* fix links
* update version string in setup.py
* uncommented test
* making the pytorch stuff up to date (aws#79)
* making the pytorch stuff up to date
* reverting util.py
* fixing the hook imports
* fixing test imports
* fix increment of step
* training_has_ended fix for pytorch (aws#80)
* making the pytorch stuff up to date
* Revert "making the pytorch stuff up to date" This reverts commit f87f9560b5351f135553072c495f2123964b9f3c.
* changing to training_has_ended
Showing 173 changed files with 15,536 additions and 267 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,31 @@ | ||
## Tornasole_core | ||
## Tornasole | ||
|
||
This repo is a part of code base of Tornasole: debugger and profile for Deep Learning training jobs and online/offline inference workloads. | ||
Tornasole is an upcoming AWS service designed to be a debugger | ||
for machine learning models. It lets you go beyond just looking | ||
at scalars like losses and accuracies during training and | ||
gives you full visibility into all tensors 'flowing through the graph' | ||
during training or inference. | ||
|
||
## License | ||
Using Tornasole is a two-step process: | ||
|
||
### Saving tensors | ||
|
||
This needs the `tornasole` package built for the appropriate framework. | ||
It allows you to collect the tensors you want at the frequency | ||
that you want, and save them for analysis. | ||
Please follow the appropriate Readme page to install the correct version. | ||
|
||
|
||
#### [Tornasole TensorFlow](docs/tensorflow/README.md) | ||
#### [Tornasole MXNet](docs/mxnet/README.md) | ||
#### [Tornasole PyTorch](docs/pytorch/README.md) | ||
|
||
### Analysis | ||
Please refer to **[this page](docs/analysis/README.md)** for more details about how to analyze the saved tensors. | ||
The analysis of these tensors can be done on a separate machine in parallel with the training job. | ||
|
||
## ContactUs | ||
We would like to hear from you. If you have any questions or feedback, please reach out to us at [email protected] | ||
|
||
## License | ||
This library is licensed under the Apache 2.0 License. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
#!/usr/bin/env bash | ||
|
||
VERSION='0.2.1' | ||
|
||
for FRAMEWORK in tensorflow mxnet pytorch | ||
do | ||
CAPITALIZED_FRAMEWORK=$(echo "$FRAMEWORK" | tr '[:lower:]' '[:upper:]') | ||
env "TORNASOLE_WITH_$CAPITALIZED_FRAMEWORK=1" python setup.py bdist_wheel --universal | ||
# aws s3 cp dist/tornasole-$VERSION-py2.py3-none-any.whl s3://tornasole-binaries-use1/tornasole_$FRAMEWORK/py3/ | ||
done |
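
The loop above builds one wheel per framework and selects the framework through an environment variable whose name is computed from the framework name. In a shell, a `NAME=value command` prefix is only parsed as an assignment when `NAME` is a literal identifier, so a computed name is usually handed to `env` instead. The following is a minimal sketch of that pattern, not the repository's script: `build_flag_name` and `run_with_flag` are hypothetical helpers, and the `sh -c` call is a stand-in for `python setup.py bdist_wheel --universal`.

```shell
#!/usr/bin/env bash
# Sketch only: build_flag_name and run_with_flag are hypothetical helpers,
# not part of the Tornasole repository.

build_flag_name() {
    # Upper-case the framework name and prepend the prefix used by the build script.
    printf 'TORNASOLE_WITH_%s' "$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')"
}

run_with_flag() {
    # env takes NAME=VALUE as an ordinary argument, so the name may be computed.
    flag_name=$(build_flag_name "$1")
    shift
    env "$flag_name=1" "$@"
}

for framework in tensorflow mxnet pytorch; do
    # Stand-in for the real build command: print the flag the child process sees.
    run_with_flag "$framework" sh -c 'env | grep "^TORNASOLE_WITH_"'
done
```

Because the flag is set only in the environment of the child process, it does not leak from one loop iteration into the next.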
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,28 +1,8 @@ | ||
if [ -z "$framework" ] | ||
then | ||
echo "framework is not mentioned" | ||
exit 1 | ||
fi | ||
#!/usr/bin/env bash | ||
|
||
if [ "$framework" = "tensorflow" ] | ||
then | ||
echo "Launching testing job using $framework framework" | ||
#export TORNASOLE_LOG_LEVEL=debug | ||
TORNASOLE_LOG_LEVEL=debug python -m pytest --html=upload/$CURRENT_COMMIT_PATH/reports/report.html --self-contained-html tests/ | ||
TORNASOLE_LOG_LEVEL=debug python -m pytest --html=upload/$CURRENT_COMMIT_PATH/reports/test_rules_tensorflow.html --self-contained-html -s tests/analysis/integration_testing_rules.py::test_test_rules --mode tensorflow --path_to_config ./tests/analysis/config.yaml | ||
TORNASOLE_LOG_LEVEL=debug python -m pytest --html=upload/$CURRENT_COMMIT_PATH/reports/test_rules_mxnet.html --self-contained-html -s tests/analysis/integration_testing_rules.py::test_test_rules --mode mxnet --path_to_config ./tests/analysis/config.yaml | ||
TORNASOLE_LOG_LEVEL=debug python -m pytest --html=upload/$CURRENT_COMMIT_PATH/reports/test_rules_pytorch.html --self-contained-html -s tests/analysis/integration_testing_rules.py::test_test_rules --mode pytorch --path_to_config ./tests/analysis/config.yaml | ||
|
||
|
||
elif [ "$framework" = "mxnet" ] | ||
then | ||
echo "Launching testing job using $framework framework" | ||
|
||
elif [ "$framework" = "pytorch" ] | ||
then | ||
echo "Launching testing job using $framework framework" | ||
|
||
|
||
else | ||
echo "$framework framework not supported!!!" | ||
exit 1 | ||
|
||
fi | ||
|
||
export TORNASOLE_LOG_LEVEL=debug | ||
python -m pytest tests/ |