One Repo to rule them all (aws#72)
* WIP, added tf and core

* WIP

* add code from all repos, and fix imports

* fix more imports, add tests

* add docs, examples

* fix imports in examples

* fix setup.py and CI

* fix test invoker

* Reload a step directory when it was last seen as empty (aws#117)

* fix imports

* fix new imports

* unskip test

* Add setup.py

* undo end of training merge

* remove import

* Add training end code

* add frameworks

* fix function used

* update setup to use append

* fixing small errors (aws#74)

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* trigger ci

* trigger ci

* trigger ci

* trigger ci

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* uploading test reports to s3

* uploading test reports to s3

* uploading test reports to s3

* uploading test reports to s3

* changes

* changes

* docs

* Add subpackages in core

* docs and examples

* provides trials and rules as part of main namescope

* move rules and trials outside

* fix training end tests, and update setup.py

* new readme for whole repo

* fix setup.py

* update packages

* make the mxnet tests faster

* reduce length of integration tests

* add script to build binaries

* update argument

* change num steps and frequency

* delete path

* add boto3

* fix training end tests

* changes

* move exceptions to its own module

* fix links

* update version string in setup.py

* uncommented test

* making the pytorch stuff up to date (aws#79)

* making the pytorch stuff up to date

* reverting util.py

* fixing the hook imports

* fixing test imports

* fix increment of step

* training_has_ended fix for pytorch (aws#80)

* making the pytorch stuff up to date

* Revert "making the pytorch stuff up to date"

This reverts commit f87f9560b5351f135553072c495f2123964b9f3c.

* changing to training_has_ended
rahul003 authored Aug 14, 2019
1 parent c2112aa commit 494498e
Showing 173 changed files with 15,536 additions and 267 deletions.
30 changes: 27 additions & 3 deletions README.md
@@ -1,7 +1,31 @@
## Tornasole_core
## Tornasole

This repo is part of the code base of Tornasole: a debugger and profiler for deep learning training jobs and online/offline inference workloads.
Tornasole is an upcoming AWS service designed to be a debugger
for machine learning models. It lets you go beyond just looking
at scalars like losses and accuracies during training and
gives you full visibility into all tensors 'flowing through the graph'
during training or inference.

## License
Using Tornasole is a two-step process:

### Saving tensors

This needs the `tornasole` package built for the appropriate framework.
It allows you to collect the tensors you want at the frequency
that you want, and save them for analysis.
Please follow the appropriate Readme page to install the correct version.


#### [Tornasole TensorFlow](docs/tensorflow/README.md)
#### [Tornasole MXNet](docs/mxnet/README.md)
#### [Tornasole PyTorch](docs/pytorch/README.md)

### Analysis
Please refer to **[this page](docs/analysis/README.md)** for more details about how to analyze the saved tensors.
The analysis of these tensors can be done on a separate machine in parallel with the training job.

## Contact us
We would like to hear from you. If you have any questions or feedback, please reach out to us at [email protected]

## License
This library is licensed under the Apache 2.0 License.
10 changes: 10 additions & 0 deletions bin/build_binaries.sh
@@ -0,0 +1,10 @@
#!/usr/bin/env bash

VERSION='0.2.1'  # no spaces around '=' in a shell assignment

for FRAMEWORK in tensorflow mxnet pytorch
do
CAPITALIZED_FRAMEWORK=`echo "$FRAMEWORK" | tr '[a-z]' '[A-Z]'`
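# A NAME=value prefix whose name contains an expansion is not parsed as an assignment, so pass it through env.
env "TORNASOLE_WITH_${CAPITALIZED_FRAMEWORK}=1" python setup.py bdist_wheel --universal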
# aws s3 cp dist/tornasole-$VERSION-py2.py3-none-any.whl s3://tornasole-binaries-use1/tornasole_$FRAMEWORK/py3/
done
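
In bash, a `NAME=value` prefix counts as an assignment only when `NAME` is a literal identifier, so a per-framework flag whose name is computed at run time is usually passed through `env`. The following is a minimal sketch of that idiom, not the project's build script: the helper name `build_flag_for` is hypothetical, and the `sh -c` child process stands in for the real `python setup.py bdist_wheel` call.

```shell
# Hypothetical helper: run a child process with TORNASOLE_WITH_<FRAMEWORK>=1 set.
# The body runs in a subshell, so `upper` does not leak into the caller.
build_flag_for() (
  # Upper-case the framework name, e.g. mxnet -> MXNET.
  upper=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
  # env(1) sees the already-expanded word, so the variable name may be computed.
  # The child here only lists the flag it received; a real build would go in its place.
  env "TORNASOLE_WITH_${upper}=1" sh -c 'env | grep "^TORNASOLE_WITH_"'
)

for framework in tensorflow mxnet pytorch
do
  build_flag_for "$framework"
done
```

Provided no `TORNASOLE_WITH_*` variable is already set in the calling environment, each iteration should print a single line naming the flag for that framework.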
20 changes: 6 additions & 14 deletions config/buildspec.yml
@@ -4,14 +4,13 @@ version: 0.2
phases:
install:
commands:
- . config/get-branch.sh #EXPORTS BRANCHES FOR OTHER REPOS AND CURRENT REPO.
- . config/get-branch.sh #don't need this
- su && apt-get update
- apt-get install sudo
- sudo apt-get update
- sudo apt-get install unzip
- cd $CODEBUILD_SRC_DIR && chmod +x config/protoc_downloader.sh && ./config/protoc_downloader.sh
- pip install pytest
- pip install wheel
- pip install pytest wheel pyYaml pytest-html tensorflow mxnet torch
- pip uninstall -y boto3 && pip uninstall -y aiobotocore && pip uninstall -y botocore

pre_build:
@@ -20,18 +19,11 @@ phases:
build:
commands:
- cd $CODEBUILD_SRC_DIR && python setup.py bdist_wheel --universal && pip install dist/*.whl && cd ..
- if [ $framework = "tensorflow" ] ; then cd $CODEBUILD_SRC_DIR_tornasole_tf && git checkout $TF_BRANCH && python setup.py bdist_wheel --universal && pip install dist/*.whl && cd .. ; fi
- if [ $framework = "mxnet" ] ; then cd $CODEBUILD_SRC_DIR_tornasole_mxnet && git checkout $MXNET_BRANCH && python setup.py bdist_wheel --universal && pip install dist/*.whl && cd .. ; fi
- if [ $framework = "pytorch" ] ; then cd $CODEBUILD_SRC_DIR_tornasole_tf && git checkout $TF_BRANCH && cd tornasole_pytorch && python setup.py bdist_wheel --universal && pip install dist/*.whl && cd .. ; fi
- cd $CODEBUILD_SRC_DIR_tornasole_rules && git checkout $RULES_BRANCH && python setup.py bdist_wheel --universal && pip install dist/*.whl && cd ..
- cd $CODEBUILD_SRC_DIR && chmod +x config/tests.sh && ./config/tests.sh && mkdir -p $CURRENT_COMMIT_PATH && cp ./dist/*.whl $CURRENT_COMMIT_PATH && cd ..
- cd $CODEBUILD_SRC_DIR_tornasole_rules && git checkout $RULES_BRANCH && chmod +x config/tests.sh && ./config/tests.sh && mkdir -p $RULES_PATH && cp ./dist/*.whl $RULES_PATH && cd ..
- if [ $framework = "tensorflow" ] ; then cd $CODEBUILD_SRC_DIR_tornasole_tf && git checkout $TF_BRANCH && chmod +x config/tests.sh && ./config/tests.sh && mkdir -p $TF_PATH && cp ./dist/*.whl $TF_PATH && cd .. ; fi
- if [ $framework = "mxnet" ] ; then cd $CODEBUILD_SRC_DIR_tornasole_mxnet && git checkout $MXNET_BRANCH && chmod +x config/tests.sh && ./config/tests.sh && mkdir -p $MXNET_PATH && cp ./dist/*.whl $MXNET_PATH && cd .. ; fi
- if [ $framework = "pytorch" ] ; then cd $CODEBUILD_SRC_DIR_tornasole_tf && git checkout $TF_BRANCH && cd tornasole_pytorch && chmod +x config/tests.sh && ./config/tests.sh && mkdir -p $TF_PATH && cp ./dist/*.whl $TF_PATH && cd .. ; fi
- if [ "$CODEBUILD_GIT_BRANCH" = "master" ] && [ "$CODEBUILD_WEBHOOK_EVENT" = "PUSH" ] ; then aws s3 cp $CODEBUILD_SRC_DIR/wheels s3://tornasolecodebuildtest/ --recursive ; fi

- cd $CODEBUILD_SRC_DIR && chmod +x config/tests.sh && ./config/tests.sh && mkdir -p upload/$CURRENT_COMMIT_PATH/wheels && cp ./dist/*.whl upload/$CURRENT_COMMIT_PATH/wheels && cd ..
- aws s3 cp $CODEBUILD_SRC_DIR/upload s3://tornasolecodebuildtest/ --recursive
#if [ "$CODEBUILD_GIT_BRANCH" = "master" ] && [ "$CODEBUILD_WEBHOOK_EVENT" = "PUSH" ] ; then
post_build:
commands:
- if [ "$CODEBUILD_BUILD_SUCCEEDING" -eq 0 ]; then echo "ERROR BUILD FAILED , ACCESS BUILD LOGS THROUGH GITHUB OR THROUGH THE LINK:$CODEBUILD_BUILD_URL" ; fi
- if [ "$CODEBUILD_BUILD_SUCCEEDING" -eq 1 ]; then echo "INFO BUILD SUCCEEDED !!! , ACCESS BUILD LOGS THROUGH GITHUB OR THROUGH THE LINK:$CODEBUILD_BUILD_URL" ; fi

5 changes: 0 additions & 5 deletions config/configure_branch_for_test.txt

This file was deleted.

100 changes: 1 addition & 99 deletions config/get-branch.sh
@@ -1,114 +1,16 @@
#$CODEBUILD_WEBHOOK_BASE_REF IS DESTINATION BRANCH FOR PR.
#$CODEBUILD_GIT_BRANCH IS CURRENT BRANCH FOR THE REPO WHICH TRIGGERED BUILD.
core_repo="tornasole_core"
rules_repo="tornasole_rules"
tf_repo="tornasole_tf"
mxnet_repo="tornasole_mxnet"



if [ -z "${CODEBUILD_BUILD_IMAGE##*tensorflow*}" ] ; then export framework="tensorflow";
elif [ -z "${CODEBUILD_BUILD_IMAGE##*mxnet*}" ] ; then export framework="mxnet";
elif [ -z "${CODEBUILD_BUILD_IMAGE##*pytorch*}" ] ; then export framework="pytorch";
fi

export CODEBUILD_GIT_BRANCH="$(git symbolic-ref HEAD --short 2>/dev/null)"
if [ "$CODEBUILD_GIT_BRANCH" = "" ] ; then
CODEBUILD_GIT_BRANCH="$(git branch -a --contains HEAD | sed -n 2p | awk '{ printf $1 }')";
export CODEBUILD_GIT_BRANCH=${CODEBUILD_GIT_BRANCH#remotes/origin/};
fi
SUBSTRING=$(echo $CODEBUILD_WEBHOOK_BASE_REF| cut -d'/' -f 3)
BRANCH=''
if [ "$CODEBUILD_WEBHOOK_EVENT" = "PULL_REQUEST_CREATED" ] || [ "$CODEBUILD_WEBHOOK_EVENT" = "PULL_REQUEST_REOPENED" ] || [ "$CODEBUILD_WEBHOOK_EVENT" = "PULL_REQUEST_UPDATED" ] || [ "$CODEBUILD_WEBHOOK_EVENT" = "PULL_REQUEST_MERGED" ] && [ "$CODEBUILD_WEBHOOK_EVENT" != "PUSH" ]; then
BRANCH=$SUBSTRING

elif [ "$CODEBUILD_WEBHOOK_EVENT" != "PULL_REQUEST_CREATED" ] && [ "$CODEBUILD_WEBHOOK_EVENT" != "PULL_REQUEST_REOPENED" ] && [ "$CODEBUILD_WEBHOOK_EVENT" != "PULL_REQUEST_UPDATED" ] && [ "$CODEBUILD_WEBHOOK_EVENT" != "PULL_REQUEST_MERGED" ] && [ "$CODEBUILD_GIT_BRANCH" != "alpha" ] && [ "$CODEBUILD_GIT_BRANCH" != "master" ] ; then
cd $CODEBUILD_SRC_DIR && git checkout $CODEBUILD_GIT_BRANCH
if [ $(git merge-base --is-ancestor $CODEBUILD_GIT_BRANCH "alpha" ; echo $?) -eq 1 ]; then
BRANCH='alpha'

elif [ $(git merge-base --is-ancestor $CODEBUILD_GIT_BRANCH "alpha" ; echo $?) -eq 0 ]; then
BRANCH='master'

fi
cd ..

else BRANCH=$CODEBUILD_GIT_BRANCH
fi

TF_BRANCH=$BRANCH ;
CORE_BRANCH=$BRANCH ;
RULES_BRANCH=$BRANCH ;
MXNET_BRANCH=$BRANCH ;


if [ "$CODEBUILD_GIT_BRANCH" != "alpha" ] && [ "$CODEBUILD_GIT_BRANCH" != "master" ] && [ "$CODEBUILD_WEBHOOK_EVENT" != "PUSH" ] ; then
file="config/configure_branch_for_test.txt"
while IFS=: read -r repo_name default_or_branchname
do
if [ "$repo_name" = "$tf_repo" ] && [ "$default_or_branchname" != "default" ]; then
TF_BRANCH=$default_or_branchname
elif [ "$repo_name" = "$mxnet_repo" ] && [ "$default_or_branchname" != "default" ] ; then
MXNET_BRANCH=$default_or_branchname
elif [ "$repo_name" = "$rules_repo" ] && [ "$default_or_branchname" != "default" ] ; then
RULES_BRANCH=$default_or_branchname
elif [ "$repo_name" = "$core_repo" ] && [ "$default_or_branchname" != "default" ] ; then
CORE_BRANCH=$default_or_branchname
fi

done <"$file"
fi

cd $CODEBUILD_SRC_DIR && git checkout $CODEBUILD_GIT_BRANCH
export CURRENT_COMMIT_HASH=$(git log -1 --pretty=%h);
export CURRENT_COMMIT_DATE="$(git show -s --format=%ci | cut -d' ' -f 1)_$(git show -s --format=%ci | cut -d' ' -f 2)";
export CURRENT_REPO_NAME=$(basename `git rev-parse --show-toplevel`) ;
export CURRENT_COMMIT_PATH="$CODEBUILD_SRC_DIR/wheels/$CURRENT_COMMIT_DATE/$CURRENT_REPO_NAME/$CURRENT_COMMIT_HASH"
export CURRENT_COMMIT_PATH="$CURRENT_COMMIT_DATE/$CURRENT_REPO_NAME/$CURRENT_COMMIT_HASH"
cd ..

if [ "$CURRENT_REPO_NAME" != "$core_repo" ]; then
cd $CODEBUILD_SRC_DIR_tornasole_core && git checkout $CORE_BRANCH
export CORE_REPO_NAME=$(basename `git rev-parse --show-toplevel`) ;
export CORE_COMMIT_HASH=$(git log -1 --pretty=%h);
export CORE_COMMIT_DATE="$(git show -s --format=%ci | cut -d' ' -f 1)_$(git show -s --format=%ci | cut -d' ' -f 2)";
export CORE_PATH="$CODEBUILD_SRC_DIR/wheels/$CORE_COMMIT_DATE/$CORE_REPO_NAME/$CORE_COMMIT_HASH"
cd ..
fi

if [ "$CURRENT_REPO_NAME" != "$rules_repo" ]; then
cd $CODEBUILD_SRC_DIR_tornasole_rules && git checkout $RULES_BRANCH
export RULES_REPO_NAME=$(basename `git rev-parse --show-toplevel`) ;
export RULES_COMMIT_HASH=$(git log -1 --pretty=%h);
export RULES_COMMIT_DATE="$(git show -s --format=%ci | cut -d' ' -f 1)_$(git show -s --format=%ci | cut -d' ' -f 2)";
export RULES_PATH="$CODEBUILD_SRC_DIR/wheels/$RULES_COMMIT_DATE/$RULES_REPO_NAME/$RULES_COMMIT_HASH"
cd ..
fi

if [ "$CURRENT_REPO_NAME" != "$mxnet_repo" ]; then
cd $CODEBUILD_SRC_DIR_tornasole_mxnet && git checkout $MXNET_BRANCH
export MXNET_REPO_NAME=$(basename `git rev-parse --show-toplevel`) ;
export MXNET_COMMIT_HASH=$(git log -1 --pretty=%h);
export MXNET_COMMIT_DATE="$(git show -s --format=%ci | cut -d' ' -f 1)_$(git show -s --format=%ci | cut -d' ' -f 2)";
export MXNET_PATH="$CODEBUILD_SRC_DIR/wheels/$MXNET_COMMIT_DATE/$MXNET_REPO_NAME/$MXNET_COMMIT_HASH"
cd ..
fi

if [ "$CURRENT_REPO_NAME" != "$tf_repo" ]; then
cd $CODEBUILD_SRC_DIR_tornasole_tf && git checkout $TF_BRANCH
export TF_REPO_NAME=$(basename `git rev-parse --show-toplevel`) ;
export TF_COMMIT_HASH=$(git log -1 --pretty=%h);
export TF_COMMIT_DATE="$(git show -s --format=%ci | cut -d' ' -f 1)_$(git show -s --format=%ci | cut -d' ' -f 2)";
export TF_PATH="$CODEBUILD_SRC_DIR/wheels/$TF_COMMIT_DATE/$TF_REPO_NAME/$TF_COMMIT_HASH"
cd ..
fi

export TF_BRANCH ;
export CORE_BRANCH ;
export RULES_BRANCH ;
export MXNET_BRANCH ;




export CODEBUILD_ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' --output text)
export CODEBUILD_PROJECT=${CODEBUILD_BUILD_ID%:$CODEBUILD_LOG_PATH}
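
The framework detection in this script relies on a POSIX parameter-expansion idiom: `${var##*word*}` expands to the empty string exactly when `var` contains `word`, so `[ -z ... ]` acts as a substring test. A minimal sketch of the idiom follows. The function name `detect_framework` and the image names are made up for illustration, and the empty-input guard is an addition: without it, an unset or empty value also expands to the empty string and would match the first branch.

```shell
# Hypothetical helper: map a build-image name to a framework name.
detect_framework() {
  image=$1
  # Guard (an addition in this sketch): an empty name would otherwise pass the first -z test.
  if [ -z "$image" ]; then echo unknown; return; fi
  # ${image##*tensorflow*} strips the longest match of the pattern from the front;
  # the result is empty only if "tensorflow" occurs somewhere in $image.
  if [ -z "${image##*tensorflow*}" ]; then echo tensorflow
  elif [ -z "${image##*mxnet*}" ]; then echo mxnet
  elif [ -z "${image##*pytorch*}" ]; then echo pytorch
  else echo unknown
  fi
}

detect_framework "example/mxnet-training:1.4"   # image name is made up
```

A `case "$image" in *mxnet*) ... esac` statement would express the same test and is often easier to read. The expansion form is shown here because it is the one the script uses.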
3 changes: 1 addition & 2 deletions config/protoc_downloader.sh
@@ -5,5 +5,4 @@ PROTOC_ZIP=protoc-3.7.1-linux-x86_64.zip
curl -OL https://github.com/google/protobuf/releases/download/v3.7.1/$PROTOC_ZIP
unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
unzip -o $PROTOC_ZIP -d /usr/local include/*
rm -f $PROTOC_ZIP

rm -f $PROTOC_ZIP
32 changes: 6 additions & 26 deletions config/tests.sh
@@ -1,28 +1,8 @@
if [ -z "$framework" ]
then
echo "framework is not mentioned"
exit 1
fi
#!/usr/bin/env bash

if [ "$framework" = "tensorflow" ]
then
echo "Launching testing job using $framework framework"
#export TORNASOLE_LOG_LEVEL=debug
TORNASOLE_LOG_LEVEL=debug python -m pytest --html=upload/$CURRENT_COMMIT_PATH/reports/report.html --self-contained-html tests/
TORNASOLE_LOG_LEVEL=debug python -m pytest --html=upload/$CURRENT_COMMIT_PATH/reports/test_rules_tensorflow.html --self-contained-html -s tests/analysis/integration_testing_rules.py::test_test_rules --mode tensorflow --path_to_config ./tests/analysis/config.yaml
TORNASOLE_LOG_LEVEL=debug python -m pytest --html=upload/$CURRENT_COMMIT_PATH/reports/test_rules_mxnet.html --self-contained-html -s tests/analysis/integration_testing_rules.py::test_test_rules --mode mxnet --path_to_config ./tests/analysis/config.yaml
TORNASOLE_LOG_LEVEL=debug python -m pytest --html=upload/$CURRENT_COMMIT_PATH/reports/test_rules_pytorch.html --self-contained-html -s tests/analysis/integration_testing_rules.py::test_test_rules --mode pytorch --path_to_config ./tests/analysis/config.yaml


elif [ "$framework" = "mxnet" ]
then
echo "Launching testing job using $framework framework"

elif [ "$framework" = "pytorch" ]
then
echo "Launching testing job using $framework framework"


else
echo "$framework framework not supported!!!"
exit 1

fi

export TORNASOLE_LOG_LEVEL=debug
python -m pytest tests/