Standardize rule constructor to take trials/strings. Improve rule invoker (aws#77)

* WIP, added tf and core

* WIP

* add code from all repos, and fix imports

* fix more imports, add tests

* add docs, examples

* fix imports in examples

* fix setup.py and CI

* fix test invoker

* Reload a step directory when it was last seen as empty (aws#117)

* fix imports

* fix new imports

* unskip test

* Add setup.py

* undo end of training merge

* remove import

* Add training end code

* add frameworks

* fix function used

* update setup to use append

* fixing small errors (aws#74)

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* trigger ci

* trigger ci

* trigger ci

* trigger ci

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* testing

* uploading test reports to s3

* uploading test reports to s3

* uploading test reports to s3

* uploading test reports to s3

* changes

* changes

* docs

* Add subpackages in core

* docs and examples

* provides trials and rules as part of main namescope

* move rules and trials outside

* fix training end tests, and update setup.py

* new readme for whole repo

* fix setup.py

* update packages

* make the mxnet tests faster

* reduce length of integration tests

* add script to build binaries

* update argument

* change num steps and frequency

* delete path

* add boto3

* fix training end tests

* changes

* refactor rules, rule_invoker to take any argument from the rule

* cleanup

* move exceptions to its own module

* fix links

* update readme

* update version string in setup.py

* uncommented test

* making the pytorch stuff up to date (aws#79)

* making the pytorch stuff up to date

* reverting util.py

* fixing the hook imports

* fixing test imports

* fix increment of step

* training_has_ended fix for pytorch (aws#80)

* making the pytorch stuff up to date

* Revert "making the pytorch stuff up to date"

This reverts commit f87f9560b5351f135553072c495f2123964b9f3c.

* changing to training_has_ended

* fix comparison in allzero

* Unchanged tensor rule (aws#82)

* WIP unchanged tensor

* WIP

* WIP2

* add test for the rule

* fix steps mess in rule

* add rule invocation to readme

* add doc

* review comments

* remove new exception
rahul003 authored Aug 14, 2019
1 parent 494498e commit 36da831
Showing 14 changed files with 305 additions and 107 deletions.
32 changes: 32 additions & 0 deletions docs/analysis/README.md
@@ -285,6 +285,12 @@ class VanishingGradientRule(Rule):
super().__init__(base_trial, other_trials=None)
self.threshold = threshold
```

Please note that apart from `base_trial` and `other_trials` (if required), every argument of
the rule constructor must accept a string as its value. For example, to pass
a list of strings, pass them as a single comma-separated string. This restriction is
enforced so that you can create and invoke rules from JSON using SageMaker's APIs.
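
As a rough illustration of the string-only convention, the sketch below builds a rule from a JSON payload. `ExampleThresholdRule` is a hypothetical stand-in written for this example, not a Tornasole class, and `base_trial=None` is a placeholder for a real trial object.

```python
import json


class ExampleThresholdRule:
    """Hypothetical stand-in for a rule; not part of the Tornasole library."""

    def __init__(self, base_trial, collection_names='', threshold='0.0000001'):
        self.base_trial = base_trial
        # A list arrives as one comma-separated string and is split here.
        self.collection_names = [c for c in collection_names.split(',') if c]
        # Numeric values arrive as strings and are converted inside the rule.
        self.threshold = float(threshold)


# Every value in the JSON payload is a string, so it maps directly to kwargs.
params = json.loads('{"collection_names": "weights,bias", "threshold": "0.001"}')
rule = ExampleThresholdRule(base_trial=None, **params)
print(rule.collection_names, rule.threshold)
```

Doing the conversion inside the constructor keeps the caller free of type handling, which is what makes a flat JSON object of strings sufficient to describe a rule.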

##### RequiredTensors

Next you need to implement a method which lets Tornasole know what tensors you
@@ -443,6 +449,32 @@ tensor_regex = ['input*']
allzero = AllZero(base_trial=trial_obj, collection_names=collections, tensor_regex=tensor_regex)
```

##### UnchangedTensor
This rule helps identify tensors that do not change across steps.
It uses `numpy.allclose` to check whether a tensor is unchanged.
It takes the following arguments:

- `base_trial`: The trial whose execution will invoke the rule. The rule will inspect the tensors gathered during this trial.
- `collection_names`: The list of collection names. The rule will inspect the tensors that belong to these collections.
- `tensor_regex`: The list of regex patterns. The rule will inspect the tensors that match the regex patterns specified in this list.
- `num_steps`: Number of steps across which we check whether the tensor has changed. The rule checks the last `num_steps` steps that are available;
they need not be consecutive.
For example, if `num_steps` is 2, at step `s` the rule does not necessarily compare steps `s-1` and `s`.
If `s-1` is not available, it compares the last available step with `s`.
- `rtol`: The relative tolerance parameter, as a float, to be passed to `numpy.allclose`.
- `atol`: The absolute tolerance parameter, as a float, to be passed to `numpy.allclose`.
- `equal_nan`: Whether to compare NaNs as equal. If True, NaNs in one tensor are considered
equal to NaNs at the same positions in the other. This is passed to `numpy.allclose`.

For this rule, users must specify either the `collection_names` or `tensor_regex` parameter.
If both parameters are specified, the rule inspects the union of the tensors they select.

```
from tornasole.rules.generic import UnchangedTensor
ut = UnchangedTensor(base_trial=trial_obj, tensor_regex=['.*'], num_steps=3)
```
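
The `num_steps` behaviour described above can be sketched in plain Python. This is only an illustration under stated assumptions: `is_close` and `unchanged` are hypothetical helpers, tensors are modelled as flat lists of floats, the comparison follows the form documented for `numpy.allclose` (`|a - b| <= atol + rtol * |b|`), NaN handling is omitted, and the rule is assumed to fire only once `num_steps` steps are available.

```python
def is_close(a, b, rtol=1e-05, atol=1e-08):
    """Element-wise closeness test in the form documented for numpy.allclose."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))


def unchanged(history, step, num_steps=3):
    """history maps step number -> tensor value (a flat list of floats)."""
    # Take the last num_steps available steps up to `step`; gaps are allowed.
    available = sorted(s for s in history if s <= step)[-num_steps:]
    if len(available) < num_steps:
        return False  # assumption: too few steps means nothing to report
    # Compare each available step with the next available one.
    return all(is_close(history[p], history[q])
               for p, q in zip(available, available[1:]))


history = {0: [1.0, 1.0], 2: [1.0, 1.0], 4: [1.0, 2.0]}
print(unchanged(history, step=2, num_steps=2))  # steps 0 and 2 are compared
print(unchanged(history, step=4, num_steps=2))  # steps 2 and 4 are compared
```

Note that steps 0 and 2 are compared even though step 1 was never saved, which is the non-consecutive case the argument description refers to.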

## Examples
We have a few example scripts and notebooks to help you get started. Please go to the `examples` folder.

12 changes: 10 additions & 2 deletions examples/tensorflow/training_scripts/resnet50/README.md
@@ -112,7 +112,8 @@ python train_imagenet_resnet_hvd.py --clear_log --enable_tornasole \
--tornasole_path ~/ts_outputs/vanishing
```

You can monitor the exploding tensors by doing the following
##### Rule: VanishingGradient
You can monitor vanishing gradients by doing the following
```
python -m tornasole.rules.rule_invoker --trial-dir ~/ts_outputs/vanishing --rule-name VanishingGradient
```
@@ -141,7 +142,7 @@ python train_imagenet_resnet_hvd.py --clear_log --enable_tornasole \
--tornasole_step_interval 1 \
--tornasole_path ~/ts_outputs/weights
```

##### Rule: WeightUpdateRatio
You can invoke the rule to
monitor the ratio of weights to updates every step.
A quick way to invoke the rule is like this:
@@ -152,6 +153,13 @@ If you want to customize the thresholds, refer to the example in `analysis`:
[`examples/analysis/scripts/weight_update_ratio.py`](examples/analysis/scripts/weight_update_ratio.py)
and the [Rule API](docs/analysis/README.md#rules-api)

##### Rule: UnchangedTensor
You can also invoke this rule to
monitor whether tensors remain unchanged across steps. Here we pass `.*` as the `tensor_regex` to monitor all tensors.
```
python -m tornasole.rules.rule_invoker --trial-dir ~/ts_outputs/weights --rule-name UnchangedTensor --tensor_regex '.*'
```

#### Running with tornasole disabled
```
python train_imagenet_resnet_hvd.py --clear_log
4 changes: 2 additions & 2 deletions tests/analysis/config.yaml
@@ -131,15 +131,15 @@
- [*mnist_gluon_basic_hook_demo,
--random_seed True --num_steps 5,
*invoker,
--rule_name allzero --flag False --end_step 3 --collection weights
--rule_name allzero --flag False --end_step 3 --collection_names weights
]
- # all_zero/mxnet/true
- mxnet
- *Enable
- [*mnist_gluon_all_zero_demo,
--random_seed True --num_steps 5,
*invoker,
--rule_name allzero --flag True --end_step 3 --collections weights --collections ReluActivation --collections bias
--rule_name allzero --flag True --end_step 3 --collection_names weights,ReluActivation,bias
]

# test cases for pytorch
70 changes: 23 additions & 47 deletions tests/analysis/invoker.py
@@ -1,10 +1,13 @@
from tornasole.exceptions import *
from tornasole.core.utils import get_logger
from tornasole.rules.rule_invoker import create_rule

logger = get_logger()

def invoke_rule(rule_obj, flag, start_step, end_step):
step = start_step if start_step is not None else 0
logger.info('Started execution of rule {}'.format(type(rule_obj).__name__))
logger.info('Started execution of rule {} at step {}'
.format(type(rule_obj).__name__, step))
return_false = False
while (end_step is None) or (step < end_step): # if end_step is not provided, do infinite checking
try:
@@ -29,58 +32,31 @@ def invoke_rule(rule_obj, flag, start_step, end_step):
# if flag is False, return_false should be True after the loop
if flag == 'False':
assert return_false
logger.info('Ending execution of rule {} with step={} '.format(rule_obj.__class__.__name__, step))
logger.info('Ending execution of rule {} with step={} '
.format(rule_obj.__class__.__name__, step - 1))


if __name__ == '__main__':
import argparse
from tornasole.trials import create_trial

parser = argparse.ArgumentParser()
parser.add_argument('--tornasole_path', type=str)
parser.add_argument('--rule_name', type=str)
parser = argparse.ArgumentParser(description='Rule invoker takes the below arguments and '
'any argument taken by the rules. The arguments not '
'mentioned below are automatically passed when '
'creating the rule objects.')
parser.add_argument('--tornasole_path', type=str, required=True)
parser.add_argument('--rule_name', type=str, required=True)
parser.add_argument('--other-trials', type=str,
help='comma separated paths for '
'other trials taken by the rule')
parser.add_argument('--start_step', type=int)
parser.add_argument('--end_step', type=int)
parser.add_argument('--flag', type=str, default=None)

parser.add_argument('--weightupdateratio_large_threshold', type=float, default=10)
parser.add_argument('--weightupdateratio_small_threshold', type=float, default=0.00000001)

parser.add_argument('--vanishinggradient_threshold', type=float, default=0.0000001)
parser.add_argument('--collections', default=[], type=str, action='append',
help="""List of collection names. The rule will inspect tensors belonging to those collections. Required for allzero
rule.""")
parser.add_argument('--tensor-regex', default=[], type=str, action='append',
help="""List of regex patterns. The rule will inspect tensors that match these
patterns. Required for allzero
rule.""")
parsed, unknown = parser.parse_known_args()
for arg in unknown:
if arg.startswith('--'):
parser.add_argument(arg, type=str)
args = parser.parse_args()
if args.rule_name is None:
raise RuntimeError('Needs rule name to invoke')

tr = create_trial(args.tornasole_path, range_steps=(args.start_step, args.end_step))
if args.rule_name.lower() == 'vanishinggradient':
from tornasole.rules.generic import VanishingGradient
r = VanishingGradient(tr, threshold=args.vanishinggradient_threshold)
elif args.rule_name.lower() == 'explodingtensor':
from tornasole.rules.generic import ExplodingTensor
r = ExplodingTensor(tr)
elif args.rule_name.lower() == 'weightupdateratio':
from tornasole.rules.generic import WeightUpdateRatio
r = WeightUpdateRatio(tr,
large_threshold=args.weightupdateratio_large_threshold,
small_threshold=args.weightupdateratio_small_threshold)
elif args.rule_name.lower() == 'allzero':
if len(args.collections) == 0 and len(args.tensor_regex) == 0:
raise ValueError('Please provide either the list of collection names or list of regex patterns for invoking '
'this rule.')
from tornasole.rules.generic import AllZero
r = AllZero(tr, args.collections, args.tensor_regex)
else:
raise ValueError('Please invoke any rules which take multiple trials, '
'or custom rules by passing the rule object to '
'invoke_rule() function. We do not currently '
'support running such rules from this python script.'
'Please refer to examples/scripts/ for examples'
'on how to call invoke_rule')
args_dict = vars(args)
# to standardize args for create_rule function
args.trial_dir = args.tornasole_path
r = create_rule(args, args_dict)
invoke_rule(r, flag=args.flag, start_step=args.start_step, end_step=args.end_step)
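
The pass-through behaviour of the new invoker, where any flag the parser does not declare is registered as a string argument and forwarded to the rule, can be sketched with `argparse`'s `parse_known_args`. This is a minimal sketch: `parse_with_passthrough` is a hypothetical helper name, and it assumes flags are given in the space-separated `--name value` form.

```python
import argparse


def parse_with_passthrough(argv):
    """Parse declared flags and accept any other --flag as a string value."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--rule_name', type=str, required=True)
    # First pass: collect the arguments the parser does not know about.
    _, unknown = parser.parse_known_args(argv)
    for arg in unknown:
        if arg.startswith('--'):
            # Register each unknown flag so the second pass accepts it.
            parser.add_argument(arg, type=str)
    # Second pass: every flag is now declared, values stay as strings.
    return vars(parser.parse_args(argv))


args_dict = parse_with_passthrough(
    ['--rule_name', 'VanishingGradient', '--threshold', '0.001'])
print(args_dict)
```

Because every forwarded value stays a string, this approach only works when rule constructors accept strings, which is the constraint the README change in this commit documents.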
46 changes: 43 additions & 3 deletions tests/analysis/rules/test_invoker.py
@@ -7,7 +7,11 @@
from tornasole.exceptions import *
from tornasole.rules.rule_invoker import invoke_rule

def test_invoker_exception():
import subprocess
import sys
import shutil

def dump_data():
run_id = str(uuid.uuid4())
base_path = 'ts_output/rule_invoker/'
path = base_path + run_id
@@ -22,14 +26,50 @@ def test_invoker_exception():
generate_data(path=base_path, trial=run_id, num_tensors=num_tensors,
step=2, tname_prefix='foo', worker='algo-1', shape=(1,),
data=np.array([np.nan]))
return path

def test_invoker_exception():
path = dump_data()
tr = create_trial(path)
r = ExplodingTensor(tr)

c = 0
for start_step in range(2):
try:
invoke_rule(r, start_step=start_step, end_step=3, raise_rule_eval=True)
invoke_rule(r, start_step=start_step, end_step=3, raise_eval_cond=True)
except RuleEvaluationConditionMet as e:
c += 1
assert c == 2
assert c == 2
shutil.rmtree(path)


def test_invoker_rule_default_args():
path = dump_data()
rcode = subprocess.check_call([sys.executable, '-m', 'tornasole.rules.rule_invoker',
'--trial-dir', path,
'--rule-name', 'VanishingGradient',
'--end-step', '3'])
assert rcode == 0
shutil.rmtree(path)

def test_invoker_rule_pass_kwargs():
path = dump_data()
rcode = subprocess.check_call([sys.executable, '-m', 'tornasole.rules.rule_invoker',
'--rule-name', 'VanishingGradient',
'--trial-dir', path,
'--threshold', '0.001',
'--end-step', '3'])
assert rcode == 0
shutil.rmtree(path)

def test_invoker_rule_pass_other_trials():
path1 = dump_data()
path2 = dump_data()
rcode = subprocess.check_call([sys.executable, '-m', 'tornasole.rules.rule_invoker',
'--trial-dir', path1,
'--other-trials', path2,
'--rule-name', 'SimilarAcrossRuns',
'--end-step', '3'])
assert rcode == 0
shutil.rmtree(path1)
shutil.rmtree(path2)
49 changes: 49 additions & 0 deletions tests/analysis/rules/test_unchanged.py
@@ -0,0 +1,49 @@
from tests.analysis.utils import generate_data

from tornasole.rules.generic import UnchangedTensor
from tornasole.trials import create_trial
import uuid
import numpy as np
from tornasole.exceptions import *
from tornasole.rules.rule_invoker import invoke_rule

def test_unchanged():
    run_id = str(uuid.uuid4())
    base_path = 'ts_output/rule_invoker/'
    path = base_path + run_id

    num_tensors = 3

    shape = (10, 3, 2)
    generate_data(path=base_path, trial=run_id, num_tensors=num_tensors,
                  step=0, tname_prefix='foo', worker='algo-1', shape=shape,
                  data=np.ones(shape=shape))
    generate_data(path=base_path, trial=run_id, num_tensors=num_tensors,
                  step=1, tname_prefix='foo', worker='algo-1', shape=shape,
                  data=np.ones(shape=shape))
    generate_data(path=base_path, trial=run_id, num_tensors=num_tensors,
                  step=2, tname_prefix='foo', worker='algo-1', shape=shape,
                  data=np.ones(shape=shape))

    generate_data(path=base_path, trial=run_id, num_tensors=num_tensors,
                  step=5, tname_prefix='boo', worker='algo-1', shape=shape,
                  data=np.ones(shape=shape))

    tr = create_trial(path)
    r = UnchangedTensor(tr, tensor_regex='.*')

    invoke_rule(r, start_step=0, end_step=2, raise_eval_cond=True)

    try:
        invoke_rule(r, start_step=0, end_step=3, raise_eval_cond=True)
        assert False
    except RuleEvaluationConditionMet:
        pass

    try:
        invoke_rule(r, start_step=2, end_step=3, raise_eval_cond=True)
        assert False
    except RuleEvaluationConditionMet:
        pass

    invoke_rule(r, start_step=3, end_step=6, raise_eval_cond=True)
2 changes: 2 additions & 0 deletions tests/analysis/utils.py
@@ -18,6 +18,8 @@ def generate_data(path, trial, step, tname_prefix,
c = CollectionManager()
c.add("default")
c.get("default").tensor_names = ["foo_" + str(i) for i in range(num_tensors)]
c.add('gradients')
c.get("gradients").tensor_names = ["foo_" + str(i) for i in range(num_tensors)]
c.export(os.path.join(path, trial, "collections.ts"))


31 changes: 31 additions & 0 deletions tornasole/analysis/utils.py
@@ -37,3 +37,34 @@ def refresh(trials):
else:
trial = trials
trial.dynamic_refresh = False


def parse_list_from_str(arg, delimiter=','):
    """
    :param arg: string or list of strings
        if it is string it is treated as character delimited string
    :param delimiter: string
        if arg is a string, this delimiter is used to split the string
    :return: list of strings
    """
    if arg is None:
        rval = []
    if isinstance(arg, str):
        if len(arg) == 0:
            rval = []
        else:
            rval = arg.split(delimiter)
    return rval

def parse_bool(arg, default):
    if arg is None:
        return default
    elif arg in [False, True]:
        return arg
    elif arg == 'False':
        return False
    elif arg == 'True':
        return True
    else:
        raise ValueError('boolean argument expected, '
                         'but found {}'.format(arg))
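
The two helpers convert string-valued rule arguments back into Python values. The sketch below is a standalone variant for illustration, not the committed code: `to_str_list` and `to_bool` are hypothetical names. Unlike `parse_list_from_str` as it appears in this diff, which assigns `rval` only for `None` and string inputs, the variant passes a list through unchanged, as the docstring suggests was intended.

```python
def to_str_list(arg, delimiter=','):
    """Return a list of strings from None, a delimited string, or a list."""
    if arg is None or arg == '':
        return []
    if isinstance(arg, str):
        return arg.split(delimiter)
    # Assumption: an existing list is returned as a copy (hypothetical choice).
    return list(arg)


def to_bool(arg, default):
    """Return a bool from None, a bool, or the strings 'True' / 'False'."""
    if arg is None:
        return default
    if isinstance(arg, bool):
        return arg
    if arg in ('True', 'False'):
        return arg == 'True'
    raise ValueError('boolean argument expected, but found {}'.format(arg))


print(to_str_list('weights,ReluActivation,bias'))
print(to_bool('False', default=True))
```

A comma-separated value such as `weights,ReluActivation,bias` is the form the updated `tests/analysis/config.yaml` passes to `--collection_names`.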
17 changes: 17 additions & 0 deletions tornasole/core/tensor.py
@@ -1,5 +1,6 @@
from tornasole.core.reductions import get_numpy_reduction
from tornasole.core.modes import ModeKeys
import bisect
from tornasole.exceptions import *

from enum import Enum
@@ -223,3 +224,19 @@ def add_reduction_step(self, mode, mode_step, red_name, abs, red_value):
self._mode_steps[mode].set_step_reduction_value(mode_step,
red_name, abs, red_value)

    def prev_steps(self, step, n, mode=ModeKeys.GLOBAL):
        """
        returns n prev steps from step representing step number
        of given mode
        :param step: int
            step number
        :param n: int
            number of previous steps to return
        :param mode: value of the enum tornasole.modes
            modes.GLOBAL, modes.TRAIN, modes.EVAL, modes.PREDICT
        :return: a list of step numbers
        """
        steps = self.steps(mode=mode)
        i = bisect.bisect_right(steps, step)
        prev_steps = steps[:i]
        return prev_steps[-n:]
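
The lookup in `prev_steps` can be tried in isolation. The sketch below is a standalone function, named `last_n_steps` here only for illustration, that takes the sorted list of saved steps directly instead of reading it from a tensor object.

```python
import bisect


def last_n_steps(steps, step, n):
    """Return up to n saved steps that are <= step, from a sorted list."""
    # bisect_right gives the index just past the last entry <= step,
    # so `step` itself is included when it was saved.
    i = bisect.bisect_right(steps, step)
    return steps[:i][-n:]


saved = [0, 1, 2, 5]
print(last_n_steps(saved, step=5, n=2))  # includes step 5 itself
print(last_n_steps(saved, step=4, n=2))  # step 4 was never saved
```

Because the slice is taken over saved steps only, the returned steps need not be consecutive. Note also that with `n = 0` the slice `[-0:]` returns the whole list, so callers presumably pass a positive `n`.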
2 changes: 1 addition & 1 deletion tornasole/core/utils.py
@@ -117,4 +117,4 @@ def index(sorted_list, elem):
i = bisect.bisect_left(sorted_list, elem)
if i != len(sorted_list) and sorted_list[i] == elem:
return i
raise ValueError
raise ValueError
1 change: 0 additions & 1 deletion tornasole/rules/__init__.py
@@ -1,2 +1 @@
from .rule import RequiredTensors, Rule
from .rule_invoker import invoke_rule