Standardize rule constructor to take trials/strings. Improve rule invoker (aws#77)

* WIP, added tf and core
* WIP
* add code from all repos, and fix imports
* fix more imports, add tests
* add docs, examples
* fix imports in examples
* fix setup.py and CI
* fix test invoker
* Reload a step directory when it was last seen as empty (aws#117)
* fix imports
* fix new imports
* unskip test
* Add setup.py
* undo end of training merge
* remove import
* Add training end code
* add frameworks
* fix function used
* update setup to use append
* fixing small errors (aws#74)
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* trigger ci
* trigger ci
* trigger ci
* trigger ci
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* testing
* uploading test reports to s3
* uploading test reports to s3
* uploading test reports to s3
* uploading test reports to s3
* changes
* changes
* docs
* Add subpackages in core
* docs and examples
* provides trials and rules as part of main namescope
* move rules and trials outside
* fix training end tests, and update setup.py
* new readme for whole repo
* fix setup.py
* update packages
* make the mxnet tests faster
* reduce length of integration tests
* add script to build binaries
* update argument
* change num steps and frequency
* delete path
* add boto3
* fix training end tests
* changes
* refactor rules, rule_invoker to take any argument from the rule
* cleanup
* move exceptions to its own module
* fix links
* update readme
* update version string in setup.py
* uncommented test
* making the pytorch stuff up to date (aws#79)
* making the pytorch stuff up to date
* reverting util.py
* fixing the hook imports
* fixing test imports
* fix increment of step
* training_has_ended fix for pytorch (aws#80)
* making the pytorch stuff up to date
* Revert "making the pytorch stuff up to date" This reverts commit f87f9560b5351f135553072c495f2123964b9f3c.
* changing to training_has_ended
* fix comparison in allzero
* Unchanged tensor rule (aws#82)
* WIP unchanged tensor
* WIP
* WIP2
* add test for the rule
* fix steps mess in rule
* add rule invocation to readme
* add doc
* review comments
* remove new exception
atqy · Aug 14, 2019 · commit 36da831 · 1 parent 494498e
Showing 14 changed files with 305 additions and 107 deletions.
diff --git a/docs/analysis/README.md b/docs/analysis/README.md
@@ -285,6 +285,12 @@ class VanishingGradientRule(Rule):
         super().__init__(base_trial, other_trials=None)
         self.threshold = threshold
 ```
+
+Note that, apart from `base_trial` and `other_trials` (if required), every argument of the
+rule constructor must accept a string value. To pass a list of strings, for example,
+pass them as a single comma-separated string. This restriction lets you create and
+invoke rules from JSON using SageMaker's APIs.
+
 ##### RequiredTensors
 
 Next you need to implement a method which lets Tornasole know what tensors you 
@@ -443,6 +449,32 @@ tensor_regex = ['input*']
 allzero = AllZero(base_trial=trial_obj, collection_names=collections, tensor_regex=tensor_regex)
 ```
 
+##### UnchangedTensor
+This rule detects tensors that do not change across steps.
+It uses `numpy.allclose` to check whether a tensor is unchanged.
+It takes the following arguments:
+
+- `base_trial`: The trial whose execution will invoke the rule. The rule will inspect the tensors gathered during this trial.
+- `collection_names`: The list of collection names. The rule will inspect the tensors that belong to these collections.
+- `tensor_regex`: The list of regex patterns. The rule will inspect the tensors that match the regex patterns specified in this list.
+- `num_steps`: Number of steps across which the rule checks whether the tensor has changed.
+The rule checks the last `num_steps` steps that are available;
+they need not be consecutive.
+If `num_steps` is 2, at step `s` it does not necessarily compare steps `s-1` and `s`.
+If `s-1` is not available, it compares the last available step with `s`.
+- `rtol`: The relative tolerance parameter, as a float, to be passed to `numpy.allclose`.
+- `atol`: The absolute tolerance parameter, as a float, to be passed to `numpy.allclose`.
+- `equal_nan`: Whether to compare NaNs as equal. If True, NaNs in `a` are considered
+equal to NaNs in `b`. This is passed to `numpy.allclose`.
+
+For this rule, users must specify either the `collection_names` or `tensor_regex` parameter.
+If both parameters are specified, the rule inspects the union of the matching tensors.
+
+```
+from tornasole.rules.generic import UnchangedTensor
+ut = UnchangedTensor(base_trial=trial_obj, tensor_regex=['.*'], num_steps=3)
+```
+
 ## Examples
 We have a few example scripts and notebooks to help you get started. Please go to the `examples` folder. 
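
The `UnchangedTensor` description above can be sketched without Tornasole. This is a minimal illustration under stated assumptions, not the rule's implementation: `is_close`, `unchanged` and the `history` dict are hypothetical names, the element test is the one documented for `numpy.allclose` (`|a - b| <= atol + rtol * |b|`), and waiting until `num_steps` steps are available before reporting is an assumption.

```python
import math


def is_close(a, b, rtol=1e-05, atol=1e-08, equal_nan=False):
    # Element test documented for numpy.allclose: |a - b| <= atol + rtol * |b|.
    if math.isnan(a) or math.isnan(b):
        return equal_nan and math.isnan(a) and math.isnan(b)
    if a == b:  # also covers equal infinities
        return True
    if math.isinf(a) or math.isinf(b):
        return False
    return abs(a - b) <= atol + rtol * abs(b)


def unchanged(history, step, num_steps=3, **tolerances):
    # history: hypothetical dict mapping step number -> flat list of floats.
    # Take the last num_steps steps available at or before `step`;
    # they need not be consecutive.
    available = sorted(s for s in history if s <= step)[-num_steps:]
    if len(available) < num_steps:
        return False  # assumption: wait until enough steps have been saved
    for prev, cur in zip(available, available[1:]):
        old, new = history[prev], history[cur]
        if len(old) != len(new):
            return False
        if not all(is_close(x, y, **tolerances) for x, y in zip(new, old)):
            return False
    return True


history = {0: [1.0, 1.0], 1: [1.0, 1.0], 4: [1.0, 1.0]}
print(unchanged(history, step=1))               # only two steps saved so far
print(unchanged(history, step=4))               # compares steps 0, 1 and 4
print(unchanged(history, step=4, num_steps=2))  # compares steps 1 and 4
```

Step 4 is compared with step 1 even though steps 2 and 3 were never saved, which is the "need not be consecutive" behaviour described for `num_steps`.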
 

diff --git a/examples/tensorflow/training_scripts/resnet50/README.md b/examples/tensorflow/training_scripts/resnet50/README.md
@@ -112,7 +112,8 @@ python train_imagenet_resnet_hvd.py --clear_log --enable_tornasole \
     --tornasole_path ~/ts_outputs/vanishing  
 ``` 
 
-You can monitor the exploding tensors by doing the following
+##### Rule: VanishingGradient
+You can monitor vanishing gradients by doing the following
 ```
 python -m tornasole.rules.rule_invoker --trial-dir ~/ts_outputs/vanishing --rule-name VanishingGradient
 ``` 
@@ -141,7 +142,7 @@ python train_imagenet_resnet_hvd.py --clear_log --enable_tornasole \
     --tornasole_step_interval 1 \
     --tornasole_path ~/ts_outputs/weights
 ```
-
+##### Rule: WeightUpdateRatio 
 You can invoke the rule to 
 monitor the ratio of weights to updates every step. 
 A quick way to invoke the rule is like this: 
@@ -152,6 +153,13 @@ If you want to customize the thresholds, refer to the example in `analysis`:
 [`examples/analysis/scripts/weight_update_ratio.py`](examples/analysis/scripts/weight_update_ratio.py) 
 and the [Rule API](docs/analysis/README.md#rules-api)
 
+##### Rule: UnchangedTensor
+You can also invoke this rule to 
+monitor if tensors are not changing at every step. Here we are passing '.*' as the tensor_regex to monitor all tensors.
+```
+python -m tornasole.rules.rule_invoker --trial-dir ~/ts_outputs/weights --rule-name UnchangedTensor --tensor_regex '.*'
+```
+
 #### Running with tornasole disabled
 ```
 python train_imagenet_resnet_hvd.py --clear_log

diff --git a/tests/analysis/config.yaml b/tests/analysis/config.yaml
@@ -131,15 +131,15 @@
   - [*mnist_gluon_basic_hook_demo,
      --random_seed True --num_steps 5,
      *invoker,
-     --rule_name allzero --flag False --end_step 3 --collection weights
+     --rule_name allzero --flag False --end_step 3 --collection_names weights
     ]
 - # all_zero/mxnet/true
   - mxnet
   - *Enable
   - [*mnist_gluon_all_zero_demo,
      --random_seed True --num_steps 5,
      *invoker,
-     --rule_name allzero --flag True --end_step 3 --collections weights --collections ReluActivation --collections bias
+     --rule_name allzero --flag True --end_step 3 --collection_names weights,ReluActivation,bias
     ]
 
 # test cases for pytorch

diff --git a/tests/analysis/invoker.py b/tests/analysis/invoker.py
@@ -1,10 +1,13 @@
 from tornasole.exceptions import *
 from tornasole.core.utils import get_logger
+from tornasole.rules.rule_invoker import create_rule
+
 logger = get_logger()
 
 def invoke_rule(rule_obj, flag, start_step, end_step):
     step = start_step if start_step is not None else 0
-    logger.info('Started execution of rule {}'.format(type(rule_obj).__name__))
+    logger.info('Started execution of rule {} at step {}'
+                .format(type(rule_obj).__name__, step))
     return_false = False
     while (end_step is None) or (step < end_step): # if end_step is not provided, do infinite checking
         try:
@@ -29,58 +32,31 @@ def invoke_rule(rule_obj, flag, start_step, end_step):
     # if flag is False, return_false should be True after the loop
     if flag == 'False':
         assert return_false
-    logger.info('Ending execution of rule {} with step={} '.format(rule_obj.__class__.__name__, step))
+    logger.info('Ending execution of rule {} with step={} '
+                .format(rule_obj.__class__.__name__, step - 1))
 
 
 if __name__ == '__main__':
   import argparse
-  from tornasole.trials import create_trial
-
-  parser = argparse.ArgumentParser()
-  parser.add_argument('--tornasole_path', type=str)
-  parser.add_argument('--rule_name', type=str)
+  parser = argparse.ArgumentParser(description='Rule invoker takes the below arguments and '
+                                               'any argument taken by the rules. The arguments not '
+                                               'mentioned below are automatically passed when '
+                                               'creating the rule objects.')
+  parser.add_argument('--tornasole_path', type=str, required=True)
+  parser.add_argument('--rule_name', type=str, required=True)
+  parser.add_argument('--other-trials', type=str,
+                      help='comma separated paths for '
+                           'other trials taken by the rule')
   parser.add_argument('--start_step', type=int)
   parser.add_argument('--end_step', type=int)
   parser.add_argument('--flag', type=str, default=None)
-
-  parser.add_argument('--weightupdateratio_large_threshold', type=float, default=10)
-  parser.add_argument('--weightupdateratio_small_threshold', type=float, default=0.00000001)
-
-  parser.add_argument('--vanishinggradient_threshold', type=float, default=0.0000001)
-  parser.add_argument('--collections', default=[], type=str, action='append',
-                      help="""List of collection names. The rule will inspect tensors belonging to those collections. Required for allzero 
-                      rule.""")
-  parser.add_argument('--tensor-regex', default=[], type=str, action='append',
-                      help="""List of regex patterns. The rule will inspect tensors that match these 
-                      patterns. Required for allzero 
-                      rule.""")
+  parsed, unknown = parser.parse_known_args()
+  for arg in unknown:
+    if arg.startswith('--'):
+      parser.add_argument(arg, type=str)
   args = parser.parse_args()
-  if args.rule_name is None:
-    raise RuntimeError('Needs rule name to invoke')
-
-  tr = create_trial(args.tornasole_path, range_steps=(args.start_step, args.end_step))
-  if args.rule_name.lower() == 'vanishinggradient':
-    from tornasole.rules.generic import VanishingGradient
-    r = VanishingGradient(tr, threshold=args.vanishinggradient_threshold)
-  elif args.rule_name.lower() == 'explodingtensor':
-    from tornasole.rules.generic import ExplodingTensor
-    r = ExplodingTensor(tr)
-  elif args.rule_name.lower() == 'weightupdateratio':
-    from tornasole.rules.generic import WeightUpdateRatio
-    r = WeightUpdateRatio(tr,
-                          large_threshold=args.weightupdateratio_large_threshold,
-                          small_threshold=args.weightupdateratio_small_threshold)
-  elif args.rule_name.lower() == 'allzero':
-    if len(args.collections) == 0 and len(args.tensor_regex) == 0:
-      raise ValueError('Please provide either the list of collection names or list of regex patterns for invoking '
-                       'this rule.')
-    from tornasole.rules.generic import AllZero
-    r = AllZero(tr, args.collections, args.tensor_regex)
-  else:
-    raise ValueError('Please invoke any rules which take multiple trials, '
-                     'or custom rules by passing the rule object to '
-                     'invoke_rule() function. We do not currently '
-                     'support running such rules from this python script.'
-                     'Please refer to examples/scripts/ for examples'
-                     'on how to call invoke_rule')
+  args_dict = vars(args)
+  # to standardize args for create_rule function
+  args.trial_dir = args.tornasole_path
+  r = create_rule(args, args_dict)
   invoke_rule(r, flag=args.flag, start_step=args.start_step, end_step=args.end_step)
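
The two-pass parsing used in this script can be tried in isolation. The sketch below uses only the standard `argparse` module; `parse_rule_args` and the `rule_kwargs` filter are hypothetical names for illustration, and how `create_rule` actually selects constructor arguments is not shown in this diff.

```python
import argparse


def parse_rule_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument('--rule_name', type=str, required=True)
    parser.add_argument('--end_step', type=int)

    # First pass: collect everything the parser does not recognise.
    _, unknown = parser.parse_known_args(argv)

    # Register each unknown flag as a string option, once.
    seen = set()
    for token in unknown:
        if token.startswith('--'):
            flag = token.split('=', 1)[0]  # tolerate --flag=value
            if flag not in seen:
                seen.add(flag)
                parser.add_argument(flag, type=str)

    # Second pass: every flag is now known, so strict parsing succeeds.
    return vars(parser.parse_args(argv))


opts = parse_rule_args(['--rule_name', 'VanishingGradient',
                        '--threshold', '0.001', '--end_step', '3'])
# Assumed filter: whatever is not an invoker option goes to the rule, as strings.
rule_kwargs = {k: v for k, v in opts.items()
               if k not in ('rule_name', 'end_step') and v is not None}
print(rule_kwargs)  # {'threshold': '0.001'}
```

Because the dynamically added options are all declared with `type=str`, the rule constructor receives strings and has to convert them itself, which matches the string-only constructor convention introduced in the docs.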
diff --git a/tests/analysis/rules/test_invoker.py b/tests/analysis/rules/test_invoker.py
@@ -7,7 +7,11 @@
 from tornasole.exceptions import *
 from tornasole.rules.rule_invoker import invoke_rule
 
-def test_invoker_exception():
+import subprocess
+import sys
+import shutil
+
+def dump_data():
   run_id = str(uuid.uuid4())
   base_path = 'ts_output/rule_invoker/'
   path = base_path + run_id
@@ -22,14 +26,50 @@ def test_invoker_exception():
   generate_data(path=base_path, trial=run_id, num_tensors=num_tensors,
                 step=2, tname_prefix='foo', worker='algo-1', shape=(1,),
                 data=np.array([np.nan]))
+  return path
 
+def test_invoker_exception():
+  path = dump_data()
   tr = create_trial(path)
   r = ExplodingTensor(tr)
 
   c = 0
   for start_step in range(2):
     try:
-      invoke_rule(r, start_step=start_step, end_step=3, raise_rule_eval=True)
+      invoke_rule(r, start_step=start_step, end_step=3, raise_eval_cond=True)
     except RuleEvaluationConditionMet as e:
       c += 1
-  assert c == 2
+  assert c == 2
+  shutil.rmtree(path)
+
+
+def test_invoker_rule_default_args():
+  path = dump_data()
+  rcode = subprocess.check_call([sys.executable, '-m', 'tornasole.rules.rule_invoker',
+                                '--trial-dir', path,
+                                '--rule-name', 'VanishingGradient',
+                                '--end-step', '3'])
+  assert rcode == 0
+  shutil.rmtree(path)
+
+def test_invoker_rule_pass_kwargs():
+  path = dump_data()
+  rcode = subprocess.check_call([sys.executable, '-m', 'tornasole.rules.rule_invoker',
+                                '--rule-name', 'VanishingGradient',
+                                 '--trial-dir', path,
+                                 '--threshold', '0.001',
+                                 '--end-step', '3'])
+  assert rcode == 0
+  shutil.rmtree(path)
+
+def test_invoker_rule_pass_other_trials():
+  path1 = dump_data()
+  path2 = dump_data()
+  rcode = subprocess.check_call([sys.executable, '-m', 'tornasole.rules.rule_invoker',
+                                '--trial-dir', path1,
+                                '--other-trials', path2,
+                                '--rule-name', 'SimilarAcrossRuns',
+                                '--end-step', '3'])
+  assert rcode == 0
+  shutil.rmtree(path1)
+  shutil.rmtree(path2)
diff --git a/tests/analysis/rules/test_unchanged.py b/tests/analysis/rules/test_unchanged.py
@@ -0,0 +1,49 @@
+from tests.analysis.utils import generate_data
+
+from tornasole.rules.generic import UnchangedTensor
+from tornasole.trials import create_trial
+import uuid
+import numpy as np
+from tornasole.exceptions import *
+from tornasole.rules.rule_invoker import invoke_rule
+
+def test_unchanged():
+  run_id = str(uuid.uuid4())
+  base_path = 'ts_output/rule_invoker/'
+  path = base_path + run_id
+
+  num_tensors = 3
+
+  shape = (10, 3, 2)
+  generate_data(path=base_path, trial=run_id, num_tensors=num_tensors,
+                step=0, tname_prefix='foo', worker='algo-1', shape=shape,
+                data=np.ones(shape=shape))
+  generate_data(path=base_path, trial=run_id, num_tensors=num_tensors,
+                step=1, tname_prefix='foo', worker='algo-1', shape=shape,
+                data=np.ones(shape=shape))
+  generate_data(path=base_path, trial=run_id, num_tensors=num_tensors,
+                step=2, tname_prefix='foo', worker='algo-1', shape=shape,
+                data=np.ones(shape=shape))
+
+  generate_data(path=base_path, trial=run_id, num_tensors=num_tensors,
+                step=5, tname_prefix='boo', worker='algo-1', shape=shape,
+                data=np.ones(shape=shape))
+
+  tr = create_trial(path)
+  r = UnchangedTensor(tr, tensor_regex='.*')
+
+  invoke_rule(r, start_step=0, end_step=2, raise_eval_cond=True)
+
+  try:
+    invoke_rule(r, start_step=0, end_step=3, raise_eval_cond=True)
+    assert False
+  except RuleEvaluationConditionMet:
+    pass
+
+  try:
+    invoke_rule(r, start_step=2, end_step=3, raise_eval_cond=True)
+    assert False
+  except RuleEvaluationConditionMet:
+    pass
+
+  invoke_rule(r, start_step=3, end_step=6, raise_eval_cond=True)
diff --git a/tests/analysis/utils.py b/tests/analysis/utils.py
@@ -18,6 +18,8 @@ def generate_data(path, trial, step, tname_prefix,
         c = CollectionManager()
         c.add("default")
         c.get("default").tensor_names = ["foo_" + str(i) for i in range(num_tensors)]
+        c.add('gradients')
+        c.get("gradients").tensor_names = ["foo_" + str(i) for i in range(num_tensors)]
         c.export(os.path.join(path, trial, "collections.ts"))
 
 

diff --git a/tornasole/analysis/utils.py b/tornasole/analysis/utils.py
@@ -37,3 +37,34 @@ def refresh(trials):
     else:
         trial = trials
         trial.dynamic_refresh = False
+
+
+def parse_list_from_str(arg, delimiter=','):
+    """
+    :param arg: string or list of strings
+    if it is string it is treated as character delimited string
+    :param delimiter: string
+    if arg is a string, this delimiter is used to split the string
+    :return: list of strings
+    """
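+    if arg is None:
+        return []
+    if isinstance(arg, str):
+        # an empty string means "no items", not ['']
+        return arg.split(delimiter) if arg else []
+    # per the docstring, arg may already be a list of strings;
+    # return a copy so callers can modify it safely
+    return list(arg)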
+
+def parse_bool(arg, default):
+    if arg is None:
+        return default
+    elif arg in [False, True]:
+        return arg
+    elif arg == 'False':
+        return False
+    elif arg == 'True':
+        return True
+    else:
+        raise ValueError('boolean argument expected, '
+                         'but found {}'.format(arg))
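
As a rough illustration of how a rule constructor might use helpers like these to honour the string-only convention, the sketch below defines simplified stand-ins for both helpers so that it runs on its own. `ExampleRule` and its parameters are hypothetical and not part of Tornasole.

```python
def parse_list_from_str(arg, delimiter=','):
    # Simplified stand-in for the helper in the diff above.
    if not arg:
        return []
    return arg.split(delimiter) if isinstance(arg, str) else list(arg)


def parse_bool(arg, default):
    # Same mapping as the helper in the diff above.
    if arg is None:
        return default
    if arg in (True, False):
        return arg
    if arg in ('True', 'False'):
        return arg == 'True'
    raise ValueError('boolean argument expected, but found {}'.format(arg))


class ExampleRule:
    """Hypothetical rule; only the argument handling is of interest."""

    def __init__(self, collection_names=None, threshold='0.001', equal_nan=None):
        # Every argument arrives as a string (for example from JSON or the CLI)
        # and is converted to its working type here.
        self.collection_names = parse_list_from_str(collection_names)
        self.threshold = float(threshold)
        self.equal_nan = parse_bool(equal_nan, default=False)


rule = ExampleRule(collection_names='weights,ReluActivation,bias',
                   threshold='0.5', equal_nan='True')
print(rule.collection_names, rule.threshold, rule.equal_nan)
```

Doing the conversion inside the constructor keeps the invoker generic: it can forward any flag as a string without knowing the rule's argument types.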
diff --git a/tornasole/core/tensor.py b/tornasole/core/tensor.py
@@ -1,5 +1,6 @@
 from tornasole.core.reductions import get_numpy_reduction
 from tornasole.core.modes import ModeKeys
+import bisect
 from tornasole.exceptions import *
 
 from enum import Enum
@@ -223,3 +224,19 @@ def add_reduction_step(self, mode, mode_step, red_name, abs, red_value):
         self._mode_steps[mode].set_step_reduction_value(mode_step,
                                                         red_name, abs, red_value)
 
+    def prev_steps(self, step, n, mode=ModeKeys.GLOBAL):
+        """
+        returns the last n available steps up to and including step,
+        where step is a step number of the given mode
+        :param step: int
+        step number
+        :param n: int
+        number of previous steps to return
+        :param mode: value of the enum tornasole.modes
+        modes.GLOBAL, modes.TRAIN, modes.EVAL, modes.PREDICT
+        :return: a list of step numbers
+        """
+        steps = self.steps(mode=mode)
+        i = bisect.bisect_right(steps, step)
+        prev_steps = steps[:i]
+        return prev_steps[-n:]
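
The lookup in `prev_steps` can be exercised on a plain sorted list. This standalone sketch mirrors the `bisect_right` logic from the diff; the free function taking `steps` as a parameter is an adaptation for illustration, and it assumes `n >= 1`.

```python
import bisect


def prev_steps(steps, step, n):
    # steps: sorted list of step numbers for which the tensor was saved.
    i = bisect.bisect_right(steps, step)  # index just past the last entry <= step
    return steps[:i][-n:]                 # assumes n >= 1; [-0:] would return everything


saved = [0, 1, 2, 5]
print(prev_steps(saved, 5, 2))  # [2, 5]
print(prev_steps(saved, 4, 2))  # [1, 2]
print(prev_steps(saved, 1, 3))  # [0, 1]
```

Because `bisect_right` is used, `step` itself is included when it was saved, and fewer than `n` steps are returned when not enough are available.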
diff --git a/tornasole/core/utils.py b/tornasole/core/utils.py
@@ -117,4 +117,4 @@ def index(sorted_list, elem):
     i = bisect.bisect_left(sorted_list, elem)
     if i != len(sorted_list) and sorted_list[i] == elem:
         return i
-    raise ValueError
+    raise ValueError
diff --git a/tornasole/rules/__init__.py b/tornasole/rules/__init__.py
@@ -1,2 +1 @@
 from .rule import RequiredTensors, Rule
-from .rule_invoker import invoke_rule