Create a testing framework for example scripts and fix current ones #313
Conversation
Clever way to ensure we test on a small dataset!
Last step should be to create a job to run those :-)
examples/by_feature/checkpointing.py
Outdated
if isinstance(checkpointing_steps, int):
    overall_step += 1
Why not always do this?
It's only needed if we save via checkpointing steps rather than per epoch, so I didn't want people to assume we always need to do that.
(Which means a comment is needed!)
Yeah I get the variable won't be used if we don't checkpoint with steps, but it doesn't hurt to always have it (and would save one line of code).
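For context, here is a minimal sketch of the pattern under discussion, with the clarifying comment the review asked for. The loop body, directory names, and the assumption that checkpointing_steps is either an int or the string "epoch" mirror the example script only loosely; this is illustrative, not the actual checkpointing.py code.

```python
from accelerate import Accelerator


def training_loop(accelerator: Accelerator, train_dataloader, num_epochs, checkpointing_steps):
    """Illustrative only: `checkpointing_steps` is an int (save every N steps)
    or the string "epoch" (save once per epoch)."""
    overall_step = 0
    for epoch in range(num_epochs):
        for batch in train_dataloader:
            # ... forward / backward / optimizer.step() elided ...

            # overall_step is only needed when checkpointing every N steps,
            # hence the isinstance guard (and the request for a comment here).
            if isinstance(checkpointing_steps, int):
                overall_step += 1
                if overall_step % checkpointing_steps == 0:
                    accelerator.save_state(f"step_{overall_step}")

        if checkpointing_steps == "epoch":
            accelerator.save_state(f"epoch_{epoch}")
```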
I've come up with a solution to make sure that the main() (argparse) section is also checked. Note: full trace removed, see the following message for an example.

======================================================================= short test summary info =======================================================================
SUBFAIL tests/test_examples.py::ExampleDifferenceTests::test_complete_cv_example_body - AssertionError: 14 != 0
SUBFAIL tests/test_examples.py::ExampleDifferenceTests::test_complete_cv_example_body - AssertionError: 27 != 0
SUBFAIL tests/test_examples.py::ExampleDifferenceTests::test_complete_cv_example_parser - AssertionError: 1 != 0
SUBFAIL tests/test_examples.py::ExampleDifferenceTests::test_complete_cv_example_parser - AssertionError: 4 != 0
SUBFAIL tests/test_examples.py::ExampleDifferenceTests::test_complete_nlp_example_parser - AssertionError: 1 != 0
SUBFAIL tests/test_examples.py::ExampleDifferenceTests::test_complete_nlp_example_parser - AssertionError: 1 != 0
=============================================================== 6 failed, 4 passed, 1 warning in 1.64s ================================================================
Slightly tweaked how they look; I've now included the source code diffs in the output. I believe this is a must, as it will tell the user exactly which parts are missing from the full example:

=============================================================================== FAILURES ===============================================================================
_________ ExampleDifferenceTests.test_cv_example (feature_script='tracking.py', tested_script='complete_cv_example.py', tested_section='training_function()') _________
self = <test_examples.ExampleDifferenceTests testMethod=test_cv_example>, complete_file_name = 'complete_cv_example.py', parser_only = False
    def one_complete_example(self, complete_file_name: str, parser_only: bool):
        """
        Tests a single `complete` example against all of the implemented `by_feature` scripts

        Args:
            complete_file_name (`str`):
                The filename of a complete example
            parser_only (`bool`):
                Whether to look at the main training function, or the argument parser
        """
        self.maxDiff = None
        by_feature_path = os.path.abspath(os.path.join("examples", "by_feature"))
        examples_path = os.path.abspath("examples")
        for item in os.listdir(by_feature_path):
            item_path = os.path.join(by_feature_path, item)
            if os.path.isfile(item_path) and ".py" in item_path:
                with self.subTest(tested_script=complete_file_name, feature_script=item, tested_section="main()" if parser_only else "training_function()"):
                    diff = compare_against_test(
                        os.path.join(examples_path, "nlp_example.py"),
                        os.path.join(examples_path, complete_file_name),
                        item_path,
                        parser_only
                    )
>                   self.assertEqual('\n'.join(diff), '')
E AssertionError: ' accelerator = Accelerator(cpu=arg[693 chars]()\n' != ''
E - accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision, log_with="all")
E -
E - accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
E -
E - accelerator.init_trackers("nlp_example", config)
E -
E - predictions, references = accelerator.gather((predictions, batch["labels"]))
E -
E - predictions=predictions,
E -
E - references=references,
E -
E - accelerator.log(
E -
E - {
E -
E - "accuracy": eval_metric["accuracy"],
E -
E - "f1": eval_metric["f1"],
E -
E - "train_loss": total_loss,
E -
E - "epoch": epoch,
E -
E - }
E -
E - accelerator.end_training()
tests/test_examples.py:107: AssertionError
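To make the failure output above easier to interpret: conceptually, each line of the feature example that the complete example does not contain gets reported in the assertion diff. The sketch below is a deliberately simplified illustration of that idea; it is not the actual compare_against_test implementation, and the missing_lines name is made up here.

```python
from typing import List


def missing_lines(feature_script: str, complete_script: str) -> List[str]:
    """Return non-blank lines of the feature script that the complete script lacks.

    Rough illustration only: the real check also strips argparse/template
    boilerplate and skips special-cased strings before comparing.
    """
    with open(feature_script) as f:
        feature = [line.strip() for line in f if line.strip()]
    with open(complete_script) as f:
        complete = {line.strip() for line in f if line.strip()}
    # Each reported line becomes one of the "- accelerator.log(..." entries
    # in the assertion diff shown above.
    return [line for line in feature if line not in complete]
```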
Not sure I understand all the new tests you added but it looks nice 😅
self.one_complete_example("complete_nlp_example.py", True)
self.one_complete_example("complete_nlp_example.py", False)
It seems we always use the method with both flags. Should we just remove that arg and put the two tests inside?
I'm making a note of this in the documentation, but the reasoning for the separation is that it makes a test failure more readable as to which section failed, rather than producing one combined error.

Notice the tested_section part. E.g.:

_________ ExampleDifferenceTests.test_cv_example (feature_script='tracking.py', tested_script='complete_cv_example.py', tested_section='training_function()') _________

vs:

_________ ExampleDifferenceTests.test_cv_example (feature_script='tracking.py', tested_script='complete_cv_example.py', tested_section='main()') _________

(This is a pytest-subtests hack.)
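For readers unfamiliar with the trick: the keyword arguments passed to unittest's self.subTest() are echoed in the reported failure, which is what produces the tested_section='main()' vs tested_section='training_function()' labels above when the suite runs under pytest with pytest-subtests. A minimal, self-contained sketch (the feature script list and the stubbed-out diff are placeholders, not the real test):

```python
import unittest


class ExampleDifferenceTests(unittest.TestCase):
    def one_complete_example(self, complete_file_name: str, parser_only: bool):
        # The subTest() keyword arguments show up in the failure report, so
        # main() and training_function() failures are labelled separately.
        section = "main()" if parser_only else "training_function()"
        for feature_script in ["tracking.py", "checkpointing.py"]:  # illustrative list
            with self.subTest(tested_script=complete_file_name, feature_script=feature_script, tested_section=section):
                diff = []  # a real test would call compare_against_test(...) here
                self.assertEqual("\n".join(diff), "")

    def test_cv_example(self):
        self.one_complete_example("complete_cv_example.py", True)
        self.one_complete_example("complete_cv_example.py", False)


if __name__ == "__main__":
    unittest.main()
```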
Co-authored-by: Sylvain Gugger <[email protected]>
Add tests for example scripts, and fix current ones
What does this add?
Since the DataLoaders are separated out to a function in the by_feature tests, we can write tests for all of the examples to prove that they run all the way through and that they output the data we're interested in. These differ slightly from the transformers examples tests, as we don't get good performance out of these due to the defaults set in each script. Instead we focus on:

Furthermore, this adds a rather complex way for us to see whether the complete_*_example.py scripts are out of date or not. This is an automated process that checks for diffs between the complete example, the template, and each of the by_feature examples. If a diff is found, it tells the user that some lines are missing, prompting them to fix it (via pytest). An example error log can be seen in the subsequent messages.

Who is it for?

Why is it needed?

I chose this complex automated process as very quickly we've sectioned out 3 areas of potential chaos when it comes to the scripts, one of them being the by_feature scripts.

By adding a check directly to our regular pytest tests, if any behavior gets changed or a new by_feature script gets added, we immediately know whether a complete_* script should be changed and what needs to be added. As a result, the complete_* scripts now exactly mimic the feature scripts in their usage behaviors, which is a good thing.

These checks are also exceedingly quick (a second or two at most).

Basic usage examples:

When adding a new example to the test_examples.py script, if a new base script was made, it should be added under a new test name, such as test_tabular_examples, and have two calls to one_complete_example. The first should check for diffs in the main (argparse) section, the second in the raw training loop (training_function). This looks like the sketch at the end of this description.

If you do have a new example that has a new base script (e.g. cv_example and complete_cv_example), the path to that base script should be passed in. Also, if there are any differences in the way things are logged, accuracies are reported, etc., they should be added to a "special_strings" array so the check knows they should be ignored.

Anticipated maintenance burden? (What will happen in, say, 3 months if something changes)

Most likely sometime in the next few months we will want to add functionality that separates which features are checked in the complete examples vs which are not (such as a cross validation example).
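As a rough sketch of the usage described above: the test_tabular_examples name and complete_tabular_example.py file are hypothetical, and the test method names are assumed; only the one_complete_example calls with True/False mirror code actually shown earlier in this thread. The fragment assumes the one_complete_example helper from the traceback above.

```python
import unittest


class ExampleDifferenceTests(unittest.TestCase):
    # ... one_complete_example(self, complete_file_name, parser_only) as shown
    # in the traceback earlier in this thread ...

    def test_nlp_examples(self):
        self.one_complete_example("complete_nlp_example.py", True)   # main() / argparse
        self.one_complete_example("complete_nlp_example.py", False)  # training_function()

    def test_tabular_examples(self):
        # Hypothetical new base script: same pattern, one call per tested section.
        self.one_complete_example("complete_tabular_example.py", True)
        self.one_complete_example("complete_tabular_example.py", False)
```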