
Create a testing framework for example scripts and fix current ones #313

Merged 28 commits into main on Apr 13, 2022

Conversation

muellerzr
Collaborator

@muellerzr muellerzr commented Apr 12, 2022

Add tests for example scripts, and fix current ones

What does this add?

Since the DataLoaders are separated out into a function in the by_feature scripts, we can write tests for all of the examples to prove whether they can run all the way through and whether they output the data we're interested in.

These differ slightly from the transformers examples tests, as we don't get good performance out of these scripts due to the defaults each one sets. Instead we focus on:

  1. Whether the script can run all the way through
  2. Whether the specific feature we are looking for produces the output we want (such as a checkpoint file, evidence of a log, etc.); see the sketch after this list
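For illustration only, here is a minimal sketch of what such an end-to-end check can look like. The script path, CLI flags, and checkpoint folder naming below are assumptions made for this example, not the actual code in test_examples.py:

import os
import subprocess
import sys
import tempfile
import unittest


class ExampleRunsTest(unittest.TestCase):
    def test_checkpointing_example_runs(self):
        # Run a by_feature example end to end on its (tiny) defaults, then check
        # that the feature-specific artifact exists rather than asserting on
        # model quality.
        script = os.path.join("examples", "by_feature", "checkpointing.py")
        with tempfile.TemporaryDirectory() as tmpdir:
            subprocess.run([sys.executable, script, "--output_dir", tmpdir], check=True)
            # Assumed naming: the script writes per-epoch checkpoint folders.
            self.assertTrue(any("epoch" in name for name in os.listdir(tmpdir)))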

Furthermore, this adds a way for us to see whether the complete_*_example.py scripts are out of date. This is an automated process that checks for diffs between the complete script, the template, and each of the by_feature examples. If any lines are missing, the user is told which ones, prompting them to fix it (via pytest). An example error log can be seen in the subsequent messages.
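The real logic lives in the compare_against_test helper used in tests/test_examples.py (shown in the failure trace further down). The rough idea, with simplified helper names that are assumptions for this sketch rather than the actual implementation, is:

def missing_feature_lines(base_path, complete_path, feature_path):
    # Sketch only: collect the lines a by_feature script adds on top of the
    # shared base template, then report any of those lines that the complete_*
    # example does not contain.
    def read_lines(path):
        with open(path) as f:
            return f.readlines()

    base = set(read_lines(base_path))
    complete = set(read_lines(complete_path))
    feature_lines = read_lines(feature_path)

    feature_only = [line for line in feature_lines if line not in base]
    return [line for line in feature_only if line not in complete]

The real helper additionally restricts the comparison to either the argument parser or the training function (the parser_only flag that appears in the examples below) and drops the special_strings that are expected to differ.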

Who is it for?

  • Maintainers, as a way for us to easily check whether our examples can run and whether any scripts have become outdated
  • Users, as they have the peace of mind of knowing that all the examples inside our repo are up to date and work as intended

Why is it needed?

I chose this complex automated process because we very quickly identified three areas of potential chaos when it comes to the scripts:

  • The base scripts
  • The complete scripts
  • The by_feature scripts

By adding a check directly to our regular pytest tests, if any behavior changes or a new by_feature script is added, we immediately know whether a complete_* script should be changed and what needs to be added.

As a result, the complete_* scripts now exactly mimic the feature scripts in their usage behavior, which is a good thing.

These checks are also exceedingly quick (a second or two at most).

Basic usage examples:

When adding a new example to the test_examples.py script: if a new base script was made, it should be added under a new test name, such as test_tabular_examples, with two calls to one_complete_example. The first should check for diffs in the argument parser (main()), the second in the raw training loop (training_function()). This looks like so:

def test_tabular_examples(self):
    self.one_complete_example("complete_tabular_example.py", True)
    self.one_complete_example("complete_tabular_example.py", False)

If the new example has its own base script (e.g. cv_example and complete_cv_example), the path to that base script should be passed in. Also, if there are any differences in the way things are logged, accuracies reported, etc., they should be added to a special_strings list so that they are ignored:

   def test_cv_examples(self):
        cv_path = os.path.abspath(os.path.join("examples", "cv_example.py"))
        special_strings = [
            " " * 16 + "{\n\n",
            " " * 18 + '"accuracy": eval_metric["accuracy"],\n\n',
            " " * 18 + '"f1": eval_metric["f1"],\n\n',
            " " * 18 + '"train_loss": total_loss,\n\n',
            " " * 18 + '"epoch": epoch,\n\n',
            " " * 16 + "}\n",
            " " * 8,
        ]
        self.one_complete_example("complete_cv_example.py", False, cv_path, special_strings)

Anticipated maintenance burden? (What will happen in, say, 3 months if something changes?)

Most likely, sometime in the next few months we will want to add functionality that separates which features are checked in the complete examples and which are not (such as a cross-validation example).
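Purely as a hypothetical illustration of that future exclusion mechanism (nothing like this exists in this PR, and the file names are made up), it could be as simple as a mapping the tests consult before diffing:

# Hypothetical sketch: feature scripts that a given complete_* example is not
# expected to mirror, which the diff check would skip.
EXCLUDED_FEATURES = {
    "complete_nlp_example.py": ["cross_validation.py"],
    "complete_cv_example.py": ["cross_validation.py"],
}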

@muellerzr muellerzr added bug Something isn't working enhancement New feature or request labels Apr 12, 2022
@muellerzr muellerzr requested a review from sgugger April 12, 2022 01:07
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Apr 12, 2022

The documentation is not available anymore as the PR was closed or merged.

@muellerzr muellerzr marked this pull request as draft April 12, 2022 12:49
Collaborator

@sgugger sgugger left a comment


Clever way to ensure we test on a small dataset!
Last step should be to create a job to run those :-)

Comment on lines 203 to 204
if isinstance(checkpointing_steps, int):
overall_step += 1
Collaborator


Why not always do this?

Collaborator Author


It's only needed if we save via checkpointing steps rather than per epoch, so I didn't want people to assume we always need to do that.

(Which means a comment is needed!)
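For instance (illustrative only, not necessarily the exact comment that was added), the hunk above could read:

# overall_step is only consumed when saving every `checkpointing_steps` steps;
# when saving per epoch it is never read, so the counter only needs to advance
# in this branch.
if isinstance(checkpointing_steps, int):
    overall_step += 1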

Collaborator


Yeah, I get that the variable won't be used if we don't checkpoint with steps, but it doesn't hurt to always have it (and it would save one line of code).

@muellerzr
Collaborator Author

muellerzr commented Apr 12, 2022

I've come up with a solution to make sure that the main complete_ examples will also raise an error if they differ from the bits inside of by_feature. Eventually it would be good to have an exclusion list of files for when we wouldn't expect to see some feature added (such as, say, cross validation), but we'll cross that bridge when we get there. In the meantime, here is the report generated back to the user from pytest:

(Note: pytest-subtests was added as a dependency so we could use TestCase.subTest and keep this code clean and efficient.)

======================================================================= short test summary info =======================================================================
SUBFAIL tests/test_examples.py::ExampleDifferenceTests::test_complete_cv_example_body - AssertionError: 14 != 0
SUBFAIL tests/test_examples.py::ExampleDifferenceTests::test_complete_cv_example_body - AssertionError: 27 != 0
SUBFAIL tests/test_examples.py::ExampleDifferenceTests::test_complete_cv_example_parser - AssertionError: 1 != 0
SUBFAIL tests/test_examples.py::ExampleDifferenceTests::test_complete_cv_example_parser - AssertionError: 4 != 0
SUBFAIL tests/test_examples.py::ExampleDifferenceTests::test_complete_nlp_example_parser - AssertionError: 1 != 0
SUBFAIL tests/test_examples.py::ExampleDifferenceTests::test_complete_nlp_example_parser - AssertionError: 1 != 0
=============================================================== 6 failed, 4 passed, 1 warning in 1.64s ================================================================

{{ Removed full trace, see the following message for an example }}
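For context, here is a minimal, self-contained illustration of the TestCase.subTest pattern that, with pytest-subtests installed, produces SUBFAIL entries like the ones above. The class, method, and data names are made up for the example and are not the actual test code:

import unittest


class SubTestIllustration(unittest.TestCase):
    def test_complete_example(self):
        # Each failing subTest is reported as its own SUBFAIL line (with the
        # keyword arguments shown in the test id) instead of stopping the whole
        # test at the first failing section.
        fake_diffs = {"main()": [], "training_function()": ["accelerator.log(...)"]}
        for section, missing in fake_diffs.items():
            with self.subTest(tested_section=section):
                self.assertEqual(len(missing), 0)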

@muellerzr
Collaborator Author

I've slightly tweaked how these reports look; they now include the source-code diffs. I believe this is a must, as it tells the user exactly which parts were missing from the full example:

============================================================================== FAILURES ===============================================================================
_________ ExampleDifferenceTests.test_cv_example (feature_script='tracking.py', tested_script='complete_cv_example.py', tested_section='training_function()') _________

self = <test_examples.ExampleDifferenceTests testMethod=test_cv_example>, complete_file_name = 'complete_cv_example.py', parser_only = False

    def one_complete_example(self, complete_file_name: str, parser_only: bool):
        """
        Tests a single `complete` example against all of the implemented `by_feature` scripts
    
        Args:
            complete_file_name (`str`):
                The filename of a complete example
            parser_only (`bool`):
                Whether to look at the main training function, or the argument parser
        """
        self.maxDiff = None
        by_feature_path = os.path.abspath(os.path.join("examples", "by_feature"))
        examples_path = os.path.abspath("examples")
        for item in os.listdir(by_feature_path):
            item_path = os.path.join(by_feature_path, item)
            if os.path.isfile(item_path) and ".py" in item_path:
                with self.subTest(tested_script=complete_file_name, feature_script=item, tested_section="main()" if parser_only else "training_function()"):
                    diff = compare_against_test(
                        os.path.join(examples_path, "nlp_example.py"),
                        os.path.join(examples_path, complete_file_name),
                        item_path,
                        parser_only
                    )
>                   self.assertEqual('\n'.join(diff), '')
E                   AssertionError: '        accelerator = Accelerator(cpu=arg[693 chars]()\n' != ''
E                   -         accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision, log_with="all")
E                   - 
E                   -         accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
E                   - 
E                   -         accelerator.init_trackers("nlp_example", config)
E                   - 
E                   -             predictions, references = accelerator.gather((predictions, batch["labels"]))
E                   - 
E                   -                 predictions=predictions,
E                   - 
E                   -                 references=references,
E                   - 
E                   -             accelerator.log(
E                   - 
E                   -                 {
E                   - 
E                   -                     "accuracy": eval_metric["accuracy"],
E                   - 
E                   -                     "f1": eval_metric["f1"],
E                   - 
E                   -                     "train_loss": total_loss,
E                   - 
E                   -                     "epoch": epoch,
E                   - 
E                   -                 }
E                   - 
E                   -         accelerator.end_training()

tests/test_examples.py:107: AssertionError

@muellerzr muellerzr marked this pull request as ready for review April 12, 2022 21:59
Collaborator

@sgugger sgugger left a comment


Not sure I understand all the new tests you added, but it looks nice 😅

Comment on lines +123 to +125
self.one_complete_example("complete_nlp_example.py", True)
self.one_complete_example("complete_nlp_example.py", False)

Collaborator


It seems we always use the method with both flags; should we just remove that arg and put the two tests inside?

Collaborator Author

@muellerzr muellerzr Apr 13, 2022


I'm making a note of this in the documentation, but the reasoning for the separation is that it makes the test failure more readable as to which section failed, rather than producing one combined error.
Notice the tested_section part, e.g.:

_________ ExampleDifferenceTests.test_cv_example (feature_script='tracking.py', tested_script='complete_cv_example.py', tested_section='training_function()') _________

vs:

_________ ExampleDifferenceTests.test_cv_example (feature_script='tracking.py', tested_script='complete_cv_example.py', tested_section='main()') _________

(This is a pytest-subtests trick.)

@muellerzr muellerzr requested a review from sgugger April 13, 2022 16:34
@muellerzr muellerzr changed the title Add tests for example scripts + nits Create a testing framework for example scripts and fix current ones Apr 13, 2022
@muellerzr muellerzr merged commit 209db19 into main Apr 13, 2022
@muellerzr muellerzr deleted the example_tests branch April 13, 2022 17:24