
Create a testing framework for example scripts and fix current ones #313

Merged 28 commits into main on Apr 13, 2022

Conversation

muellerzr
Collaborator

@muellerzr muellerzr commented Apr 12, 2022

Add tests for example scripts, and fix current ones

What does this add?

Since the DataLoaders are separated out into a function in the by_feature scripts, we can write tests for all of the examples to prove whether they can run all the way through and whether they output the data we're interested in.

These differ slightly from the transformers examples tests, as we don't get good performance out of these scripts due to the defaults each one sets. Instead we focus on:

  1. Whether the script can run all the way through
  2. Whether the specific feature we are looking for produces the output we want (such as a checkpoint file, evidence of a log, etc.); see the sketch after this list
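For illustration only, here is a minimal sketch of what such an end-to-end check can look like. The script path, CLI flags, and checkpoint folder naming below are assumptions made for this example, not the actual code in test_examples.py:

import os
import subprocess
import sys
import tempfile
import unittest


class ExampleRunsTest(unittest.TestCase):
    def test_checkpointing_example_runs(self):
        # Run a by_feature example end to end on its (tiny) defaults, then check
        # that the feature-specific artifact exists rather than asserting on
        # model quality.
        script = os.path.join("examples", "by_feature", "checkpointing.py")
        with tempfile.TemporaryDirectory() as tmpdir:
            subprocess.run([sys.executable, script, "--output_dir", tmpdir], check=True)
            # Assumed naming: the script writes per-epoch checkpoint folders.
            self.assertTrue(any("epoch" in name for name in os.listdir(tmpdir)))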

Furthermore, this adds a way for us to see whether the complete_*_example.py scripts are out of date. This is an automated process that checks for diffs between the complete script, the template, and each of the by_feature examples. If any lines are missing, the user is told which ones, prompting them to fix it (via pytest). An example error log can be seen in the subsequent messages.
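The real logic lives in the compare_against_test helper used in tests/test_examples.py (shown in the failure trace further down). The rough idea, with simplified helper names that are assumptions for this sketch rather than the actual implementation, is:

def missing_feature_lines(base_path, complete_path, feature_path):
    # Sketch only: collect the lines a by_feature script adds on top of the
    # shared base template, then report any of those lines that the complete_*
    # example does not contain.
    def read_lines(path):
        with open(path) as f:
            return f.readlines()

    base = set(read_lines(base_path))
    complete = set(read_lines(complete_path))
    feature_lines = read_lines(feature_path)

    feature_only = [line for line in feature_lines if line not in base]
    return [line for line in feature_only if line not in complete]

The real helper additionally restricts the comparison to either the argument parser or the training function (the parser_only flag that appears in the examples below) and drops the special_strings that are expected to differ.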

Who is it for?

  • Maintainers, as a way for us to easily check whether our examples can run and whether any scripts have become outdated
  • Users, as they have the peace of mind of knowing that all the examples inside our repo are up to date and work as intended

Why is it needed?

I chose this complex automated process because we very quickly identified three areas of potential chaos when it comes to the scripts:

  • The base scripts
  • The complete scripts
  • The by_feature scripts

By adding a check directly to our regular pytest tests, if any behavior changes or a new by_feature script is added, we immediately know whether a complete_* script should be changed and what needs to be added.

As a result, the complete_* scripts now exactly mimic the feature scripts in their usage behavior, which is a good thing.

These checks are also exceedingly quick (a second or two at most).

Basic usage examples:

When adding a new example to the test_examples.py script: if a new base script was made, it should be added under a new test name, such as test_tabular_examples, with two calls to one_complete_example. The first should check for diffs in the argument parser (main()), the second in the raw training loop (training_function()). This looks like so:

def test_tabular_examples(self):
    self.one_complete_example("complete_tabular_example.py", True)
    self.one_complete_example("complete_tabular_example.py", False)

If the new example has its own base script (e.g. cv_example and complete_cv_example), the path to that base script should be passed in. Also, if there are any differences in the way things are logged, accuracies reported, etc., they should be added to a special_strings list so that they are ignored:

   def test_cv_examples(self):
        cv_path = os.path.abspath(os.path.join("examples", "cv_example.py"))
        special_strings = [
            " " * 16 + "{\n\n",
            " " * 18 + '"accuracy": eval_metric["accuracy"],\n\n',
            " " * 18 + '"f1": eval_metric["f1"],\n\n',
            " " * 18 + '"train_loss": total_loss,\n\n',
            " " * 18 + '"epoch": epoch,\n\n',
            " " * 16 + "}\n",
            " " * 8,
        ]
        self.one_complete_example("complete_cv_example.py", False, cv_path, special_strings)

Anticipated maintenance burden? (What will happen in, say, 3 months if something changes?)

Most likely, sometime in the next few months we will want to add functionality that separates which features are checked in the complete examples and which are not (such as a cross-validation example).
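Purely as a hypothetical illustration of that future exclusion mechanism (nothing like this exists in this PR, and the file names are made up), it could be as simple as a mapping the tests consult before diffing:

# Hypothetical sketch: feature scripts that a given complete_* example is not
# expected to mirror, which the diff check would skip.
EXCLUDED_FEATURES = {
    "complete_nlp_example.py": ["cross_validation.py"],
    "complete_cv_example.py": ["cross_validation.py"],
}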

@muellerzr muellerzr added bug Something isn't working enhancement New feature or request labels Apr 12, 2022
@muellerzr muellerzr requested a review from sgugger April 12, 2022 01:07
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Apr 12, 2022

The documentation is not available anymore as the PR was closed or merged.

@muellerzr muellerzr marked this pull request as draft April 12, 2022 12:49
Collaborator

@sgugger sgugger left a comment


Clever way to ensure we test on a small dataset!
Last step should be to create a job to run those :-)

Comment on lines 203 to 204
if isinstance(checkpointing_steps, int):
overall_step += 1
Collaborator


Why not always do this?

Collaborator Author


It's only needed if we save via checkpointing steps rather than per epoch, so I didn't want people to assume we always need to do that.

(Which means a comment is needed!)
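For instance (illustrative only, not necessarily the exact comment that was added), the hunk above could read:

# overall_step is only consumed when saving every `checkpointing_steps` steps;
# when saving per epoch it is never read, so the counter only needs to advance
# in this branch.
if isinstance(checkpointing_steps, int):
    overall_step += 1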

Collaborator


Yeah, I get that the variable won't be used if we don't checkpoint with steps, but it doesn't hurt to always have it (and it would save one line of code).

@muellerzr
Collaborator Author

muellerzr commented Apr 12, 2022

I've come up with a solution to make sure that the main complete_ examples will also raise an error if they differ from the bits inside of by_feature. Eventually it would be good to have an exclusion list of files for when we wouldn't expect to see some feature added (such as, say, cross validation), but we'll cross that bridge when we get there. In the meantime, here is the report generated back to the user from pytest:

(Note: pytest-subtests was added as a dependency so we could use TestCase.subTest and keep this code clean and efficient.)

======================================================================= short test summary info =======================================================================
SUBFAIL tests/test_examples.py::ExampleDifferenceTests::test_complete_cv_example_body - AssertionError: 14 != 0
SUBFAIL tests/test_examples.py::ExampleDifferenceTests::test_complete_cv_example_body - AssertionError: 27 != 0
SUBFAIL tests/test_examples.py::ExampleDifferenceTests::test_complete_cv_example_parser - AssertionError: 1 != 0
SUBFAIL tests/test_examples.py::ExampleDifferenceTests::test_complete_cv_example_parser - AssertionError: 4 != 0
SUBFAIL tests/test_examples.py::ExampleDifferenceTests::test_complete_nlp_example_parser - AssertionError: 1 != 0
SUBFAIL tests/test_examples.py::ExampleDifferenceTests::test_complete_nlp_example_parser - AssertionError: 1 != 0
=============================================================== 6 failed, 4 passed, 1 warning in 1.64s ================================================================

{{ Removed full trace, see the following message for an example }}
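For context, here is a minimal, self-contained illustration of the TestCase.subTest pattern that, with pytest-subtests installed, produces SUBFAIL entries like the ones above. The class, method, and data names are made up for the example and are not the actual test code:

import unittest


class SubTestIllustration(unittest.TestCase):
    def test_complete_example(self):
        # Each failing subTest is reported as its own SUBFAIL line (with the
        # keyword arguments shown in the test id) instead of stopping the whole
        # test at the first failing section.
        fake_diffs = {"main()": [], "training_function()": ["accelerator.log(...)"]}
        for section, missing in fake_diffs.items():
            with self.subTest(tested_section=section):
                self.assertEqual(len(missing), 0)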

@muellerzr
Collaborator Author

I've slightly tweaked how these reports look; they now include the source-code diffs. I believe this is a must, as it tells the user exactly which parts were missing from the full example:

============================================================================== FAILURES ===============================================================================
_________ ExampleDifferenceTests.test_cv_example (feature_script='tracking.py', tested_script='complete_cv_example.py', tested_section='training_function()') _________

self = <test_examples.ExampleDifferenceTests testMethod=test_cv_example>, complete_file_name = 'complete_cv_example.py', parser_only = False

    def one_complete_example(self, complete_file_name: str, parser_only: bool):
        """
        Tests a single `complete` example against all of the implemented `by_feature` scripts
    
        Args:
            complete_file_name (`str`):
                The filename of a complete example
            parser_only (`bool`):
                Whether to look at the main training function, or the argument parser
        """
        self.maxDiff = None
        by_feature_path = os.path.abspath(os.path.join("examples", "by_feature"))
        examples_path = os.path.abspath("examples")
        for item in os.listdir(by_feature_path):
            item_path = os.path.join(by_feature_path, item)
            if os.path.isfile(item_path) and ".py" in item_path:
                with self.subTest(tested_script=complete_file_name, feature_script=item, tested_section="main()" if parser_only else "training_function()"):
                    diff = compare_against_test(
                        os.path.join(examples_path, "nlp_example.py"),
                        os.path.join(examples_path, complete_file_name),
                        item_path,
                        parser_only
                    )
>                   self.assertEqual('\n'.join(diff), '')
E                   AssertionError: '        accelerator = Accelerator(cpu=arg[693 chars]()\n' != ''
E                   -         accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision, log_with="all")
E                   - 
E                   -         accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
E                   - 
E                   -         accelerator.init_trackers("nlp_example", config)
E                   - 
E                   -             predictions, references = accelerator.gather((predictions, batch["labels"]))
E                   - 
E                   -                 predictions=predictions,
E                   - 
E                   -                 references=references,
E                   - 
E                   -             accelerator.log(
E                   - 
E                   -                 {
E                   - 
E                   -                     "accuracy": eval_metric["accuracy"],
E                   - 
E                   -                     "f1": eval_metric["f1"],
E                   - 
E                   -                     "train_loss": total_loss,
E                   - 
E                   -                     "epoch": epoch,
E                   - 
E                   -                 }
E                   - 
E                   -         accelerator.end_training()

tests/test_examples.py:107: AssertionError

@muellerzr muellerzr marked this pull request as ready for review April 12, 2022 21:59
Collaborator

@sgugger sgugger left a comment


Not sure I understand all the new tests you added, but it looks nice 😅

Comment on lines +123 to +125
self.one_complete_example("complete_nlp_example.py", True)
self.one_complete_example("complete_nlp_example.py", False)

Collaborator


It seems we always use the method with both flags; should we just remove that arg and put the two tests inside?

Collaborator Author

@muellerzr muellerzr Apr 13, 2022


I'm making a note of this in the documentation, but the reasoning for the separation is that it makes the test failure more readable as to which section failed, rather than producing one combined error.
Notice the tested_section part, e.g.:

_________ ExampleDifferenceTests.test_cv_example (feature_script='tracking.py', tested_script='complete_cv_example.py', tested_section='training_function()') _________

vs:

_________ ExampleDifferenceTests.test_cv_example (feature_script='tracking.py', tested_script='complete_cv_example.py', tested_section='main()') _________

(This is a pytest-subtests trick.)

@muellerzr muellerzr requested a review from sgugger April 13, 2022 16:34
@muellerzr muellerzr changed the title Add tests for example scripts + nits Create a testing framework for example scripts and fix current ones Apr 13, 2022
@muellerzr muellerzr merged commit 209db19 into main Apr 13, 2022
@muellerzr muellerzr deleted the example_tests branch April 13, 2022 17:24