[wip] fix imagenet example: lr_scheduler, loader workers, batch size when ddp #2432

ruotianluo · 2020-06-30T16:21:53Z

What does this PR do?

Use the learning rate scheduler from the official pytorch examples/imagenet.
Add workers as an argument (instead of hard-coding 0).
Fix the batch size when ddp is used as the distributed backend.

Fixes #2422
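
For readers who do not have the official example at hand, here is a minimal sketch of the two numeric rules the description refers to. It assumes the schedule used by pytorch examples/imagenet (multiply the learning rate by 0.1 every 30 epochs) and that the requested batch size is split evenly across ddp processes. The function names are hypothetical and this is not the code from this PR.

```python
def step_decay_lr(base_lr: float, epoch: int, step: int = 30, gamma: float = 0.1) -> float:
    """Learning rate after `epoch` epochs: decay by `gamma` every `step` epochs."""
    return base_lr * (gamma ** (epoch // step))


def per_process_batch_size(total_batch_size: int, num_processes: int) -> int:
    """Batch size each ddp process should load so the global batch stays fixed."""
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    # Integer division: any remainder is dropped, as a simplifying assumption.
    return total_batch_size // num_processes


if __name__ == "__main__":
    for epoch in (0, 29, 30, 60):
        print(epoch, step_decay_lr(0.1, epoch))
    print(per_process_batch_size(256, 8))
```

With torch available, the same schedule could be expressed as `torch.optim.lr_scheduler.LambdaLR(optimizer, lambda epoch: 0.1 ** (epoch // 30))`.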

pep8speaks · 2020-06-30T16:21:55Z

Hello @ruotianluo! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-08 18:14:30 UTC

ruotianluo · 2020-06-30T20:31:47Z

Not done yet. Fixing evaluation now.

Borda · 2020-06-30T20:55:53Z

> Not done yet. Fixing evaluation now.

Mind adding a test for the example, similar to the one in #2285, or creating a small synthetic dataset and running a few steps?

ruotianluo · 2020-06-30T23:56:39Z

Will try.

ruotianluo · 2020-07-02T02:59:04Z

@Borda Not exactly sure what the test should look like. I added one mimicking the commit you attached. Please advise.

Borda

Cool, this is what I had in mind, just resolve the image source...

pl_examples/test_examples.py

Borda

LGTM 🚀

Borda · 2020-07-02T16:21:07Z

it seems that the TPU build is not pushed to GKE... @zcain117

  env:
    PROJECT_ID: 
    GKE_CLUSTER: lightning-cluster
    GKE_ZONE: us-central1-a
    IMAGE: gcr.io//tpu-testing-image
    GOROOT: /opt/hostedtoolcache/go/1.14.4/x64
    CLOUDSDK_METRICS_ENVIRONMENT: github-actions-setup-gcloud
invalid argument "gcr.io//tpu-testing-image:155589184" for "-t, --tag" flag: invalid reference format
See 'docker build --help'.
##[error]Process completed with exit code 125.
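
The log shows an empty PROJECT_ID, which leaves an empty path component in the image reference (gcr.io//tpu-testing-image) and makes docker reject the tag. Why the variable is empty is not stated here. A small guard like the following sketch would surface the problem before docker build is called; the helper name and its arguments are hypothetical and not part of the actual workflow.

```python
def build_image_tag(project_id: str, image_name: str, build_id: str,
                    registry: str = "gcr.io") -> str:
    """Assemble a registry/project/image:tag reference, refusing empty parts."""
    for label, value in (("project_id", project_id),
                         ("image_name", image_name),
                         ("build_id", build_id)):
        if not value:
            # An empty component would produce e.g. "gcr.io//image:tag".
            raise ValueError(f"{label} is empty; cannot build an image reference")
    return f"{registry}/{project_id}/{image_name}:{build_id}"


if __name__ == "__main__":
    print(build_image_tag("my-project", "tpu-testing-image", "155589184"))
```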

awaelchli · 2020-07-02T16:37:23Z

pl_examples/domain_templates/imagenet.py

+    def test_dataloader(self, *args, **kwargs):
+        return self.val_dataloader(*args, **kwargs)
+
+    def test_step(self, *args, **kwargs):
+        return self.validation_step(*args, **kwargs)
+
+    def test_epoch_end(self, *args, **kwargs):
+        return self.validation_epoch_end(*args, **kwargs)
+


It is not correct to redirect straight to the validation methods here, because the metric names will still be "val_...". This affects the logging and the progress bar display.
You can do it like this, but you need to fetch the output and rename the keys to "test_...".

anyway, in v0.9 this will change with the new structured outputs:)

@awaelchli mind editing it?

The main reason I do this is to use the Trainer.test(). For imagenet, the evaluation is supposed to run on validation set, so the progress bar is fine.

I just want to make sure this example is solid and does not create a misunderstanding of what test and eval are.
You can always pass the val dataloader to Trainer.test(), but what I mean is that the logged plots will look weird if you run validation during training and then also run test at the end, which will append the logs to the validation results if they have the same names.

Something like this?

    outputs = self.validation_epoch_end(*args, **kwargs)
    outputs = {k.replace('val', 'test'): v for k, v in outputs.items()}
    return outputs

I pushed something.
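
A slightly more defensive variant of the renaming idea discussed above, as a sketch only: it assumes the epoch-end output is a plain dict that may contain nested dicts (for example under keys such as 'log' or 'progress_bar'), and it renames only keys that start with the val_ prefix. The helper name rename_metric_keys is hypothetical.

```python
from typing import Any, Dict


def rename_metric_keys(outputs: Dict[str, Any],
                       old_prefix: str = "val_",
                       new_prefix: str = "test_") -> Dict[str, Any]:
    """Return a copy of `outputs` with `old_prefix` keys renamed, recursing into dicts."""
    renamed: Dict[str, Any] = {}
    for key, value in outputs.items():
        if key.startswith(old_prefix):
            key = new_prefix + key[len(old_prefix):]
        if isinstance(value, dict):
            value = rename_metric_keys(value, old_prefix, new_prefix)
        renamed[key] = value
    return renamed


if __name__ == "__main__":
    val_outputs = {"val_loss": 0.5, "log": {"val_acc1": 0.7, "epoch": 3}}
    print(rename_metric_keys(val_outputs))
```

Matching on the prefix rather than calling str.replace avoids touching keys that merely contain "val" somewhere in the middle.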

ruotianluo · 2020-07-02T17:14:09Z

pl_examples/test_examples.py

+                _make_image(os.path.join(tmpdir, split, class_id, str(image_id)+'.JPEG'))
+
+    cli_args = cli_args.split(' ') if cli_args else []
+    cli_args += ['--data-path', tmpdir]


I had to change this to cli_args += ['--data-path', str(tmpdir)] so that python -m pytest pl_examples/test_examples.py passes.

@Borda I changed it in the recent commit so that I can pass the test locally, feel free to change it back.
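
To illustrate the str(tmpdir) change: argparse works on strings, so a path object (pytest's tmpdir fixture, or a pathlib.Path as used below) is converted before it is appended to the argument list. This is a self-contained sketch. The parser and the --epochs flag are hypothetical stand-ins, not the example's real argument parser.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def build_cli_args(cli_args: Optional[str], data_path) -> List[str]:
    """Split a CLI string and append the data path as a plain string."""
    args = cli_args.split(' ') if cli_args else []
    # str() keeps every element of the list a string, which argparse expects.
    args += ['--data-path', str(data_path)]
    return args


def make_parser() -> argparse.ArgumentParser:
    """Hypothetical parser with just enough options for the demonstration."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--data-path', type=str)
    parser.add_argument('--epochs', type=int, default=1)
    return parser


if __name__ == "__main__":
    argv = build_cli_args('--epochs 2', Path('some') / 'data')
    print(make_parser().parse_args(argv))
```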

codecov · 2020-07-02T18:05:28Z

Codecov Report

Merging #2432 into imagenet_example will increase coverage by 32%.
The diff coverage is n/a.

@@                Coverage Diff                 @@
##           imagenet_example   #2432     +/-   ##
==================================================
+ Coverage                59%     90%    +32%     
==================================================
  Files                    79      79             
  Lines                  7152    7239     +87     
==================================================
+ Hits                   4196    6533   +2337     
+ Misses                 2956     706   -2250

mergify · 2020-07-03T17:24:04Z

This pull request is now in conflict... :(

mergify · 2020-07-05T23:58:27Z

This pull request is now in conflict... :(

mergify · 2020-07-09T11:17:03Z

This pull request is now in conflict... :(

mergify · 2020-07-14T18:23:09Z

This pull request is now in conflict... :(

pl_examples/domain_templates/imagenet.py

awaelchli · 2020-08-08T16:23:42Z

finishing the PR here in #2889 with your commits added. thanks for your patience :)

Borda · 2020-08-08T17:05:23Z

@awaelchli a better way is to merge this PR to your continuous branch...

mergify · 2020-08-08T18:11:19Z

This pull request is now in conflict... :(

* fix imagenet example: lr_scheduler, loader workers, batch size when ddp
* Fix evaluation for imagenet example
* add imagenet example test
* cleanup
* gpu
* add imagenet example evluation test
* fix test output
* test is fixed in master, remove unecessary hack
* CHANGE
* Apply suggestions from code review
* image net example
* update imagenet example
* update example
* pep
* imports
* type hint
* docs
* obsolete arg
* [wip] fix imagenet example: lr_scheduler, loader workers, batch size when ddp (#2432)
* fix imagenet example: lr_scheduler, loader workers, batch size when ddp
* Fix evaluation for imagenet example
* add imagenet example test
* cleanup
* gpu
* add imagenet example evluation test
* fix test output
* test is fixed in master, remove unecessary hack
* CHANGE
* Apply suggestions from code review

Co-authored-by: Jirka <[email protected]>
Co-authored-by: Adrian Wälchli <[email protected]>

* update chlog
* add missing chlog
* pep
* pep

Co-authored-by: Ruotian Luo <[email protected]>
Co-authored-by: Jirka <[email protected]>

mergify bot requested a review from a team June 30, 2020 16:22

ruotianluo force-pushed the imagenet_example branch from 40a3f2f to 94f28bd Compare June 30, 2020 18:15

Borda changed the title ~~fix imagenet example: lr_scheduler, loader workers, batch size when ddp~~ [wip] fix imagenet example: lr_scheduler, loader workers, batch size when ddp Jun 30, 2020

Borda requested changes Jul 2, 2020

pl_examples/test_examples.py (outdated)

mergify bot requested a review from a team July 2, 2020 11:03

ruotianluo force-pushed the imagenet_example branch from f7e80a6 to cbf87d5 Compare July 2, 2020 15:05

Borda added the bug (Something isn't working) and ci (Continuous Integration) labels Jul 2, 2020

Borda changed the title ~~[wip] fix imagenet example: lr_scheduler, loader workers, batch size when ddp~~ fix imagenet example: lr_scheduler, loader workers, batch size when ddp Jul 2, 2020

Borda force-pushed the imagenet_example branch from 3158efd to 51e04e2 Compare July 2, 2020 16:14

Borda approved these changes Jul 2, 2020

mergify bot requested a review from a team July 2, 2020 16:14

awaelchli reviewed Jul 2, 2020

mergify bot requested a review from a team July 2, 2020 16:42

ruotianluo commented Jul 2, 2020

Borda changed the title ~~fix imagenet example: lr_scheduler, loader workers, batch size when ddp~~ [wip] fix imagenet example: lr_scheduler, loader workers, batch size when ddp Jul 2, 2020

ruotianluo force-pushed the imagenet_example branch 2 times, most recently from 22d43ab to af79cf2 Compare July 5, 2020 21:37

ruotianluo force-pushed the imagenet_example branch from af79cf2 to e9172f2 Compare July 7, 2020 17:44

fix imagenet example: lr_scheduler, loader workers, batch size when ddp

8b5c1a2

ruotianluo and others added 8 commits July 12, 2020 15:17

Fix evaluation for imagenet example

10bb4bb

add imagenet example test

3c0975f

cleanup

8498fa4

gpu

e3ba4e1

add imagenet example evluation test

31585c3

fix test output

385f75a

test is fixed in master, remove unecessary hack

c88ec19

CHANGE

c968633

ruotianluo force-pushed the imagenet_example branch from e9172f2 to c968633 Compare July 12, 2020 22:33

Borda added this to the 0.9.0 milestone Aug 6, 2020

Merge branch 'master' into imagenet_example

aee4af4

awaelchli reviewed Aug 8, 2020

pl_examples/domain_templates/imagenet.py (outdated)

pl_examples/domain_templates/imagenet.py (outdated)

mergify bot requested a review from a team August 8, 2020 15:05

Apply suggestions from code review

cd1e476

awaelchli mentioned this pull request Aug 8, 2020

Finish PR #2432: Imagenet example updates + basic testing #2889

Merged

awaelchli closed this Aug 8, 2020

Borda reopened this Aug 8, 2020

awaelchli changed the base branch from master to imagenet_example August 8, 2020 18:10

Merge branch 'imagenet_example' into imagenet_example

40f4af9

awaelchli merged commit 13c73f1 into Lightning-AI:imagenet_example Aug 8, 2020
