Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[wip] fix imagenet example: lr_scheduler, loader workers, batch size when ddp #2432

Merged
merged 12 commits into from
Aug 8, 2020

Conversation

ruotianluo
Copy link
Contributor

What does this PR do?

Use the learning rate scheduler from the official pytorch examples/imagenet
Add workers as an argument (instead of using 0)
Fix batch size, when use ddp as distributed backend.

Fixes #2422

@pep8speaks
Copy link

pep8speaks commented Jun 30, 2020

Hello @ruotianluo! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-08 18:14:30 UTC

@mergify mergify bot requested a review from a team June 30, 2020 16:22
@ruotianluo
Copy link
Contributor Author

Not done yet. Fixing evaluation now.

@Borda Borda changed the title fix imagenet example: lr_scheduler, loader workers, batch size when ddp [wip] fix imagenet example: lr_scheduler, loader workers, batch size when ddp Jun 30, 2020
@Borda
Copy link
Member

Borda commented Jun 30, 2020

Not done yet. Fixing evaluation now.

Mind add a test for the example, similar to here #2285 or create a small synthetic dataset and run a few steps...

@ruotianluo
Copy link
Contributor Author

ruotianluo commented Jun 30, 2020

Will try.

@ruotianluo
Copy link
Contributor Author

@Borda Not exactly sure how test should look like. I add one mimicking the commit you attached. Please advice.

Copy link
Member

@Borda Borda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, this is what I had in mind, just resolve the image source...

@mergify mergify bot requested a review from a team July 2, 2020 11:03
@Borda Borda added bug Something isn't working ci Continuous Integration labels Jul 2, 2020
@Borda Borda changed the title [wip] fix imagenet example: lr_scheduler, loader workers, batch size when ddp fix imagenet example: lr_scheduler, loader workers, batch size when ddp Jul 2, 2020
@Borda Borda force-pushed the imagenet_example branch from 3158efd to 51e04e2 Compare July 2, 2020 16:14
Copy link
Member

@Borda Borda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀

@mergify mergify bot requested a review from a team July 2, 2020 16:14
@Borda
Copy link
Member

Borda commented Jul 2, 2020

it seems that the TPU build is not pushed to GKE... @zcain117

  env:
    PROJECT_ID: 
    GKE_CLUSTER: lightning-cluster
    GKE_ZONE: us-central1-a
    IMAGE: gcr.io//tpu-testing-image
    GOROOT: /opt/hostedtoolcache/go/1.14.4/x64
    CLOUDSDK_METRICS_ENVIRONMENT: github-actions-setup-gcloud
invalid argument "gcr.io//tpu-testing-image:155589184" for "-t, --tag" flag: invalid reference format
See 'docker build --help'.
##[error]Process completed with exit code 125.

Comment on lines 183 to 198
def test_dataloader(self, *args, **kwargs):
return self.val_dataloader(*args, **kwargs)

def test_step(self, *args, **kwargs):
return self.validation_step(*args, **kwargs)

def test_epoch_end(self, *args, **kwargs):
return self.validation_epoch_end(*args, **kwargs)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not correct to directly redirect to validation methods here, because the metric names will refer to "val_...". This will affect the logging and progress bar display.
You can do it like this but you need to fetch the output and replace the key names to "test_...".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

anyway, in v0.9 this will change with the new structured outputs:)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@awaelchli mind edit it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason I do this is to use the Trainer.test(). For imagenet, the evaluation is supposed to run on validation set, so the progress bar is fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just want to make sure this example is solid and does not create a misunderstanding of what test and eval is.
You can always pass in the val dataloader to Trainer.test(), but what I mean is that the logged plots will look weird if you run validation during training and then at the end also run test, which will append the logs to the validation results if the have the same names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like this?

outputs = self.validation_epoch_end(*args, **kwargs)
outputs = {k.replace('val', 'test'):v for k,v in outputs.items()}
return outputs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed something.

@mergify mergify bot requested a review from a team July 2, 2020 16:42
_make_image(os.path.join(tmpdir, split, class_id, str(image_id)+'.JPEG'))

cli_args = cli_args.split(' ') if cli_args else []
cli_args += ['--data-path', tmpdir]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to change to cli_args += ['--data-path', str(tmpdir)] to pass python -m pytest pl_examples/test_examples.py.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Borda I changed it in the recent commit so that I can pass the test locally, feel free to change it back.

@Borda Borda changed the title fix imagenet example: lr_scheduler, loader workers, batch size when ddp [wip] fix imagenet example: lr_scheduler, loader workers, batch size when ddp Jul 2, 2020
@codecov
Copy link

codecov bot commented Jul 2, 2020

Codecov Report

Merging #2432 into imagenet_example will increase coverage by 32%.
The diff coverage is n/a.

@@                Coverage Diff                 @@
##           imagenet_example   #2432     +/-   ##
==================================================
+ Coverage                59%     90%    +32%     
==================================================
  Files                    79      79             
  Lines                  7152    7239     +87     
==================================================
+ Hits                   4196    6533   +2337     
+ Misses                 2956     706   -2250     

@mergify
Copy link
Contributor

mergify bot commented Jul 3, 2020

This pull request is now in conflict... :(

@ruotianluo ruotianluo force-pushed the imagenet_example branch 2 times, most recently from 22d43ab to af79cf2 Compare July 5, 2020 21:37
@mergify
Copy link
Contributor

mergify bot commented Jul 5, 2020

This pull request is now in conflict... :(

@mergify
Copy link
Contributor

mergify bot commented Jul 9, 2020

This pull request is now in conflict... :(

@mergify
Copy link
Contributor

mergify bot commented Jul 14, 2020

This pull request is now in conflict... :(

@Borda Borda added this to the 0.9.0 milestone Aug 6, 2020
@mergify mergify bot requested a review from a team August 8, 2020 15:05
@awaelchli
Copy link
Contributor

finishing the PR here in #2889 with your commits added. thanks for your patience :)

@awaelchli awaelchli closed this Aug 8, 2020
@Borda
Copy link
Member

Borda commented Aug 8, 2020

@awaelchli a better way is to merge this PR to your continuous branch...

@Borda Borda reopened this Aug 8, 2020
@awaelchli awaelchli changed the base branch from master to imagenet_example August 8, 2020 18:10
@mergify
Copy link
Contributor

mergify bot commented Aug 8, 2020

This pull request is now in conflict... :(

@awaelchli awaelchli merged commit 13c73f1 into Lightning-AI:imagenet_example Aug 8, 2020
williamFalcon pushed a commit that referenced this pull request Aug 9, 2020
* fix imagenet example: lr_scheduler, loader workers, batch size when ddp

* Fix evaluation for imagenet example

* add imagenet example test

* cleanup

* gpu

* add imagenet example evluation test

* fix test output

* test is fixed in master, remove unecessary hack

* CHANGE

* Apply suggestions from code review

* image net example

* update imagenet example

* update example

* pep

* imports

* type hint

* docs

* obsolete arg

* [wip] fix imagenet example: lr_scheduler, loader workers, batch size when ddp (#2432)

* fix imagenet example: lr_scheduler, loader workers, batch size when ddp

* Fix evaluation for imagenet example

* add imagenet example test

* cleanup

* gpu

* add imagenet example evluation test

* fix test output

* test is fixed in master, remove unecessary hack

* CHANGE

* Apply suggestions from code review

Co-authored-by: Jirka <[email protected]>
Co-authored-by: Adrian Wälchli <[email protected]>

* update chlog

* add missing chlog

* pep

* pep

Co-authored-by: Ruotian Luo <[email protected]>
Co-authored-by: Jirka <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working ci Continuous Integration
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Imagenet example use num_workers 0
4 participants