
Inconsistent validation dataloader reloading when using reload_dataloaders_every_n_epochs and check_val_every_n_epoch flags #10948

Closed
adamviola opened this issue Dec 6, 2021 · 7 comments · Fixed by #11036
Labels
bug (Something isn't working) · data handling (Generic data-related topic) · good first issue (Good for newcomers) · priority: 1 (Medium priority task)

Comments


adamviola commented Dec 6, 2021

🐛 Bug

When the reload_dataloaders_every_n_epochs and check_val_every_n_epoch flags of the Trainer are used, the validation dataloader may reload inconsistently or not reload at all.

A few examples:

When reload_dataloaders_every_n_epochs = 2 and check_val_every_n_epoch = 3, the validation dataloader reloads every 6 epochs (should be 3).

When reload_dataloaders_every_n_epochs = 2 and check_val_every_n_epoch = 2, the validation dataloader never reloads.

To Reproduce

BoringModel Demo

Expected behavior

When reload_dataloaders_every_n_epochs <= check_val_every_n_epoch, I expect the validation dataloader to reload before each validation run.
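The expected schedule can be sketched as a small standalone model (an assumption of mine, not Lightning's actual code: epochs are 0-indexed, validation runs when (epoch + 1) % check_val_every_n_epoch == 0, and the dataloader is reloaded right before a validation run whenever at least reload_dataloaders_every_n_epochs epochs have passed since the last reload):

```python
def expected_reload_epochs(n_epochs, check_val_every_n_epoch, max_epochs=20):
    """Epochs at which the val dataloader should be reloaded (sketch)."""
    reloads, last = [], None
    for epoch in range(max_epochs):
        runs_validation = (epoch + 1) % check_val_every_n_epoch == 0
        if runs_validation and (last is None or epoch - last >= n_epochs):
            reloads.append(epoch)
            last = epoch
    return reloads

print(expected_reload_epochs(2, 3))  # [2, 5, 8, 11, 14, 17] -> every validation run
print(expected_reload_epochs(2, 2))  # [1, 3, 5, ..., 19]    -> every validation run
```

Under this model, whenever reload_dataloaders_every_n_epochs <= check_val_every_n_epoch, every validation run gets a fresh dataloader.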

Environment

  • CUDA:
    • GPU:
    • available: False
    • version: 11.1
  • Packages:
    • numpy: 1.19.5
    • pyTorch_debug: False
    • pyTorch_version: 1.10.0+cu111
    • pytorch-lightning: 1.5.4
    • tqdm: 4.62.3
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.7.12
    • version: # 1 SMP Sat Jun 5 09:50:34 PDT 2021

Additional context

The culprit is probably _should_reload_dl_epoch. The logic not self.current_epoch % n_epochs doesn't consider when the validation dataloader was last reloaded.
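Both reported schedules can be reproduced with a standalone model of that condition (this is my sketch of how the pieces appear to interact, not Lightning's actual implementation: epochs are 0-indexed, validation runs when (epoch + 1) % check_val_every_n_epoch == 0, and the reload condition is only evaluated on validation epochs):

```python
def reload_epochs(n_epochs, check_val_every_n_epoch, max_epochs=20):
    """Epochs at which the val dataloader actually reloads under the suspect logic."""
    reloads = []
    for epoch in range(max_epochs):
        runs_validation = (epoch + 1) % check_val_every_n_epoch == 0
        should_reload = not epoch % n_epochs  # the suspect modulo check
        if runs_validation and should_reload:
            reloads.append(epoch)
    return reloads

print(reload_epochs(2, 3))  # [2, 8, 14] -> reloads every 6 epochs, not every 3
print(reload_epochs(2, 2))  # []         -> never reloads
```

The modulo check intersects the validation schedule with multiples of n_epochs, so reloads happen every lcm of the two values, or never when the two schedules are disjoint.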

cc @Borda @justusschock @awaelchli @ninginthecloud @tchaton

@adamviola adamviola added the bug Something isn't working label Dec 6, 2021
@tchaton tchaton added priority: 1 Medium priority task good first issue Good for newcomers labels Dec 6, 2021

tchaton commented Dec 6, 2021

Hey @adamviola,

Thanks for raising this issue. Would you be interested to contribute a fix?

Best,
T.C

@tchaton tchaton added the data handling Generic data-related topic label Dec 6, 2021
@adamviola
Contributor Author

Sure, I can give it a go

@adamviola
Contributor Author

While working on the fix, I ran into unexpected behavior when both the val_check_interval and reload_dataloaders_every_n_epochs flags are used. I'm not sure whether the fix should keep this behavior.

Currently, the validation dataloader is reloaded for every validation check in epochs that are multiples of reload_dataloaders_every_n_epochs.

For example, suppose reload_dataloaders_every_n_epochs = 2 and val_check_interval = 0.3. The validation dataloader is reloaded for each validation check (3 times) in epochs 0, 2, 4, 6, etc., but it is not reloaded at all in epochs 1, 3, 5, 7, etc.

When reload_dataloaders_every_n_epochs = 1 and val_check_interval = 0.3, the validation dataloader reloads for every validation check.

I would expect the validation dataloader to only reload once every reload_dataloaders_every_n_epochs epochs. What do you think @tchaton?
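The described interaction can be modeled with a short standalone sketch (my own model of the reported behavior, not Lightning's code; here checks_per_epoch = 3 stands in for val_check_interval = 0.3, and epochs are 0-indexed):

```python
def reloads_per_epoch(n_epochs, checks_per_epoch, max_epochs=8):
    """One entry per epoch: how many times the val dataloader is reloaded."""
    return [checks_per_epoch if epoch % n_epochs == 0 else 0
            for epoch in range(max_epochs)]

print(reloads_per_epoch(2, 3))  # [3, 0, 3, 0, 3, 0, 3, 0] -> 3 reloads in even epochs, none in odd
print(reloads_per_epoch(1, 3))  # [3, 3, 3, 3, 3, 3, 3, 3] -> reloads at every check
```

The expectation stated above would instead be a single reload per qualifying epoch, i.e. at most one nonzero reload per n_epochs window.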


tchaton commented Dec 7, 2021

Hey @adamviola,

Yes, the val dataloader shouldn't be reloaded in the middle of a training epoch; otherwise the validation metrics won't make sense.

Therefore, reload_dataloaders_every_n_epochs should be entirely independent of the val_check_interval value.

adamviola added commits to adamviola/pytorch-lightning that referenced this issue on Dec 11 and Dec 20, 2021
@awaelchli
Contributor

@adamviola thank you for contributing this fix!

@rohitgr7
Contributor

When reload_dataloaders_every_n_epochs = 2 and check_val_every_n_epoch = 3, the validation dataloader reloads every 6 epochs (should be 3).

Should it reload at epoch 3 or epoch 2? I think it should be just epoch 2, and at epoch 3 it should use the most recently loaded dataloaders.

@rohitgr7 rohitgr7 reopened this Mar 22, 2022
@adamviola
Copy link
Contributor Author

If reload_dataloaders_every_n_epochs < check_val_every_n_epoch, we should reload the validation dataloader every check_val_every_n_epoch epochs to prevent unnecessary reloads of the validation dataloader.

In the case you describe, we would reload the validation dataloader at epochs 0, 2, 4, 6, 8, 10, etc, but we check val at epochs 2, 5, 8, 11, etc. The validation dataloader created at epoch 6 is never used because it is reloaded at epoch 8 before validating at epoch 8.
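The case above can be checked with a short standalone sketch (my own model, not Lightning's code; it assumes 0-indexed epochs, reloads at multiples of reload_every, and validation when (epoch + 1) % check_val_every == 0, matching the epochs 2, 5, 8, 11 mentioned above):

```python
def wasted_reloads(reload_every, check_val_every, max_epochs=12):
    """Reload epochs whose dataloader is replaced before any validation run uses it."""
    reload_eps = [e for e in range(max_epochs) if e % reload_every == 0]
    val_eps = [e for e in range(max_epochs) if (e + 1) % check_val_every == 0]
    # Each validation run uses the most recent reload at or before it.
    used = {max(r for r in reload_eps if r <= v) for v in val_eps}
    return [r for r in reload_eps if r not in used]

print(wasted_reloads(2, 3))  # [0, 6] -> the epoch-6 reload (and the initial epoch-0 load) is never used
```

This confirms that reloading on the check_val_every_n_epoch schedule, when it is the larger of the two, avoids these wasted reloads.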
