🐛[bug] already set min_validation_period, but still got a single validation metric #9863

Closed
scotthuang1989 opened this issue Aug 24, 2024 · 4 comments

@scotthuang1989

Describe the bug

Here is my training configuration:

records_per_epoch: 1000000
min_validation_period:
  batches: 1000

Training ran for 10430 batches, but validation ran only once, at the end of training.

Reproduction Steps

Expected Behavior

Validation should run every 1000 batches.
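
As a rough illustration of the expected schedule, the sketch below lists the batch counts at which validation would run under a fixed period. The function name `expected_validation_batches` is hypothetical, and the assumption that one more validation happens at the very end of training is a simplification, not a description of Determined's actual scheduler.

```python
from typing import List


def expected_validation_batches(total_batches: int, period: int) -> List[int]:
    """Return the batch counts at which a validation would be expected."""
    if period <= 0:
        raise ValueError("period must be positive")
    # One validation at every full multiple of the period.
    points = list(range(period, total_batches + 1, period))
    # Assumption: one more validation at the end of training if the
    # total does not land exactly on a multiple of the period.
    if total_batches > 0 and (not points or points[-1] != total_batches):
        points.append(total_batches)
    return points


points = expected_validation_batches(total_batches=10430, period=1000)
print(len(points), points[0], points[-1])  # 11 1000 10430
```

With the numbers from this report, that is eleven validations rather than the single one observed.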

Screenshot

image

Environment

  • Device or hardware: [e.g. iPhone6, Nvidia A100]
  • OS: [e.g. iOS]
  • Browser [e.g. chrome, safari]
  • Version [0.35.0]

Additional Context

No response

@ioga
Contributor

ioga commented Aug 24, 2024

Hello,

Any chance you can provide repro code? That would be the ideal way to track this down.

Otherwise we can only guess, and we'll need more details for that:

  1. which trial or trainer class are you using (PyTorchTrial, TFKerasTrial, torch trainer, Core API, etc.)
  2. dataloader details

@scotthuang1989
Author

Sorry for the delay. I am unable to reproduce this on a public dataset. I have attached my training log; I hope it helps.
I will keep trying to reproduce it, thanks.
experiment_83_trial_386_logs.txt

@ioga
Contributor

ioga commented Aug 28, 2024

That log is not particularly helpful in answering the questions. With some dataloader setups, the actual dataset size affects the behavior.

The log does contain a weird message, though: [2024-08-28T02:21:48.687318Z] e2563a43 || val dataset is None, use train data

@scotthuang1989
Author

Hi ioga,

I found the cause of this issue.
I use the following code to start the experiments:

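from determined import pytorch

# PredictionTrial, hparams, max_length and latest_checkpoint are defined elsewhere.
with pytorch.init(hparams=hparams) as train_context:
    trial = PredictionTrial(train_context, hparams=hparams)
    trainer = pytorch.Trainer(trial, train_context)
    trainer.fit(max_length=max_length, latest_checkpoint=latest_checkpoint)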

I need to pass the validation period to "trainer.fit" manually; Determined does not apply min_validation_period from the experiment config for me.

thanks.
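
The fix described in the last comment amounts to reading the period from the experiment config and forwarding it to `trainer.fit` by hand. A minimal sketch of the first half, assuming the config is already available as a plain dict (the helper name `validation_period_from_config` is hypothetical and not part of Determined):

```python
from typing import Any, Dict, Optional, Tuple


def validation_period_from_config(config: Dict[str, Any]) -> Optional[Tuple[str, int]]:
    """Return (unit, length) from `min_validation_period`, or None if unset."""
    period = config.get("min_validation_period")
    if not period:
        return None
    if len(period) != 1:
        raise ValueError("expected exactly one unit in min_validation_period")
    # The mapping has a single entry such as {"batches": 1000}.
    unit, length = next(iter(period.items()))
    return unit, int(length)


# Same values as the configuration in the report.
config = {"records_per_epoch": 1000000, "min_validation_period": {"batches": 1000}}
print(validation_period_from_config(config))  # ('batches', 1000)
```

The result could then be converted into a unit object and forwarded, for example `trainer.fit(max_length=max_length, validation_period=pytorch.Batch(1000), latest_checkpoint=latest_checkpoint)`. Whether `fit` accepts a `validation_period` argument in version 0.35.0 is an assumption here and should be confirmed against the Trainer API reference.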
