🐛[bug] already set min_validation_period , but still got a single validation metric #9863

scotthuang1989 · 2024-08-24T05:04:06Z

Describe the bug

here is my training configuration:

records_per_epoch: 1000000
min_validation_period:
  batches: 1000

The training ran for 10430 batches, but validation ran only once, at the end of training.

Reproduction Steps

Expected Behavior

Validation should run every 1000 batches.
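
For reference, a small sketch of the arithmetic behind this expectation. The helper name `expected_validation_batches` is hypothetical and not part of Determined; it only lists the batch counts at which a periodic validation would be expected, and says nothing about any extra validation at the end of training.

```python
def expected_validation_batches(total_batches: int, period: int) -> list:
    """List the batch indices where a validation every `period` batches would fall."""
    # Hypothetical helper for illustration only; not a Determined API.
    return list(range(period, total_batches + 1, period))


# Numbers taken from the report: 10430 batches trained, period of 1000 batches.
points = expected_validation_batches(10430, 1000)
print(len(points), points[0], points[-1])  # prints: 10 1000 10000
```

So with this configuration, roughly ten periodic validations would be expected rather than one.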

Screenshot

Environment

Device or hardware: [e.g. iPhone6, Nvidia A100]
OS: [e.g. iOS]
Browser [e.g. chrome, safari]
Version [0.35.0]

Additional Context

No response

ioga · 2024-08-24T15:28:39Z

hello,

any chance you can provide a repro code? that'd be an ideal way to track this down.

otherwise, we may only guess, and we'll need more details for that:

which trial or trainer class are you using (PyTorchTrial, TFKerasTrial, torch trainer, Core API, etc.)
dataloader details

scotthuang1989 · 2024-08-28T10:05:48Z

Sorry for the delay. I am unable to reproduce this on a public dataset. I have attached my training log; I hope it helps.
I will keep trying to reproduce it. Thanks.
experiment_83_trial_386_logs.txt

ioga · 2024-08-28T16:36:00Z

that log is not particularly helpful in answering the questions. with some dataloader setups, the actual dataset size matters for the behavior.

the log has a weird message though: [2024-08-28T02:21:48.687318Z] e2563a43 || val dataset is None, use train data

scotthuang1989 · 2024-09-02T02:35:16Z

Hi ioga,

I found the cause of this issue.
I use the following code to start the experiment:

with pytorch.init(hparams=hparams) as train_context:
    trial = PredictionTrial(train_context, hparams=hparams)
    trainer = pytorch.Trainer(trial, train_context)
    trainer.fit(max_length=max_length, latest_checkpoint=latest_checkpoint)

I need to pass the validation period to "trainer.fit" manually; Determined will not apply min_validation_period from the experiment configuration for me.

thanks.
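
A minimal sketch of the fix described above, assuming the period is still kept in the experiment configuration and forwarded by hand. The helper `period_from_config` is hypothetical, not a Determined function. The commented lines at the end assume that `trainer.fit` accepts a `validation_period` argument and that `pytorch.Batch` / `pytorch.Epoch` are the unit types; check this against the documentation for your Determined version.

```python
from typing import Any, Dict, Optional, Tuple


def period_from_config(config: Dict[str, Any], key: str) -> Optional[Tuple[str, int]]:
    """Return (unit, length) for a period such as {"batches": 1000}, or None if unset."""
    # Hypothetical helper for illustration only; not a Determined API.
    period = config.get(key)
    if not period:
        return None
    if len(period) != 1:
        raise ValueError(f"{key} must contain exactly one unit, got {period!r}")
    unit, length = next(iter(period.items()))
    if unit not in ("batches", "epochs"):
        raise ValueError(f"unsupported unit for {key}: {unit!r}")
    return unit, int(length)


# The configuration from the original report, as a plain dict.
experiment_config = {
    "records_per_epoch": 1000000,
    "min_validation_period": {"batches": 1000},
}

validation = period_from_config(experiment_config, "min_validation_period")
print(validation)

# With Determined installed, the value would then be forwarded explicitly
# (assumed call shape, shown as comments so this sketch runs without Determined):
#
#   from determined import pytorch
#   unit, length = validation
#   validation_period = pytorch.Batch(length) if unit == "batches" else pytorch.Epoch(length)
#   trainer.fit(
#       max_length=max_length,
#       validation_period=validation_period,
#       latest_checkpoint=latest_checkpoint,
#   )
```

The same pattern would apply to a checkpoint period, if one is wanted.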

scotthuang1989 added the bug label Aug 24, 2024

scotthuang1989 closed this as completed Sep 2, 2024
