Training not storing best model #460

VJatla · 2020-12-18T17:33:05Z

Hello,

I am trying to use mmaction2 to train on my custom dataset. I am able to train i3d, slowfast, slowonly and TSN.

Due to limited hard drive space, I am not storing a checkpoint for every epoch. For example, I write a checkpoint every 3 epochs, but the best model is at epoch 2, so the epoch 2 checkpoint is never created. Is there anything I can do to store the best epoch's checkpoint even though I only write checkpoints every 3 epochs, with checkpoint_config = dict(interval=3)?

Please let me know if this is possible.

irvingzhang0512 · 2020-12-18T18:11:08Z

I don't think mmcv and mmaction2 support this for now, but I implemented a quick version of it (not tested yet); please check here

Don't forget to set best_ckpt_name in the config file, for example:

evaluation = dict(
    interval=5, metrics=['top_k_accuracy', 'mean_class_accuracy'], best_ckpt_name='best.pth')

innerlee · 2020-12-19T00:33:34Z

In the current version, the best one can do is set the save interval to 1 (interval=1) and manually clean up bad checkpoints periodically.

It would be good to implement max_num_of_ckpt_to_keep-like logic.
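The manual cleanup described above could be scripted along the following lines. This is only a sketch: the helper name `prune_checkpoints` is hypothetical, and it assumes checkpoints are written to the work directory as `epoch_<N>.pth`, which should be checked against the actual file names before deleting anything.

```python
import re
from pathlib import Path


def prune_checkpoints(work_dir, keep_epochs):
    """Delete epoch_<N>.pth files whose epoch number is not in keep_epochs.

    Assumes the (unverified) naming scheme epoch_<N>.pth; other files in
    work_dir are left untouched. Returns the names of the deleted files.
    """
    pattern = re.compile(r"epoch_(\d+)\.pth$")
    removed = []
    for path in sorted(Path(work_dir).glob("epoch_*.pth")):
        match = pattern.search(path.name)
        # Skip files that do not match the pattern or that should be kept.
        if match and int(match.group(1)) not in keep_epochs:
            path.unlink()
            removed.append(path.name)
    return removed
```

For example, `prune_checkpoints("work_dirs/my_run", keep_epochs={2})` would keep only the epoch 2 checkpoint; the directory name here is a placeholder.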

VJatla · 2020-12-19T04:51:40Z

Thank you. I will, for now, use the @innerlee solution. I would really like to use the @irvingzhang0512 solution. Can it be tested before I adopt it?

Thank you once again.

irvingzhang0512 · 2020-12-19T05:36:52Z

I will test the related code and create a PR on Monday.

irvingzhang0512 · 2020-12-19T06:10:07Z

By the way, if you want to set a maximum number of checkpoints to keep, you can set max_keep_ckpts, for example, checkpoint_config = dict(interval=1, max_keep_ckpts=10)

VJatla · 2020-12-26T02:12:51Z

Assume I keep a maximum of 10 checkpoints and train for 100 epochs. If I get the best validation accuracy at epoch 20, I don't think the current mmaction2 will keep that epoch's checkpoint.
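The concern can be illustrated with a small simulation. It assumes that max_keep_ckpts keeps only the most recently written checkpoints, which is how the option is described in this thread; the function `surviving_epochs` is hypothetical and does not call any mmcv or mmaction2 code.

```python
from collections import deque


def surviving_epochs(total_epochs, interval=1, max_keep=10):
    """Return the epochs whose checkpoints remain after training.

    Models retention as a rolling window: a checkpoint is written every
    `interval` epochs and only the newest `max_keep` are kept.
    """
    kept = deque(maxlen=max_keep)  # the oldest entry is dropped automatically
    for epoch in range(1, total_epochs + 1):
        if epoch % interval == 0:
            kept.append(epoch)
    return list(kept)


remaining = surviving_epochs(total_epochs=100, interval=1, max_keep=10)
print(remaining)        # epochs 91 through 100
print(20 in remaining)  # False: the epoch 20 checkpoint is gone
```

Under this assumption, a best epoch that falls outside the final window is lost unless it is saved separately.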

I will go with @innerlee's approach and manually delete everything except the best checkpoint for now.

Thank you all. I am closing the issue.

irvingzhang0512 · 2020-12-26T08:22:47Z

@VJatla Actually, I've tested #464 and you can:

step 1: replace mmaction/core/evaluation/eval_hooks.py with the version from #464 ([Improvement] save best ckpt during training).
step 2: add save_best_ckpt to the evaluation config, for example evaluation = dict(interval=5, metrics=['top_k_accuracy', 'mean_class_accuracy'], save_best_ckpt=True)
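The general idea behind a "save best checkpoint" option can be sketched as below. This is not the code from #464: `BestCheckpointTracker` and its arguments are hypothetical, and the sketch copies an existing checkpoint file, whereas a real evaluation hook would more likely save the model state directly when the metric improves.

```python
import shutil
from pathlib import Path


class BestCheckpointTracker:
    """Keep a copy of the checkpoint with the highest evaluation score."""

    def __init__(self, work_dir, best_name="best.pth"):
        self.best_path = Path(work_dir) / best_name
        self.best_score = float("-inf")
        self.best_epoch = None

    def update(self, epoch, score, checkpoint_path):
        """Copy checkpoint_path to best_path if score beats the best so far.

        Returns True when the best checkpoint was replaced.
        """
        if score <= self.best_score:
            return False
        shutil.copyfile(checkpoint_path, self.best_path)
        self.best_score = score
        self.best_epoch = epoch
        return True
```

Because the best copy is stored under its own file name, it is independent of the regular checkpoint interval and of any limit on the number of periodic checkpoints.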

#464 is closed because the eval hook will be refactored by #395. Hopefully #395 will fix your issue.

irvingzhang0512 mentioned this issue Dec 19, 2020

[Improvement] save best ckpt during training #464

Closed

4 tasks

VJatla closed this as completed Dec 26, 2020
