Training not storing best model #460

Closed
VJatla opened this issue Dec 18, 2020 · 7 comments

Comments

@VJatla

VJatla commented Dec 18, 2020

Hello,

I am trying to use mmaction2 to train on my custom dataset. I am able to train i3d, slowfast, slowonly and TSN.

Due to limited hard drive space, I am not storing a checkpoint for every epoch. For example, I write a checkpoint every 3 epochs, but the best model is at epoch 2, so the epoch 2 checkpoint is never created. Is there anything I can do to store the best epoch's checkpoint even though I only write checkpoints every 3 epochs, i.e. with checkpoint_config = dict(interval=3)?

Please let me know if this is possible.

@irvingzhang0512
Contributor

irvingzhang0512 commented Dec 18, 2020

I don't think mmcv and mmaction2 support this for now, but I have implemented a quick version (not tested yet); please check here.

Don't forget to set best_ckpt_name in the config file, for example:

evaluation = dict(
    interval=5, metrics=['top_k_accuracy', 'mean_class_accuracy'], best_ckpt_name='best.pth')

@innerlee
Contributor

In the current version, the best one can do is save checkpoints with interval=1 and manually clean up the bad checkpoints periodically.

It would be good to implement max_num_of_ckpt_to_keep-like logic.
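A minimal sketch of the manual cleanup described above, not part of mmaction2 or mmcv. It assumes checkpoints are written as epoch_<N>.pth in the work directory; the function name prune_checkpoints and its parameters are hypothetical. It defaults to a dry run so nothing is deleted until the list of epochs has been checked.

```python
import re
from pathlib import Path

# Assumption: checkpoints are named epoch_<N>.pth inside the work directory.
_EPOCH_RE = re.compile(r"^epoch_(\d+)\.pth$")


def prune_checkpoints(work_dir, keep_epochs, dry_run=True):
    """Delete epoch_<N>.pth files whose epoch is not in keep_epochs.

    Returns the sorted epoch numbers that were (or, in a dry run, would be)
    removed. Files that do not match the naming pattern are left untouched.
    """
    keep = set(keep_epochs)
    removed = []
    for path in Path(work_dir).iterdir():
        match = _EPOCH_RE.match(path.name)
        if match is None:
            continue  # not an epoch checkpoint: logs, configs, and so on
        epoch = int(match.group(1))
        if epoch in keep:
            continue
        removed.append(epoch)
        if not dry_run:
            path.unlink()
    return sorted(removed)
```

Calling prune_checkpoints(work_dir, keep_epochs=[2]) first and inspecting the returned list, then repeating with dry_run=False, keeps only the epoch 2 checkpoint. Which epoch is best still has to be read from the evaluation log by hand.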

@VJatla
Author

VJatla commented Dec 19, 2020

Thank you. For now, I will use the @innerlee solution. I would really like to use the @irvingzhang0512 solution. Can it be tested before I adopt it?

Thank you once again.

@irvingzhang0512
Contributor

irvingzhang0512 commented Dec 19, 2020

I will test the related code and create a PR on Monday.

@irvingzhang0512
Contributor

By the way, if you want to cap the number of checkpoints that are kept, you can set max_keep_ckpts, for example checkpoint_config = dict(interval=1, max_keep_ckpts=10).

@VJatla
Author

VJatla commented Dec 26, 2020

Suppose I set the maximum number of checkpoints to 10 and train for 100 epochs. If I get the best validation accuracy at epoch 20, I don't think the current mmaction2 will keep that epoch's checkpoint.

I will go with innerlee's suggestion and, for now, manually delete everything except the best checkpoint.
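The concern above can be illustrated with a small simulation. It is only a sketch and assumes that max_keep_ckpts retains the most recent checkpoints and discards older ones; the function surviving_epochs is hypothetical and does not call mmcv.

```python
from collections import deque


def surviving_epochs(total_epochs, interval=1, max_keep_ckpts=10):
    """Simulate a 'keep only the most recent N checkpoints' policy."""
    # A bounded deque drops its oldest entry once it is full.
    kept = deque(maxlen=max_keep_ckpts)
    for epoch in range(1, total_epochs + 1):
        if epoch % interval == 0:  # a checkpoint is written this epoch
            kept.append(epoch)
    return list(kept)


remaining = surviving_epochs(total_epochs=100, interval=1, max_keep_ckpts=10)
print(remaining)        # [91, 92, 93, 94, 95, 96, 97, 98, 99, 100]
print(20 in remaining)  # False
```

Under that assumption, a best result at epoch 20 is gone by the end of training, which is why a recency-based cap alone does not solve the original problem.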

Thank you all. I am closing the issue.

@VJatla VJatla closed this as completed Dec 26, 2020
@irvingzhang0512
Contributor

@VJatla Actually, I've tested #464, and you can:

  • step 1: replace mmaction/core/evaluation/eval_hooks.py with the version from #464 ([Improvement] save best ckpt during training).
  • step 2: add save_best_ckpt to the evaluation config, for example evaluation = dict(interval=5, metrics=['top_k_accuracy', 'mean_class_accuracy'], save_best_ckpt=True)

#464 was closed because the eval hook will be refactored in #395. Hopefully #395 will fix your issue.
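For readers who want the general idea behind a save-best option, here is a rough sketch. It is not the code from #464 or #395, and every name in it (BestCheckpointSaver, save_fn, after_eval) is hypothetical. It only shows the core logic: after each evaluation, compare the metric with the best seen so far and save when it improves, independently of the regular checkpoint interval.

```python
class BestCheckpointSaver:
    """Track the best evaluation score and trigger a save on improvement."""

    def __init__(self, save_fn, greater_is_better=True):
        self.save_fn = save_fn  # callable invoked with the epoch number
        self.greater_is_better = greater_is_better
        self.best_score = None
        self.best_epoch = None

    def _is_better(self, score):
        if self.best_score is None:
            return True  # the first evaluation is always the best so far
        if self.greater_is_better:
            return score > self.best_score
        return score < self.best_score

    def after_eval(self, epoch, score):
        """Call once per evaluation; returns True if a save was triggered."""
        if not self._is_better(score):
            return False
        self.best_score = score
        self.best_epoch = epoch
        self.save_fn(epoch)
        return True


# Usage with a stand-in save function that only records the epoch.
saved = []
saver = BestCheckpointSaver(save_fn=saved.append)
for epoch, accuracy in enumerate([0.41, 0.63, 0.58, 0.60], start=1):
    saver.after_eval(epoch, accuracy)

print(saver.best_epoch)  # 2
print(saved)             # [1, 2]
```

In a real training loop, save_fn would overwrite a single file such as best.pth, so disk usage stays constant no matter how often the score improves.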
