What hyperparams do I need to tune when I want to continue a previous training? #9257
@haimat 👋 Hello! Thanks for asking about resuming training. YOLOv5 🚀 Learning Rate (LR) schedulers follow predefined LR curves for the fixed number of epochs defined at training start, so you may not change the epoch count once training has begun.

If your training was interrupted for any reason you may continue where you left off using the `--resume` argument. If your training fully completed, you can start a new training from any model using the `--weights` argument. Your options are:

Resume Single-GPU
You may not change settings when resuming, and no additional arguments other than `--resume` may be passed:

```bash
python train.py --resume                  # automatically find latest checkpoint (searches yolov5/ directory)
python train.py --resume path/to/last.pt  # specify resume checkpoint
```

Resume Multi-GPU
Multi-GPU DDP trainings must be resumed with the same GPUs and DDP command, i.e. assuming 8 GPUs:

```bash
python -m torch.distributed.run --nproc_per_node 8 train.py --resume                  # resume latest checkpoint
python -m torch.distributed.run --nproc_per_node 8 train.py --resume path/to/last.pt  # specify resume checkpoint
```

Start from Pretrained
If you would like to start training from a fully trained model, use the `--weights` argument:

```bash
python train.py --weights path/to/best.pt  # start from pretrained model
```

Good luck 🍀 and let us know if you have any other questions!
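For context, a minimal sketch of what a `last.pt` checkpoint contains and why `--resume` can pick the schedule back up; the key names are from recent YOLOv5 versions and may differ in yours:

```python
import torch

# Minimal sketch: inspect a YOLOv5 checkpoint to see the training state
# that --resume restores. Run from inside the yolov5 repo so the pickled
# model classes can be resolved; "path/to/last.pt" is a placeholder.
ckpt = torch.load("path/to/last.pt", map_location="cpu")

print(ckpt.get("epoch"))         # last completed epoch (-1 once training finished)
print(ckpt.get("best_fitness"))  # best fitness metric seen so far
print("optimizer" in ckpt)       # optimizer state, needed to continue the LR schedule
```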
@glenn-jocher Thanks, but this does not answer my question. I know about what you wrote, but what I don't know is how exactly the hyperparams influence the LR. As described in my use case, my question is: what hyperparams do I need to modify, and in which way, if I want to do a 2nd training using the `best.pt` from a previous run?
@haimat you don't need to modify anything; you can start a second training on any dataset from previously trained weights on any other dataset. You can choose to experiment with hyperparameter variations, but of course I can't advise on this — the experimentation is on you. If you want an automated way of evolving hyperparameters, see our Hyperparameter Evolution tutorial. If you're just asking how to modify the LR, these values are in `yolov5/data/hyps/hyp.scratch-low.yaml`, lines 6 to 7 (commit 63ecce6).
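For reference, the two lines referenced above define the initial and final LR in the default hyperparameter file (values as of that commit; check your local copy, since defaults can change between releases):

```yaml
lr0: 0.01  # initial learning rate (SGD=1E-2, Adam=1E-3)
lrf: 0.01  # final OneCycleLR learning rate (lr0 * lrf)
```

To continue from `best.pt` with a gentler LR, one option is to copy the full hyp file, lower `lr0` (e.g. to 0.001), and pass the copy via `--hyp`, e.g. `python train.py --weights path/to/best.pt --hyp path/to/hyp.custom.yaml`; the custom filename here is a placeholder, and the copy must keep all the other keys intact.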
Good luck 🍀 and let us know if you have any other questions!
@glenn-jocher Hi Glenn, thanks for your response. In particular I would be interested to know how the first few parameters in that file influence training. I see their comments, but they are very brief. Is there some more documentation on them anywhere?
👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!
@haimat Yes, the hyperparameters you provided play crucial roles in the training process. Briefly, they control the optimizer's learning-rate schedule, momentum, and warmup behavior. For more details and advanced guidance on these hyperparameters and their effects on training, you can refer to our documentation for YOLOv5. I hope this provides a clearer understanding of how these hyperparameters influence the training process. Let me know if you have any more questions!
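To make the LR discussion concrete, here is a minimal sketch of the cosine schedule YOLOv5 builds from `lr0` and `lrf` (mirroring the `one_cycle()` helper in `utils/general.py`; the exact form may differ between versions). The LR starts at `lr0` and decays to `lr0 * lrf` on the final epoch, which is why restarting a finished run begins the curve again at the top:

```python
import math

# Minimal sketch of YOLOv5's cosine ("one cycle") LR curve, assuming
# lf(x) = ((1 - cos(x * pi / epochs)) / 2) * (lrf - 1) + 1 as in
# utils/general.py. Values below are the hyp.scratch-low.yaml defaults.
lr0, lrf, epochs = 0.01, 0.01, 500

def lr_at(epoch: int) -> float:
    """LR at a given epoch: cosine decay from lr0 down to lr0 * lrf."""
    return lr0 * (((1 - math.cos(epoch * math.pi / epochs)) / 2) * (lrf - 1) + 1)

for e in (0, 100, 250, 499):
    print(f"epoch {e:3d}: lr = {lr_at(e):.5f}")
```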
Search before asking
Question
Sometimes I want to continue training using the `best.pt` model from a previous YOLOv5 training run. However, every time I do so, after only 2 or 3 epochs in the new training the model performance drops quite a bit, often nearly down to below 0.1, even though it had been 0.5 in `best.pt` from the previous training.

I assume that is because the learning rate is too high. But that way I lose nearly all the training work stored in `best.pt`, which is obviously not what I want. So I guess I need to tweak the hyperparams for the second training.

Could you please advise what hyperparams in particular I would need to tweak, and in which direction (up or down), when I want to fine-tune a model, i.e. continue from the `best.pt` file from a previous training session?

Additional
As an example, let's have a look at the following training performance, showing the mAP value of my model during 500 epochs:

[training curve image: mAP over 500 epochs]

Looking at the linear trend line, it seems the mAP performance of this model can be improved even further, let's say for another 500 training epochs. However, every time I continue training from `best.pt` of the training shown above, within the first 3-5 epochs or so mAP drops down to 0.05 or so, and then it takes some hundreds more epochs to climb back up. In the end, after 500 training epochs, I am close to where I was at the end of the first training.

Thus I am basically starting again from the start and losing many, many training epochs. So how can I start from the good mAP value of the first training run and continue from there?