Fix learning rate gap on resume #9468
Conversation
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #9468       +/-   ##
===========================================
+ Coverage   37.99%   76.67%    +38.67%
===========================================
  Files         121      120         -1
  Lines       15277    15175       -102
===========================================
+ Hits         5805    11635      +5830
+ Misses       9472     3540      -5932

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
@glenn-jocher this PR fixes the LR issue when resuming. I tested training locally, interrupting it at epoch 5/10 and then resuming. I also tested timed training with:

```bash
yolo detect train data=runs/data/coco.yaml time=0.01
yolo detect train data=runs/data/coco.yaml time=0.02
yolo detect train data=runs/data/coco.yaml time=0.05
```

For quick verification, I manually set the train set to …
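For reference, the equivalent resume flow in the Python API looks roughly like this (the checkpoint path is illustrative and depends on your run directory):

```python
from ultralytics import YOLO

# Load the last checkpoint of the interrupted run and resume training
model = YOLO("runs/detect/train/weights/last.pt")
model.train(resume=True)
```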
@glenn-jocher please take a look and check whether I broke anything. Thanks!
@Laughing-q it would be super nice if we could get rid of all that timed-training logic, but I think I had to include it for a reason. One test for timed training is that we need to run with both less time and more time than the default epoch count allows. I'll run some tests on this.
Oh, I think I remember now: we need to update the LR scheduler to hit lrf on … If we don't update the LR scheduler, timed training won't be optimal, as the scheduler will remain fixed to hit lrf on the default … So timed training does 3 things at the end of each epoch: …
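To make the scheduler update concrete, here is a minimal sketch of the idea (a hypothetical `rebuild_scheduler` helper, not the trainer's actual code; the linear lambda mirrors the Ultralytics default, and `epochs`/`lrf` follow the Ultralytics argument names):

```python
from torch.optim.lr_scheduler import LambdaLR

def rebuild_scheduler(optimizer, epochs, lrf, current_epoch):
    # Linear decay from 1.0 at epoch 0 down to lrf at the final epoch.
    # When timed training re-estimates `epochs`, rebuilding with the new
    # value keeps the schedule on track to hit lrf at the new end point.
    lf = lambda x: max(1 - x / epochs, 0) * (1.0 - lrf) + lrf
    scheduler = LambdaLR(optimizer, lr_lambda=lf)
    scheduler.last_epoch = current_epoch - 1  # realign with training progress
    return scheduler
```

Calling something like this whenever the time-based epoch estimate changes would keep the final LR consistent regardless of how the time budget shifts.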
Signed-off-by: Glenn Jocher <[email protected]>
@Laughing-q ok, I moved the scheduler.step() line to the beginning of the train loop instead of the end. The new LR will always be one step ahead of the previous one, but I think this is fine. Resume looks good now. I Ctrl+C'd a run 3 times and resumed 3 times here: [screenshots of the three interrupted and resumed training runs] What do you think?
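The reordering is easiest to see in a stripped-down skeleton (illustrative only, not the actual diff; the optimizer, lambda, and epoch numbers are stand-ins):

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR

param = torch.zeros(1, requires_grad=True)
optimizer = SGD([param], lr=0.01)
scheduler = LambdaLR(optimizer, lr_lambda=lambda e: max(1 - e / 10, 0) * 0.99 + 0.01)

start_epoch, epochs = 5, 10             # pretend we resumed at epoch 5
scheduler.last_epoch = start_epoch - 1  # counter restored from the checkpoint

for epoch in range(start_epoch, epochs):
    scheduler.step()  # moved from the end of the loop to the beginning
    # ... one epoch of forward/backward/optimizer.step() would run here ...
    print(epoch, optimizer.param_groups[0]["lr"])
```

Because the step happens before the epoch trains, a resumed run derives its LR directly from the restored counter, leaving no one-epoch gap.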
@glenn-jocher Looks good to me!
Nice!! |
Co-authored-by: Lakshantha Dissanayake <[email protected]>
Co-authored-by: RizwanMunawar <[email protected]>
Co-authored-by: Glenn Jocher <[email protected]>
Co-authored-by: UltralyticsAssistant <[email protected]>
Co-authored-by: gs80140 <[email protected]>
ultralytics 8.1.42
attempt to fix lr issue when resuming
ultralytics 8.1.42
learning-rate resume fix
@Laughing-q PR merged!
Signed-off-by: Glenn Jocher <[email protected]>
Co-authored-by: UltralyticsAssistant <[email protected]>
Co-authored-by: Glenn Jocher <[email protected]>
Co-authored-by: EunChan Kim <[email protected]>
Co-authored-by: Lakshantha Dissanayake <[email protected]>
Co-authored-by: RizwanMunawar <[email protected]>
Co-authored-by: gs80140 <[email protected]>
🛠️ PR Summary
Made with ❤️ by Ultralytics Actions

🌟 Summary
Improvements in training scheduling and more informative logging.

📊 Key Changes
EarlyStopping: updated stop message (…) for better clarity on training halts due to lack of improvement.

🎯 Purpose & Impact
These changes aim to simplify the training loop for better performance, more precise time management, and clearer communication of important training events, all of which contribute to a more efficient and user-friendly training experience.
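For illustration, an early-stopping check that logs this kind of informative message might look like the following simplified sketch (not the library's exact implementation):

```python
class EarlyStopping:
    """Stop training when fitness has not improved for `patience` epochs."""

    def __init__(self, patience=50):
        self.patience = patience
        self.best_fitness = 0.0
        self.best_epoch = 0

    def __call__(self, epoch, fitness):
        # Track the best fitness seen so far and the epoch it occurred at
        if fitness >= self.best_fitness:
            self.best_fitness, self.best_epoch = fitness, epoch
        stop = (epoch - self.best_epoch) >= self.patience
        if stop:
            print(
                f"Stopping training early: no improvement observed in the last "
                f"{self.patience} epochs. Best results were at epoch {self.best_epoch}."
            )
        return stop
```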