
[Enhance] Learning rate in log can show the base learning rate of optimizer #1019

Merged
merged 33 commits into from
Jun 8, 2023

Conversation

AkideLiu
Contributor

@AkideLiu AkideLiu commented Mar 26, 2023

Fix for issue #482

This is Akide. We are a course group of five members (COMP SCI 4023 - Software Process Improvement) from The University of Adelaide, aiming to complete this issue and contribute. We plan to start work on it this week. If you have any further ideas, please don't hesitate to contact us.

Motivation

For example, the config of optim_wrapper is:

optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='AdamW', lr=0.0002, weight_decay=0.0001),
    clip_grad=dict(max_norm=0.1, norm_type=2),
    paramwise_cfg=dict(
        custom_keys={
            'backbone': dict(lr_mult=0.1),
            'sampling_offsets': dict(lr_mult=0.1),
            'reference_points': dict(lr_mult=0.1)
        }))

The log records the learning rate of model.backbone (2e-5) rather than the base learning rate of the optimizer (2e-4).
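
To make the behaviour concrete, here is a hypothetical minimal reproduction in plain PyTorch (the module names, shapes, and group order are illustrative assumptions on our part, not code from this PR):

import torch
from torch import nn

# Toy model: 'backbone' gets lr_mult=0.1, 'head' keeps the base lr.
model = nn.ModuleDict({'backbone': nn.Linear(4, 4), 'head': nn.Linear(4, 2)})
optimizer = torch.optim.AdamW(
    [
        {'params': model['backbone'].parameters(), 'lr': 0.0002 * 0.1},  # scaled lr (2e-5)
        {'params': model['head'].parameters()},  # falls back to the base lr (2e-4)
    ],
    lr=0.0002,
    weight_decay=0.0001)

lr_list = [group['lr'] for group in optimizer.param_groups]  # what get_lr() collects
print(lr_list)  # [2e-05, 0.0002] -> a logger that reads lr_list[0] reports 2e-05, not 2e-04

Because the scaled group happens to come first in param_groups, any consumer that simply takes the first entry ends up logging the scaled value rather than the base learning rate.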

Modification

We have conducted an initial investigation and identified the following logic that may be related to the issue at hand:

  1. Printing of logs is primarily managed by runner.message_hub.update_scalar.
  2. Logging scalars are exported to message_hub in RuntimeInfoHook::before_train_iter.
  3. The learning rate is retrieved from runner.optim_wrapper.get_lr().
  4. In OptimWrapper::get_lr, an unsorted list of learning rates is created from the optimizer parameter groups.

The core issue can potentially be resolved by sorting the learning rate list according to paramwise_cfg. At this stage, we plan to create a preliminary pull request that sorts the learning rate list without any conditional checks. We believe it is essential to define rules for automatically identifying the base learning rate among the optimizer parameter groups.
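
For reference, here is a simplified sketch of the logging path described in points 1-4 above; it reflects our understanding of the usual mmengine hook behaviour and is not the verbatim source:

class RuntimeInfoHook:
    def before_train_iter(self, runner, batch_idx, data_batch=None):
        # Sketch only: collect the learning rates from the optimizer wrapper ...
        lr_dict = runner.optim_wrapper.get_lr()  # e.g. {'lr': [2e-05, ..., 2e-04]}
        for name, lr in lr_dict.items():
            # ... and export only the first entry of the (unsorted) list to the log.
            runner.message_hub.update_scalar(f'train/{name}', lr[0])

The preliminary change to OptimWrapper.get_lr is shown below: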

    def get_lr(self) -> Dict[str, List[float]]:
        """Get the learning rate of the optimizer.

        Provide unified interface to get learning rate of optimizer.

        Returns:
            Dict[str, List[float]]: Learning rate of the optimizer.
        """
        lr = [group['lr'] for group in self.param_groups]
        lr.sort(reverse=True)  # put the largest learning rate (the base lr when all lr_mult <= 1) first
        return dict(lr=lr)
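
With this change, in the hypothetical two-group example above, get_lr() would return the base learning rate (2e-4) as the first entry of the list, which is then the value that appears in the log.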

The output shows that:

03/26 15:31:49 - mmengine - INFO - Checkpoints will be saved to /home/akide/pychram_remote/mmengine-dev/examples/work_dir.
03/26 15:31:50 - mmengine - INFO - Epoch(train) [1][  10/1563]  lr: 1.0000e-03  eta: 0:06:40  time: 0.1285  data_time: 0.0032  memory: 369  loss: 5.4107
03/26 15:31:50 - mmengine - INFO - Epoch(train) [1][  20/1563]  lr: 1.0000e-03  eta: 0:03:43  time: 0.0152  data_time: 0.0029  memory: 369  loss: 3.0671
03/26 15:31:51 - mmengine - INFO - Epoch(train) [1][  30/1563]  lr: 1.0000e-03  eta: 0:02:43  time: 0.0150  data_time: 0.0033  memory: 369  loss: 2.6957

Please let us know if you have any concerns or suggestions. We appreciate your input and look forward to collaborating on this matter.

BC-breaking (Optional)

Does the modification introduce changes that break the backward-compatibility of the downstream repos?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit test to ensure the correctness.
  3. If the modification has potential influence on downstream projects, this PR should be tested with downstream projects, like MMDet or MMCls.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

@AkideLiu AkideLiu requested a review from HAOCHENYE as a code owner March 26, 2023 05:07
@CLAassistant

CLAassistant commented Mar 26, 2023

CLA assistant check
All committers have signed the CLA.

@codecov

codecov bot commented Mar 27, 2023

Codecov Report

Attention: Patch coverage is 66.66667% with 11 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@eb2dc67). Learn more about missing BASE report.

Files with missing lines                             Patch %   Lines
mmengine/optim/optimizer/optimizer_wrapper.py         70.00%   3 Missing and 3 partials ⚠️
mmengine/optim/optimizer/amp_optimizer_wrapper.py      0.00%   3 Missing ⚠️
mmengine/optim/optimizer/optimizer_wrapper_dict.py    50.00%   1 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1019   +/-   ##
=======================================
  Coverage        ?   77.72%           
=======================================
  Files           ?      139           
  Lines           ?    11500           
  Branches        ?     2333           
=======================================
  Hits            ?     8938           
  Misses          ?     2151           
  Partials        ?      411           
Flag        Coverage Δ
unittests   77.72% <66.66%> (?)


@AkideLiu AkideLiu requested a review from zhouzaida as a code owner April 3, 2023 17:26
@AkideLiu
Contributor Author

AkideLiu commented Apr 3, 2023

@HAOCHENYE , I have implemented an update to the optimizer parameter group to include a new parameter. However, it appears that this update has a significant impact on existing unit tests, particularly in the tests/test_optim/test_scheduler module. I am currently working on fixing these affected test cases. Could you please assist me with this task?

============================================= short test summary info =============================================
FAILED tests/test_optim/test_scheduler/test_param_scheduler.py::TestParameterSchedulerOptimWrapper::test_get_last_value - AssertionError: LR is wrong in epoch 0: expected 0.05, got 0.05
FAILED tests/test_optim/test_scheduler/test_param_scheduler.py::TestParameterSchedulerOptimWrapper::test_reduce_on_plateau_scheduler - ValueError: expected 3 min_lrs, got 2
============================= 2 failed, 202 passed, 8 skipped, 80 warnings in 11.48s ==============================

@AkideLiu AkideLiu requested a review from RangiLyu as a code owner June 2, 2023 14:39
@AkideLiu
Contributor Author

AkideLiu commented Jun 2, 2023

Hi @HAOCHENYE, we have applied a fix that passes all tests locally. Could you please review this PR?

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================ 912 passed, 88 skipped, 670 warnings in 155.40s (0:02:35) ============================

@AkideLiu
Contributor Author

AkideLiu commented Jun 4, 2023

Hi @HAOCHENYE,

We have conducted distributed training to assess whether any backward compatibility issues arise from this pull request (PR). Fortunately, we have found no breaking changes that would affect downstream repositories. We tested the current PR using MMSeg with 4 GPUs, and the attached logs provide more details about the testing process.

20230604_171640.log

@AkideLiu AkideLiu requested a review from HAOCHENYE June 5, 2023 15:36
@AkideLiu
Contributor Author

AkideLiu commented Jun 6, 2023

Hi @HAOCHENYE, I have merged your refine branch into this PR. Please let me know if there is anything else we can improve.

@zhouzaida
Collaborator

Hi @AkideLiu, should we also update the logic for getting the lr in

lr_dict[f'{name}.lr'] = optim_wrapper.get_lr()['lr']

@AkideLiu
Contributor Author

AkideLiu commented Jun 8, 2023

@zhouzaida Sure, I have updated the logic of get_lr in optimizer_wrapper_dict.
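
For illustration only, one possible shape of such an update, prefixing each wrapper's lr keys with its name in line with the snippet quoted above; this is an assumption on our part, not necessarily the code that was merged:

class OptimWrapperDictSketch:
    # Hypothetical stand-in for mmengine's OptimWrapperDict, kept minimal for illustration.
    def __init__(self, **optim_wrappers):
        self.optim_wrappers = optim_wrappers  # name -> OptimWrapper-like object

    def get_lr(self):
        lr = dict()
        for name, optim_wrapper in self.optim_wrappers.items():
            inner = optim_wrapper.get_lr()  # each wrapper returns a dict like {'lr': [...]}
            # Prefix every key with the wrapper name, e.g. 'generator.lr'.
            lr.update({f'{name}.{k}': v for k, v in inner.items()})
        return lr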

@AkideLiu AkideLiu requested a review from zhouzaida June 8, 2023 07:46
@zhouzaida zhouzaida added this to the 0.8.0 milestone Jun 8, 2023
@AkideLiu AkideLiu requested a review from zhouzaida June 8, 2023 09:28
zhouzaida
zhouzaida previously approved these changes Jun 8, 2023
@zhouzaida
Collaborator

Hi @AkideLiu, thanks for your contribution. This PR will be merged after passing the CI.

@zhouzaida zhouzaida changed the title [Fix][Bug] Learning rate in log will show the first params_group in optimizer, rather than the base learning rate of optimizer [Enhance] Learning rate in log can show the base learning rate of optimizer Jun 8, 2023
@zhouzaida zhouzaida merged commit 94e7a3b into open-mmlab:main Jun 8, 2023
@AkideLiu
Contributor Author

AkideLiu commented Jun 8, 2023

Hi @HAOCHENYE @zhouzaida, we appreciate your efforts and support during the development of this PR, and we hope the mmengine project has a great future.
