[Fix] Fix wandb logger drop result bug #913

shenmishajing · 2021-03-29T07:30:50Z

You can find details at #911.

CLAassistant · 2021-03-29T07:30:54Z

All committers have signed the CLA.

zhouzaida · 2021-03-29T08:18:00Z

Advice:
fix wandb logger drop result bug by delete step param
->
[Fix] Fix wandb logger drop result bug

zhouzaida · 2021-03-29T12:22:58Z

You can find details at #911.

Although it works, the step no longer equals self._iter.

xvjiarui · 2021-03-29T16:26:56Z

Hi @shenmishajing
Will the commit argument do the trick?

zhouzaida · 2021-03-30T02:40:29Z

Hi @shenmishajing
Will the commit argument do the trick?

"You can also set commit=False in wandb.log to accumulate metrics, just be sure to call wandb.log without the commit flag to persist the metrics."

zhouzaida · 2021-03-30T02:47:44Z

For example, if you have training and validation steps you'd like to align, pass us your own step counter: wandb.log({"acc":1, "global_step":1}). Then in the graphs choose "global_step" as the x-axis.

    @master_only
    def log(self, runner):
        tags = self.get_loggable_tags(runner)
        # if tags:
        #     self.wandb.log(
        #         tags, step=self.get_iter(runner), commit=self.commit)
        if tags:
            tags['global_step'] = self.get_iter(runner)
            self.wandb.log(tags, commit=self.commit)

shenmishajing · 2021-03-30T02:54:10Z

You can find details at #911.

Although it works, the step no longer equals self._iter.

Does it matter? I didn't pay attention to the meaning of the step variable when I used wandb.
Also, the step variable minus self._iter always equals the number of val epochs completed so far. So, I think we can ignore this.
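The drift described here can be illustrated with a small simulation. This is only a sketch under stated assumptions: `AutoStepLog` is a hypothetical stand-in that numbers each committed call itself (it is not the wandb API), and `simulate` is an illustrative helper that logs once per training iteration and once after each validation epoch.

```python
class AutoStepLog:
    """Hypothetical stand-in for a backend that numbers committed calls."""

    def __init__(self):
        self.step = 0
        self.rows = []

    def log(self, data):
        # Record the payload under the current internal step, then advance.
        self.rows.append((self.step, dict(data)))
        self.step += 1


def simulate(epochs, iters_per_epoch):
    """Return (internal step - iteration counter) after each val epoch."""
    backend = AutoStepLog()
    it = 0  # plays the role of the runner's self._iter
    drifts = []
    for _ in range(epochs):
        for _ in range(iters_per_epoch):
            backend.log({"loss": 0.0, "global_step": it})
            it += 1
        # One extra log call per validation epoch; the iteration is unchanged.
        backend.log({"val_acc": 0.0, "global_step": it})
        drifts.append(backend.step - it)
    return drifts


print(simulate(epochs=3, iters_per_epoch=4))  # [1, 2, 3]
```

Under these assumptions the gap grows by exactly one per validation epoch, while the `global_step` value in each payload still carries the true iteration.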

Hi @shenmishajing
Will the commit argument do the trick?

"You can also set commit=False in wandb.log to accumulate metrics, just be sure to call wandb.log without the commit flag to persist the metrics."

Yep, if we use the commit argument to do the trick, we may need to make some changes like the following.
In mmcv/runner/hooks/logger/wandb.py,
change the log function from:

    @master_only
    def log(self, runner):
        tags = self.get_loggable_tags(runner)
        if tags:
            self.wandb.log(
                tags, step=self.get_iter(runner), commit=self.commit)

change to

    @master_only
    def log(self, runner, commit=True):
        tags = self.get_loggable_tags(runner)
        if tags:
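            # Parenthesize the offset: without parentheses the expression
            # parses as (get_iter + 0) if commit else 1.
            step = self.get_iter(runner) + (0 if commit else 1)
            self.wandb.log(tags, step=step, commit=commit)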

And we have to rewrite the after_val_epoch function as follows:

    def after_val_epoch(self, runner):
        runner.log_buffer.average()
        self.log(runner, commit=False)
        if self.reset_flag:
            runner.log_buffer.clear_output()

Well, I think it is a little bit ugly. What do you think? We can use it if you guys can accept the magic 0-or-1 offset on step.
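One detail worth checking in the step offset is operator precedence: in Python a conditional expression binds more loosely than `+`, so the offset only behaves as a 0-or-1 adjustment when it is parenthesized. The sketch below uses two hypothetical helper functions (not part of mmcv) to show the difference.

```python
def step_unparenthesized(it, commit):
    # Parsed as: (it + 0) if commit else 1
    return it + 0 if commit else 1


def step_parenthesized(it, commit):
    # Only the offset is conditional: it + 0 or it + 1
    return it + (0 if commit else 1)


print(step_unparenthesized(6, True), step_unparenthesized(6, False))  # 6 1
print(step_parenthesized(6, True), step_parenthesized(6, False))      # 6 7
```

With `commit=False` the unparenthesized form yields a constant 1 regardless of the iteration, which is presumably not what a step offset is meant to do.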

zhouzaida · 2021-03-30T02:58:55Z

@shenmishajing

    @master_only
    def log(self, runner):
        tags = self.get_loggable_tags(runner)
        # if tags:
        #     self.wandb.log(
        #         tags, step=self.get_iter(runner), commit=self.commit)
        if tags:
            tags['global_step'] = self.get_iter(runner)
            self.wandb.log(tags, commit=self.commit)

We can choose "global_step" as the x-axis in the graphs or do nothing.
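To make the shape of the resulting calls concrete, here is a minimal sketch that runs without wandb or mmcv installed. `RecordingBackend` and `log_with_global_step` are hypothetical names introduced only for illustration; the backend merely records what it receives.

```python
class RecordingBackend:
    """Hypothetical stand-in for the wandb module; it only records calls."""

    def __init__(self):
        self.calls = []

    def log(self, data, **kwargs):
        self.calls.append((dict(data), kwargs))


def log_with_global_step(backend, tags, current_iter, commit=True):
    """Attach the iteration as an ordinary metric instead of passing step=."""
    if not tags:
        return
    payload = dict(tags)  # copy so the caller's dict is not mutated
    payload["global_step"] = current_iter
    backend.log(payload, commit=commit)


backend = RecordingBackend()
# A training log and a validation log that share the same iteration.
log_with_global_step(backend, {"loss": 0.5}, current_iter=10)
log_with_global_step(backend, {"val_acc": 0.9}, current_iter=10)
for payload, kwargs in backend.calls:
    print(payload, kwargs)
```

Both records carry `global_step` 10, and neither call passes a `step` keyword, so two logs at the same iteration no longer compete for one step value.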

shenmishajing · 2021-03-30T02:59:27Z

I approve this.

xvjiarui · 2021-03-30T03:21:09Z

If the user doesn't use the 'val' pipeline, this PR will change the original behavior.

shenmishajing · 2021-03-30T06:44:45Z

If the user doesn't use the 'val' pipeline, this PR will change the original behavior.

If the user doesn't use the 'val' pipeline, the wandb logger logs once per training iteration, so the wandb step will always equal self._iter. By the way, the wandb step is currently equal to self._iter + 1.
So, I think we can ignore this.
Or, if you insist on this, the solution I mentioned above will not change the original behavior.

xvjiarui · 2021-03-30T22:36:50Z

Hi @shenmishajing
What will happen if we just set commit=True and change nothing else?

shenmishajing · 2021-04-02T01:53:59Z

Do you mean changing the log function as follows?

    @master_only
    def log(self, runner):
        tags = self.get_loggable_tags(runner)
        if tags:
            self.wandb.log(
                tags, step=self.get_iter(runner), commit=True)

In fact, self.commit is True by default. In other words, this does not change anything.

ZwwWayne · 2021-04-07T01:42:52Z

Lint and CI failed. Could @shenmishajing help fix that?

shenmishajing · 2021-04-07T07:52:04Z

Lint and CI failed. Could @shenmishajing help fix that?

I have no idea, in fact. Anyone want to help me? Or where can I find some doc about this?

ZwwWayne · 2021-04-07T15:12:17Z

Lint and CI failed. Could @shenmishajing help fix that?

I have no idea, in fact. Anyone want to help me? Or where can I find some doc about this?

You can see why it fails here. And you can use tools like yapf/flake8 and pre-commit to fix the lint issues.
You can see why the unit tests fail here.

shenmishajing · 2021-04-08T01:31:50Z

Lint and CI failed. Could @shenmishajing help fix that?

I have no idea, in fact. Anyone want to help me? Or where can I find some doc about this?

You can see why it fails here. And you can use tools like yapf/flake8 and pre-commit to fix the lint issues.

You can see why the unit tests fail here.

I can fix the unit test failure.
In tests/test_runner/test_hooks.py, the expected log call is defined as

    hook.wandb.log.assert_called_with({
        'learning_rate': 0.02,
        'momentum': 0.95
    },
                                      step=6,
                                      commit=True)

But we have changed the format of the log call, so I think we have to change it to

    hook.wandb.log.assert_called_with({
        'learning_rate': 0.02,
        'momentum': 0.95,
        'global_step': 6
    },
                                      commit=True)

As for the lint issues, I have no idea at all. It's OK when I run python -m flake8 mmcv locally. I need some help :).
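The effect of the new call shape on the assertion can be checked in isolation with the standard library's `unittest.mock`. This is a sketch only: the bare `MagicMock` below stands in for `hook.wandb`, and the real test in tests/test_runner/test_hooks.py builds its hook and runner differently.

```python
from unittest.mock import MagicMock

wandb = MagicMock()  # stand-in for hook.wandb

# Call shape after the change: the iteration travels inside the payload.
wandb.log({'learning_rate': 0.02, 'momentum': 0.95, 'global_step': 6},
          commit=True)

# The updated expectation matches the call.
wandb.log.assert_called_with(
    {'learning_rate': 0.02, 'momentum': 0.95, 'global_step': 6}, commit=True)

# The old expectation no longer matches and raises AssertionError.
try:
    wandb.log.assert_called_with(
        {'learning_rate': 0.02, 'momentum': 0.95}, step=6, commit=True)
    old_expectation_holds = True
except AssertionError:
    old_expectation_holds = False

print(old_expectation_holds)  # False
```

`assert_called_with` compares both positional and keyword arguments of the most recent call, which is why dropping `step=` and adding `global_step` to the dict both need to be reflected in the expected call.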

shenmishajing · 2021-04-08T01:50:46Z

Well, I may fix it :).

codecov · 2021-04-08T01:55:04Z

Codecov Report

Merging #913 (01e7d31) into master (03a2e3a) will decrease coverage by 0.01%.
The diff coverage is 40.00%.

❗ Current head 01e7d31 differs from pull request most recent head a1ec8a7. Consider uploading reports for the commit a1ec8a7 to get more accurate results

@@            Coverage Diff             @@
##           master     #913      +/-   ##
==========================================
- Coverage   64.70%   64.68%   -0.02%     
==========================================
  Files         151      151              
  Lines        9599     9603       +4     
  Branches     1758     1759       +1     
==========================================
+ Hits         6211     6212       +1     
- Misses       3034     3036       +2     
- Partials      354      355       +1

Flag	Coverage Δ
unittests	`64.68% <40.00%> (-0.02%)`	⬇️

Impacted Files	Coverage Δ
mmcv/runner/hooks/logger/wandb.py	`68.75% <40.00%> (-6.25%)`	⬇️

ZwwWayne · 2021-04-08T05:27:06Z

Hi @shenmishajing,
Thanks for the cooperation. Do we have some visualization results to see how it looks before and after this PR?

shenmishajing · 2021-04-08T07:11:33Z

Hi @shenmishajing,
Thanks for the cooperation. Do we have some visualization results to see how it looks before and after this PR?

Do you mean the visualization of the log result on the wandb site, like this?

You can find details at #911.

Although it works, the step no longer equals self._iter.

After this PR, it will look like the above. Before this PR, it also looks like the above. The only difference is that the step variable on the wandb site no longer equals self._iter in the mmcv runner code. After this PR, a variable named global_step will be added, which equals self._iter.

ZwwWayne · 2021-04-08T08:35:54Z

LGTM. This PR will be merged after approval by @xvjiarui

xvjiarui · 2021-04-08T17:25:51Z

The step argument is useful in some cases.
For example, when training by epochs, the wandb logger logs training info after each iteration. Suppose the user terminates the job at [500/1000] of Epoch[5] and resumes from the previous checkpoint, Epoch[4].
Before this PR, the new logs from [1/1000] to [500/1000] will be ignored. Starting from [501/1000], the logger will log new training info. This keeps the log consistent, with the same total step count (the x-axis of the plot).
After this PR, the new logs from [1/1000] to [500/1000] will still be logged after the old [1/1000] to [500/1000], which is not ideal.

So I suggest making this change optional, so that the original behavior can also be preserved.
We could keep global_step, as it does not affect other behaviors. We may add an argument with_step that defaults to True. If with_step == True, we log the step from get_iter. Otherwise, we do not log the step.
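A rough sketch of how such a switch could look follows. It is not the code of WandbLoggerHook: `StepAwareLogger` and `RecordingBackend` are hypothetical names, the backend only records calls, and `global_step` is always added to the payload as suggested above.

```python
class RecordingBackend:
    """Hypothetical stand-in for the wandb module; it only records calls."""

    def __init__(self):
        self.calls = []

    def log(self, data, **kwargs):
        self.calls.append((dict(data), kwargs))


class StepAwareLogger:
    """Hypothetical sketch of a logger with an optional explicit step."""

    def __init__(self, backend, with_step=True, commit=True):
        self.backend = backend
        self.with_step = with_step
        self.commit = commit

    def log(self, tags, current_iter):
        if not tags:
            return
        payload = dict(tags)
        payload['global_step'] = current_iter  # always usable as an x-axis
        if self.with_step:
            # Original behavior: pin the record to the runner's iteration.
            self.backend.log(payload, step=current_iter, commit=self.commit)
        else:
            # Let the backend number the records itself.
            self.backend.log(payload, commit=self.commit)


pinned = RecordingBackend()
StepAwareLogger(pinned, with_step=True).log({'loss': 0.5}, current_iter=6)
free = RecordingBackend()
StepAwareLogger(free, with_step=False).log({'loss': 0.5}, current_iter=6)
print(pinned.calls[0][1])  # {'step': 6, 'commit': True}
print(free.calls[0][1])    # {'commit': True}
```

Defaulting `with_step` to True keeps existing configurations on the old call shape, and users who hit the dropped-log problem can opt out of the explicit step.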

shenmishajing · 2021-04-09T00:58:34Z

approve

shenmishajing · 2021-04-13T02:44:24Z

When will a pre-built mmcv release containing this PR be available on pip?

fix wandb logger drop result bug by delete step param

50342e3

shenmishajing changed the title ~~fix wandb logger drop result bug by delete step param~~ [Fix] Fix wandb logger drop result bug Mar 29, 2021

add global_step in wandb log to help align train and val step log

1807354

zhouzaida approved these changes Mar 30, 2021

zhouzaida requested a review from ZwwWayne April 6, 2021 16:06

shenmishajing added 2 commits April 8, 2021 09:36

fix wandb hook test unit fail bug

0c7c3ff

fix lint issue

57d83de

Merge remote-tracking branch 'mmcv/master' into fix_wandb_logger_drop…

56c9475

…_res_bug_by_delete_step_param

shenmishajing added 2 commits April 9, 2021 09:04

add with_step param of WandbLoggerHook in wandb.py

ed213c3

Merge remote-tracking branch 'mmcv/master' into fix_wandb_logger_drop…

a1ec8a7

…_res_bug_by_delete_step_param

xvjiarui approved these changes Apr 9, 2021

ZwwWayne merged commit d636257 into open-mmlab:master Apr 9, 2021

shenmishajing deleted the fix_wandb_logger_drop_res_bug_by_delete_step_param branch April 9, 2021 07:13

zhouzaida mentioned this pull request Jul 1, 2021

only log wandb step during training #1139

Closed

zhouzaida mentioned this pull request Dec 27, 2021

add artifact logging to wandb hook #1616

Merged

zhouzaida mentioned this pull request Jun 25, 2022

Wandb logging bug using Iteration based runner #2069

Open
