Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fix] Fix wandb logger drop result bug #913

Merged
merged 7 commits into from
Apr 9, 2021
Merged

[Fix] Fix wandb logger drop result bug #913

merged 7 commits into from
Apr 9, 2021

Conversation

shenmishajing
Copy link
Contributor

@shenmishajing shenmishajing commented Mar 29, 2021

You can find details at #911.

@CLAassistant
Copy link

CLAassistant commented Mar 29, 2021

CLA assistant check
All committers have signed the CLA.

@zhouzaida
Copy link
Collaborator

Advice:
fix wandb logger drop result bug by delete step param
->
[Fix] Fix wandb logger drop result bug

@shenmishajing shenmishajing changed the title fix wandb logger drop result bug by delete step param [Fix] Fix wandb logger drop result bug Mar 29, 2021
@zhouzaida
Copy link
Collaborator

You can find details at #911.

Although it works, the step don't equals self._iter anymore.

image

@xvjiarui
Copy link
Collaborator

Hi @shenmishajing
Will commit argumentation do the trick?

@zhouzaida
Copy link
Collaborator

Hi @shenmishajing
Will commit argumentation do the trick?

"You can also set commit=False in wandb.log to accumulate metrics, just be sure to call wandb.log without the commit flag to persist the metrics."

@zhouzaida
Copy link
Collaborator

For example, if you have training and validation steps you'd like to align, pass us your own step counter: wandb.log({"acc":1, "global_step":1}). Then in the graphs choose "global_step" as the x-axis.

    @master_only
    def log(self, runner):
        tags = self.get_loggable_tags(runner)
        # if tags:
        #     self.wandb.log(
        #         tags, step=self.get_iter(runner), commit=self.commit)
        if tags:
            tags['global_step'] =self.get_iter(runner)
            self.wandb.log(
                tags, commit=self.commit)

@shenmishajing
Copy link
Contributor Author

shenmishajing commented Mar 30, 2021

You can find details at #911.

Although it works, the step don't equals self._iter anymore.

image

Does it matter? I didn't pay attention to the meaning of step var when I use wandb.
And the step var minus self._iter is always equal to number of val epoch we have passed. So, I think we can ignore this.

Hi @shenmishajing
Will commit argumentation do the trick?

"You can also set commit=False in wandb.log to accumulate metrics, just be sure to call wandb.log without the commit flag to persist the metrics."

yep, if we use the commit argumentation to do the trick. We may need do some change like follows:
In mmcv/runner/hooks/logger/wandb.py
log func from:

    @master_only
    def log(self, runner):
        tags = self.get_loggable_tags(runner)
        if tags:
            self.wandb.log(
                tags, step=self.get_iter(runner), commit=self.commit)

change to

    @master_only
    def log(self, runner, commit=True):
        tags = self.get_loggable_tags(runner)
        if tags:
            self.wandb.log(
                tags, step=self.get_iter(runner) + 0 if commit else 1, commit=commit)

And we have to rewrite after_val_epoch func like follows:

    def after_val_epoch(self, runner):
        runner.log_buffer.average()
        self.log(runner, commit=False)
        if self.reset_flag:
            runner.log_buffer.clear_output()

Well, I think it is a little bit ugly, how about you? We can use it if you gays can accept the magic + 0 if commit else 1

@zhouzaida
Copy link
Collaborator

@shenmishajing

    @master_only
    def log(self, runner):
        tags = self.get_loggable_tags(runner)
        # if tags:
        #     self.wandb.log(
        #         tags, step=self.get_iter(runner), commit=self.commit)
        if tags:
            tags['global_step'] =self.get_iter(runner)
            self.wandb.log(
                tags, commit=self.commit)

We can choose "global_step" as the x-axis in the graphs or do nothing.

@shenmishajing
Copy link
Contributor Author

For example, if you have training and validation steps you'd like to align, pass us your own step counter: wandb.log({"acc":1, "global_step":1}). Then in the graphs choose "global_step" as the x-axis.

    @master_only
    def log(self, runner):
        tags = self.get_loggable_tags(runner)
        # if tags:
        #     self.wandb.log(
        #         tags, step=self.get_iter(runner), commit=self.commit)
        if tags:
            tags['global_step'] =self.get_iter(runner)
            self.wandb.log(
                tags, commit=self.commit)

I approve this.

@xvjiarui
Copy link
Collaborator

If user doesn't use 'val' pipeline, this PR will change the original behavior.

@shenmishajing
Copy link
Contributor Author

If user doesn't use 'val' pipeline, this PR will change the original behavior.

If user doesn't use 'val' pipeline, wandb logger will log once every train iter, so wandb step will always equal to self._iter var. By the way, wandb step is equal to self._iter +1 now.
So, I think we can ignore this.
Or, if you persist in this. The solution I mentioned above will not change the original behavior.

yep, if we use the commit argumentation to do the trick. We may need do some change like follows:
In mmcv/runner/hooks/logger/wandb.py
log func from:

    @master_only
    def log(self, runner):
        tags = self.get_loggable_tags(runner)
        if tags:
            self.wandb.log(
                tags, step=self.get_iter(runner), commit=self.commit)

change to

    @master_only
    def log(self, runner, commit=True):
        tags = self.get_loggable_tags(runner)
        if tags:
            self.wandb.log(
                tags, step=self.get_iter(runner) + 0 if commit else 1, commit=commit)

And we have to rewrite after_val_epoch func like follows:

    def after_val_epoch(self, runner):
        runner.log_buffer.average()
        self.log(runner, commit=False)
        if self.reset_flag:
            runner.log_buffer.clear_output()

Well, I think it is a little bit ugly, how about you? We can use it if you gays can accept the magic + 0 if commit else 1

@xvjiarui
Copy link
Collaborator

If user doesn't use 'val' pipeline, this PR will change the original behavior.

If user doesn't use 'val' pipeline, wandb logger will log once every train iter, so wandb step will always equal to self._iter var. By the way, wandb step is equal to self._iter +1 now.
So, I think we can ignore this.
Or, if you persist in this. The solution I mentioned above will not change the original behavior.

yep, if we use the commit argumentation to do the trick. We may need do some change like follows:
In mmcv/runner/hooks/logger/wandb.py
log func from:

    @master_only
    def log(self, runner):
        tags = self.get_loggable_tags(runner)
        if tags:
            self.wandb.log(
                tags, step=self.get_iter(runner), commit=self.commit)

change to

    @master_only
    def log(self, runner, commit=True):
        tags = self.get_loggable_tags(runner)
        if tags:
            self.wandb.log(
                tags, step=self.get_iter(runner) + 0 if commit else 1, commit=commit)

And we have to rewrite after_val_epoch func like follows:

    def after_val_epoch(self, runner):
        runner.log_buffer.average()
        self.log(runner, commit=False)
        if self.reset_flag:
            runner.log_buffer.clear_output()

Well, I think it is a little bit ugly, how about you? We can use it if you gays can accept the magic + 0 if commit else 1

Hi @shenmishajing
What will happen if we just set commit=True and change nothing else?

@shenmishajing
Copy link
Contributor Author

If user doesn't use 'val' pipeline, this PR will change the original behavior.

If user doesn't use 'val' pipeline, wandb logger will log once every train iter, so wandb step will always equal to self._iter var. By the way, wandb step is equal to self._iter +1 now.
So, I think we can ignore this.
Or, if you persist in this. The solution I mentioned above will not change the original behavior.

yep, if we use the commit argumentation to do the trick. We may need do some change like follows:
In mmcv/runner/hooks/logger/wandb.py
log func from:

    @master_only
    def log(self, runner):
        tags = self.get_loggable_tags(runner)
        if tags:
            self.wandb.log(
                tags, step=self.get_iter(runner), commit=self.commit)

change to

    @master_only
    def log(self, runner, commit=True):
        tags = self.get_loggable_tags(runner)
        if tags:
            self.wandb.log(
                tags, step=self.get_iter(runner) + 0 if commit else 1, commit=commit)

And we have to rewrite after_val_epoch func like follows:

    def after_val_epoch(self, runner):
        runner.log_buffer.average()
        self.log(runner, commit=False)
        if self.reset_flag:
            runner.log_buffer.clear_output()

Well, I think it is a little bit ugly, how about you? We can use it if you gays can accept the magic + 0 if commit else 1

Hi @shenmishajing
What will happen if we just set commit=True and change nothing else?

Do you mean we change the log func like follows?

    @master_only
    def log(self, runner):
        tags = self.get_loggable_tags(runner)
        if tags:
            self.wandb.log(
                tags, step=self.get_iter(runner), commit=True)

In fact, self.commit is True by default. In other word, this do not change anything.

@zhouzaida zhouzaida requested a review from ZwwWayne April 6, 2021 16:06
@ZwwWayne
Copy link
Collaborator

ZwwWayne commented Apr 7, 2021

Lint and CI failed. Could @shenmishajing help fix that?

@shenmishajing
Copy link
Contributor Author

shenmishajing commented Apr 7, 2021

Lint and CI failed. Could @shenmishajing help fix that?

I have no idea, in fact. Anyone want to help me? Or where can I find some doc about this?

@ZwwWayne
Copy link
Collaborator

ZwwWayne commented Apr 7, 2021

Lint and CI failed. Could @shenmishajing help fix that?

I have no idea, in fact. Anyone want to help me? Or where can I find some doc about this?

  1. You can see why it fails here. And you can use tools like yapf/flake8 and pre-commit to fix the lint issues.
  2. You can see why the unit tests fail here.

@shenmishajing
Copy link
Contributor Author

Lint and CI failed. Could @shenmishajing help fix that?

I have no idea, in fact. Anyone want to help me? Or where can I find some doc about this?

  1. You can see why it fails here. And you can use tools like yapf/flake8 and pre-commit to fix the lint issues.
  2. You can see why the unit tests fail here.

I can fix the unit tests bug.
In tests/test_runner/test_hooks.py, the log res we expected defined as

    hook.wandb.log.assert_called_with({
        'learning_rate': 0.02,
        'momentum': 0.95
    },
                                      step=6,
                                      commit=True)

But, we have changed the format of log, so I think we have to change it to

    hook.wandb.log.assert_called_with({
        'learning_rate': 0.02,
        'momentum': 0.95,
        'global_step': 6
    },
                                      commit=True)

As the lint issus, I have no idea at all. It's ok when I run python -m flake8 mmcv locally. I need some help :).

@shenmishajing
Copy link
Contributor Author

Lint and CI failed. Could @shenmishajing help fix that?

I have no idea, in fact. Anyone want to help me? Or where can I find some doc about this?

  1. You can see why it fails here. And you can use tools like yapf/flake8 and pre-commit to fix the lint issues.
  2. You can see why the unit tests fail here.

I can fix the unit tests bug.
In tests/test_runner/test_hooks.py, the log res we expected defined as

    hook.wandb.log.assert_called_with({
        'learning_rate': 0.02,
        'momentum': 0.95
    },
                                      step=6,
                                      commit=True)

But, we have changed the format of log, so I think we have to change it to

    hook.wandb.log.assert_called_with({
        'learning_rate': 0.02,
        'momentum': 0.95,
        'global_step': 6
    },
                                      commit=True)

As the lint issus, I have no idea at all. It's ok when I run python -m flake8 mmcv locally. I need some help :).

Well, I may fix it :).

@codecov
Copy link

codecov bot commented Apr 8, 2021

Codecov Report

Merging #913 (01e7d31) into master (03a2e3a) will decrease coverage by 0.01%.
The diff coverage is 40.00%.

❗ Current head 01e7d31 differs from pull request most recent head a1ec8a7. Consider uploading reports for the commit a1ec8a7 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master     #913      +/-   ##
==========================================
- Coverage   64.70%   64.68%   -0.02%     
==========================================
  Files         151      151              
  Lines        9599     9603       +4     
  Branches     1758     1759       +1     
==========================================
+ Hits         6211     6212       +1     
- Misses       3034     3036       +2     
- Partials      354      355       +1     
Flag Coverage Δ
unittests 64.68% <40.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
mmcv/runner/hooks/logger/wandb.py 68.75% <40.00%> (-6.25%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 03a2e3a...a1ec8a7. Read the comment docs.

@ZwwWayne
Copy link
Collaborator

ZwwWayne commented Apr 8, 2021

Hi @shenmishajing
Thanks for the cooperation. Do we have some visualization results to see how it looks before and after this PR?

@shenmishajing
Copy link
Contributor Author

Hi @shenmishajing
Thanks for the cooperation. Do we have some visualization results to see how it looks before and after this PR?

Do you mean the visualization of the log result on wandb site, like this?

You can find details at #911.

Although it works, the step don't equals self._iter anymore.

image

After this PR, it will look like above. Before this PR, it also looks like above. The only difference is the step var on wandb site is not equal to self._iter in mmcv runner code any more. After this PR, a var named global_step will be added, which will equal to self._iter.

@ZwwWayne
Copy link
Collaborator

ZwwWayne commented Apr 8, 2021

LGTM. This PR will be merged after approval by @xvjiarui

@xvjiarui
Copy link
Collaborator

xvjiarui commented Apr 8, 2021

The step argument is useful under some cases.
For example, when we training by epochs. The wandb logger log training info after each iteration. Then user terminate the job at the [500/1000] of Epoch[5] and resume from the previous checkpoint Epoch[4].
Before this PR, the new logs from [1/1000] to [500/1000] will be ignored. Starting from [501/1000], logger will log new train infos. This could keep the log consistent and of the same total step (the x axis of the plot).
After this PR, the new logs from [1/1000] to [500/1000] will still be logged after the old [1/1000] to [500/1000], which is not ideal.

So I suggest make this change optional. Such that the original behavior could also be preserved.
We could keep the global_step as it does not affect other behaviors. We may add an argument with_step default to True. If with_step == True, we will log the step from get_iters. Otherwise, we will not log step.

@shenmishajing
Copy link
Contributor Author

The step argument is useful under some cases.
For example, when we training by epochs. The wandb logger log training info after each iteration. Then user terminate the job at the [500/1000] of Epoch[5] and resume from the previous checkpoint Epoch[4].
Before this PR, the new logs from [1/1000] to [500/1000] will be ignored. Starting from [501/1000], logger will log new train infos. This could keep the log consistent and of the same total step (the x axis of the plot).
After this PR, the new logs from [1/1000] to [500/1000] will still be logged after the old [1/1000] to [500/1000], which is not ideal.

So I suggest make this change optional. Such that the original behavior could also be preserved.
We could keep the global_step as it does not affect other behaviors. We may add an argument with_step default to True. If with_step == True, we will log the step from get_iters. Otherwise, we will not log step.

approve

@ZwwWayne ZwwWayne merged commit d636257 into open-mmlab:master Apr 9, 2021
@shenmishajing shenmishajing deleted the fix_wandb_logger_drop_res_bug_by_delete_step_param branch April 9, 2021 07:13
@shenmishajing
Copy link
Contributor Author

When will I get the pre-build version mmcv in pip with this pr?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants