Fix metrics in macro average #303

Merged (44 commits) on Aug 2, 2021

Conversation

vatch123
Contributor

Before submitting

  • Was this discussed/approved via a GitHub issue? (not needed for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #295 and Fixes #300.

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@vatch123
Contributor Author

Hey @SkafteNicki, changing the weights to account for missing classes seems to fix accuracy and precision, but not the F1 score. Have a look at the following snippet.

import torch
import torchmetrics
from sklearn.metrics import accuracy_score, f1_score, precision_score

# Predictions identical to the targets; note that no sample has class id=2.
y_pred = y_true = torch.tensor([1, 6, 6, 6, 3, 6, 3, 6, 6, 3, 6, 3, 6, 6, 3, 6, 3, 3, 6, 6, 6, 6, 6, 6,
        6, 3, 5, 6, 6, 3, 6, 6, 6, 6, 3, 6, 6, 3, 3, 1, 3, 1, 6, 3, 3, 1, 3, 6,
        4, 6, 6, 6, 6, 6, 6, 3, 3, 6, 3, 6, 1, 0, 5, 3, 6, 6, 6, 6, 3, 0, 6, 3,
        3, 3, 6, 3, 4])

y_true_list = y_true.tolist()
y_pred_list = y_pred.tolist()

print('Metrics:')
print(f'F1: {f1_score(y_true_list, y_pred_list, average="macro")}')
print(f'ACC: {accuracy_score(y_true_list, y_pred_list)}')
print(f'PREC: {precision_score(y_true_list, y_pred_list, average="macro")}')
print(f'TM F1: {torchmetrics.functional.f1(y_pred, y_true, average="macro", num_classes=7)}')
print(f'TM acc: {torchmetrics.functional.accuracy(y_pred, y_true, average="macro", num_classes=7)}')
print(f'TM prec: {torchmetrics.functional.precision(y_pred, y_true, average="macro", num_classes=7)}')
Metrics:
F1: 1.0
ACC: 1.0
PREC: 1.0
TM F1: 0.8571428656578064
TM acc: 1.0
TM prec: 1.0

I will look into the F1 score issue tomorrow. I guess we should add a test for this.
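For reference, here is a minimal sketch (my own illustration, not the torchmetrics implementation) of a macro average that ignores classes with zero support, which is what sklearn effectively does here and why its macro F1 is 1.0 while the TM value above comes out as 6/7 ≈ 0.857:

import torch

def macro_average(per_class_scores: torch.Tensor, support: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: average per-class scores only over classes that
    # actually occur in the data (support > 0).
    present = support > 0
    return per_class_scores[present].mean()

# 7 classes, class id=2 never appears; the support counts are illustrative.
scores = torch.tensor([1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0])  # per-class F1
support = torch.tensor([2, 5, 0, 20, 2, 2, 46])
print(macro_average(scores, support))  # tensor(1.)
print(scores.mean())                   # tensor(0.8571), i.e. dividing by all 7 classes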

@SkafteNicki
Member

Hi @vatch123,
Could you try deleting this line of code:
https://github.com/PyTorchLightning/metrics/blob/1841cad3839f5d1907a1bb8bb6a266de5c5333f9/torchmetrics/functional/classification/f_beta.py#L52
I am pretty sure it is a redundant line of code which is messing things up

@Borda added the bug / fix label Jun 17, 2021
@codecov

codecov bot commented Jun 18, 2021

Codecov Report

Merging #303 (4b01415) into master (d8b89e0) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #303      +/-   ##
==========================================
+ Coverage   96.11%   96.12%   +0.01%     
==========================================
  Files         124      124              
  Lines        3986     4002      +16     
==========================================
+ Hits         3831     3847      +16     
  Misses        155      155              
Flag         Coverage Δ
Linux        75.56% <64.70%> (-0.06%) ⬇️
Windows      75.56% <64.70%> (-0.06%) ⬇️
cpu          96.05% <100.00%> (+0.01%) ⬆️
gpu          96.05% <100.00%> (+0.01%) ⬆️
macOS        96.05% <100.00%> (+0.01%) ⬆️
pytest       96.12% <100.00%> (+0.01%) ⬆️
python3.6    95.27% <100.00%> (+0.01%) ⬆️
python3.8    96.05% <100.00%> (+0.01%) ⬆️
python3.9    95.95% <100.00%> (+0.01%) ⬆️
torch1.3.1   95.27% <100.00%> (+0.01%) ⬆️
torch1.4.0   95.35% <100.00%> (+0.01%) ⬆️
torch1.9.0   95.95% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                                          Coverage Δ
torchmetrics/functional/classification/accuracy.py     93.15% <100.00%> (+0.39%) ⬆️
torchmetrics/functional/classification/f_beta.py       100.00% <100.00%> (ø)
...rics/functional/classification/precision_recall.py  100.00% <100.00%> (ø)


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d8b89e0...4b01415.

@vatch123
Contributor Author

vatch123 commented Jun 18, 2021

Hi @vatch123,
Could you try deleting this line of code:
https://github.com/PyTorchLightning/metrics/blob/1841cad3839f5d1907a1bb8bb6a266de5c5333f9/torchmetrics/functional/classification/f_beta.py#L52

I am pretty sure it is a redundant line of code which is messing things up

Yes, doing that seems to fix it, although some other tests do fail.

@Borda marked this pull request as ready for review June 18, 2021 19:01
@Borda
Member

Borda commented Jun 21, 2021

@SkafteNicki @vatch123 how is it going here? :]

@vatch123
Contributor Author

@SkafteNicki @vatch123 how is it going here? :]

Hey, sorry, I was a bit busy this weekend. However, removing the line that @SkafteNicki suggested breaks the accuracy tests, so something odd is happening which I wasn't able to figure out then. Let me have a look again.

@vatch123
Contributor Author

Hey @SkafteNicki, I was probing the issue. In local testing this particular test fails: test_accuracy.py::test_average_accuracy[preds4-target4-4-0.041666666666666664-macro-samplewise], due to the following assertion error:

===================================================== FAILURES =====================================================
__________________ test_average_accuracy[preds4-target4-4-0.041666666666666664-macro-samplewise] ___________________

preds = tensor([[[[0.1000, 0.1000, 0.1000],
          [0.2000, 0.2000, 0.2000],
          [0.3000, 0.3000, 0.3000],
          ...0.1000],
          [0.2000, 0.2000, 0.2000],
          [0.3000, 0.3000, 0.3000],
          [0.4000, 0.4000, 0.4000]]]])
target = tensor([[[1, 1, 0],
         [2, 2, 2],
         [3, 3, 3]],

        [[2, 2, 0],
         [1, 1, 1],
         [0, 0, 0]]])
num_classes = 4, exp_result = 0.041666666666666664, average = 'macro', mdmc_average = 'samplewise'

    @pytest.mark.parametrize(
        "preds, target, num_classes, exp_result, average, mdmc_average",
        [
            (_topk_preds_mcls, _topk_target_mcls, 4, 1 / 4, "macro", None),
            (_topk_preds_mcls, _topk_target_mcls, 4, 1 / 6, "weighted", None),
            (_topk_preds_mcls, _topk_target_mcls, 4, [0., 0., 0., 1.], "none", None),
            (_topk_preds_mcls, _topk_target_mcls, 4, 1 / 6, "samples", None),
            (_topk_preds_mdmc, _topk_target_mdmc, 4, 1 / 24, "macro", "samplewise"),
            (_topk_preds_mdmc, _topk_target_mdmc, 4, 1 / 6, "weighted", "samplewise"),
            (_topk_preds_mdmc, _topk_target_mdmc, 4, [0., 0., 0., 1 / 6], "none", "samplewise"),
            (_topk_preds_mdmc, _topk_target_mdmc, 4, 1 / 6, "samples", "samplewise"),
            (_topk_preds_mdmc, _topk_target_mdmc, 4, 1 / 6, "samples", "global"),
            (_av_preds_ml, _av_target_ml, 4, 5 / 8, "macro", None),
            (_av_preds_ml, _av_target_ml, 4, 0.70000005, "weighted", None),
            (_av_preds_ml, _av_target_ml, 4, [1 / 2, 1 / 2, 1., 1 / 2], "none", None),
            (_av_preds_ml, _av_target_ml, 4, 5 / 8, "samples", None),
        ],
    )
    def test_average_accuracy(preds, target, num_classes, exp_result, average, mdmc_average):
        acc = Accuracy(num_classes=num_classes, average=average, mdmc_average=mdmc_average)
    
        for batch in range(preds.shape[0]):
            acc(preds[batch], target[batch])
    
>       assert (acc.compute() == tensor(exp_result)).all()
E       assert tensor(False)
E        +  where tensor(False) = <built-in method all of Tensor object at 0x7f04240c6f40>()
E        +    where <built-in method all of Tensor object at 0x7f04240c6f40> = tensor(0.1667) == tensor(0.0417).all
E        +      where tensor(0.1667) = <function Accuracy.compute at 0x7f04b3c2e0d0>()
E        +        where <function Accuracy.compute at 0x7f04b3c2e0d0> = Accuracy().compute
E        +      and   tensor(0.0417) = tensor(0.041666666666666664)

test_accuracy.py:294: AssertionError

It seems like the calculated accuracy is getting rounded off somewhere, but the trace also shows another value, tensor(0.1667). Am I interpreting it correctly? The entire test suite is taking a lot of time to run on my end; I don't know why, though.

@maximsch2
Contributor

As another comment, can you please also add the test case you used to find the issue to the tests to avoid regressions in the future?

As for the test suite being slow: you should be able to run just individual test files with pytest while you are debugging a particular test, to avoid waiting for everything to run.
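For example, one file or one parametrized case can be run on its own (the paths below are an assumption about the repo layout):

# Equivalent CLI: pytest tests/classification/test_accuracy.py -k test_average_accuracy -x -q
import pytest

pytest.main(["tests/classification/test_accuracy.py", "-k", "test_average_accuracy", "-x", "-q"])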

@vatch123
Contributor Author

Hey @SkafteNicki, I am a little stuck here. I can't figure out which inputs are breaking the F1 tests. The tests don't run to completion locally, throwing this error: RuntimeError: Application timeout caused pair closure. On CI I can only see which tests break and no other details, since the CI runs don't complete. I am assuming that test names such as test_fbeta_f1[True-True-preds6-target6-5-None-None-_sk_fbeta_f1-None-macro-metric_class0-metric_fn0-sk_fn0] carry some information about the inputs of the test, but I am unable to correlate it completely.
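For what it's worth, those bracketed IDs are generated by pytest.mark.parametrize, one segment per argument. A minimal illustration (not the actual torchmetrics test) of how the segments map back to the parameters:

import pytest
import torch

# Values without a short repr (such as tensors) are labelled "<argname><index>",
# so a segment like "preds6" refers to the 7th preds entry in the parameter list,
# while plain strings such as "macro" appear verbatim in the ID.
@pytest.mark.parametrize(
    "preds, target, average",
    [
        (torch.tensor([0, 1]), torch.tensor([0, 1]), "macro"),
        (torch.tensor([1, 1]), torch.tensor([0, 1]), "micro"),
    ],
)
def test_id_demo(preds, target, average):
    assert preds.shape == target.shape

# Generated IDs: test_id_demo[preds0-target0-macro] and test_id_demo[preds1-target1-micro].
# A single case can then be selected with: pytest -k "preds6 and macro"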

@Borda
Member

Borda commented Jul 1, 2021

@SkafteNicki @vatch123 any chance we get this in for tomorrow's bugfix release?

@vatch123
Contributor Author

vatch123 commented Jul 3, 2021

Hi, sorry for the late reply. I am a little stuck here and unable to figure out what is actually breaking. I guess I will need some more input on this.

Hey @SkafteNicki, I am a little stuck here. I can't figure out which inputs are breaking the F1 tests. The tests don't run to completion locally, throwing this error: RuntimeError: Application timeout caused pair closure. On CI I can only see which tests break and no other details, since the CI runs don't complete. I am assuming that test names such as test_fbeta_f1[True-True-preds6-target6-5-None-None-_sk_fbeta_f1-None-macro-metric_class0-metric_fn0-sk_fn0] carry some information about the inputs of the test, but I am unable to correlate it completely.

@Borda
Member

Borda commented Jul 26, 2021

@vatch123 how is it going here? 🐰

@vatch123
Contributor Author

@vatch123 how is it going here? 🐰

Hey, sorry, I've been busy recently and did not get a chance to probe deeper. I tried with a debugger but haven't been able to pinpoint the error. I will need some more time and help on this, I guess.

@mergify bot removed the has conflicts label Aug 2, 2021
@mergify bot added the ready label Aug 2, 2021
@SkafteNicki enabled auto-merge (squash) August 2, 2021 14:47
@mergify bot added the has conflicts label Aug 2, 2021
@mergify bot removed the has conflicts label Aug 2, 2021
@SkafteNicki merged commit 79cb5e2 into Lightning-AI:master Aug 2, 2021
@vatch123
Contributor Author

vatch123 commented Aug 2, 2021

Hey @SkafteNicki, thanks for the help here. Can you let me know what the actual issue was? I see you mentioned dist_sync_on_step not working and that you added one more test. Can you briefly describe the issue? Thanks again.

@Scass0807

Yes @SkafteNicki @Borda, can someone please explain whether or not this works for dist_sync_on_step? During testing I am getting a macro accuracy value of 0.1996 when it should be about 0.65. I calculated it manually after testing by extracting all the data to a CSV.
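For reference, a rough way to do that cross-check outside the metric, assuming the CSV dump holds one row per sample with integer prediction and target columns (the file name and column names are hypothetical):

import pandas as pd
import torch
from torchmetrics.functional import accuracy

df = pd.read_csv("eval_outputs.csv")          # hypothetical dump of the test loop
preds = torch.tensor(df["pred"].to_numpy())
target = torch.tensor(df["target"].to_numpy())

num_classes = int(max(preds.max(), target.max())) + 1
print(accuracy(preds, target, average="macro", num_classes=num_classes))

# Manual macro accuracy: per-class accuracy (recall on that class) averaged
# over the classes that actually occur in the targets.
per_class = torch.stack([
    (preds[target == c] == c).float().mean()
    for c in range(num_classes)
    if (target == c).any()
])
print(per_class.mean())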
