
F1, Accuracy, Precision and Recall all output the same value consistently in a binary classification setting. #746

Closed
FeryET opened this issue Jan 12, 2022 · 12 comments · Fixed by #1195
Labels: question (Further information is requested), working as intended
Comments

@FeryET

FeryET commented Jan 12, 2022

🐛 Bug

I am trying to report F1, Accuracy, Precision and Recall for a binary classification task. I have collected these metrics in a MetricCollection module and run them for my train, val, and test stages. Upon inspecting the results, I can see that all of these metrics report the exact same value.

To Reproduce

Create a random binary classification task and add these metrics together in a metric collection.

Code sample

I have uploaded a very minimal example in this notebook. As you can see, the values reported by torchmetrics don't align with sklearn's classification_report.
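
Roughly, the setup looks like this (a minimal sketch, assuming a torchmetrics version around 0.7 with its default averaging; on older versions the class is named F1 instead of F1Score; see the notebook for the exact code):

import torch
from torchmetrics import Accuracy, F1Score, MetricCollection, Precision, Recall

torch.manual_seed(0)
preds = torch.rand(100)                # predicted probabilities for the positive class
target = torch.randint(0, 2, (100,))   # random binary labels

metrics = MetricCollection(
    {"acc": Accuracy(), "f1": F1Score(), "prec": Precision(), "rec": Recall()}
)
print(metrics(preds, target))  # all four values come out identical with these defaults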

Expected behavior

F1, Precision, Recall and Accuracy should usually differ; it should be very unlikely for all of them to match exactly.

Environment

  • PyTorch Version (e.g., 1.0): 1.10.0
  • OS (e.g., Linux): Ubuntu 20.04
  • How you installed PyTorch: conda
  • Build command you used (if compiling from source):
  • Python version: 3.9
  • CUDA/cuDNN version: None
  • GPU models and configuration: None
  • Any other relevant information: None

Additional context

I also asked this question in a discussion thread yesterday, thinking it was a problem on my part, but after looking into the situation, I think this might be a bug.

#743

@FeryET FeryET added the bug / fix (Something isn't working) and help wanted (Extra attention is needed) labels Jan 12, 2022
@github-actions

Hi! Thanks for your contribution, great first issue!

@Borda
Member

Borda commented Jan 12, 2022

Seems to be a duplicate of #543. If you still find this in need, feel free to reopen 🐰

@Borda Borda closed this as completed Jan 12, 2022
@Borda Borda added the question (Further information is requested) and working as intended labels and removed the bug / fix (Something isn't working) and help wanted (Extra attention is needed) labels Jan 12, 2022
@FeryET
Author

FeryET commented Jan 12, 2022

Seems to be a duplicate of #543. If you still find this in need, feel free to reopen 🐰

Sorry if I'm reopening the issue, but I think this is at the very least a documentation issue. The way the micro option is computed is confusing. Can you explain how exactly these options work in torchmetrics? I read the documentation before using these metrics and was sure micro was the flag I should use. :-/ I'll write out what I first understood from the documentation, so you can see why I was confused.

From what I know, regardless of micro or macro, F1 and Accuracy should yield different results. With micro, Accuracy should look at all samples and compute the fraction of matching samples, while with macro it should compute class-wise accuracy and then average it (weighted or not). F1, on the other hand, is predominantly a binary classification metric and should combine precision and recall; in a multiclass setting this should be done one-vs-rest and then averaged (or weighted-averaged). So I don't understand micro F1 vs macro F1 at all. Same thing with Precision and Recall.
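
To make the confusion concrete, here is what the averaging modes do in sklearn (a sketch for illustration only, not torchmetrics code): with micro averaging over both classes, precision, recall, and F1 all collapse into accuracy, while macro and binary averaging do not.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

target = np.array([1, 0, 1, 1, 0, 0, 1, 0])
preds = np.array([1, 1, 0, 1, 0, 0, 1, 1])

# Micro averaging over both classes: every false positive for one class is a
# false negative for the other, so precision == recall == F1 == accuracy.
accuracy_score(target, preds)                    # 0.625
precision_score(target, preds, average='micro')  # 0.625
recall_score(target, preds, average='micro')     # 0.625
f1_score(target, preds, average='micro')         # 0.625

# Macro and binary averaging do not collapse.
f1_score(target, preds, average='macro')   # ~0.619 (mean of the per-class F1 scores)
f1_score(target, preds, average='binary')  # ~0.667 (F1 of the positive class only)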

@Borda Borda reopened this Jan 12, 2022
@Borda
Member

Borda commented Jan 12, 2022

cc: @SkafteNicki @aribornstein

@dangne

dangne commented Mar 22, 2022

Hi, I'm having the same issue on:

  • torch 1.9.1
  • torchmetrics 0.7.2

@SkafteNicki SkafteNicki added this to the v0.9 milestone Mar 23, 2022
@Waterkin

Waterkin commented Apr 6, 2022

Hi, I'm having the same issue with Accuracy.

@ma7dev

ma7dev commented Apr 21, 2022

I have encountered this issue as well; here is a Colab notebook that replicates the issue and the solution.

I agree with @FeryET: the setup is confusing, and it would be great if there were a warning or a better example to showcase the difference.

@SkafteNicki SkafteNicki modified the milestones: v0.9, v0.10 May 12, 2022
@rasbt

rasbt commented Jun 3, 2022

Also, can we add a "binary" option for average so that we can compute the original recall score for binary classes?

Like:

import sklearn.metrics as metrics
import numpy as np

a = np.array([1, 1, 0, 0, 0, 1])  # ground-truth labels
b = np.array([0, 1, 1, 1, 0, 1])  # predicted labels

# 'binary' scores only the positive class (label 1): 2 of the 3 positives are recovered
metrics.recall_score(a, b, average='binary')  # 0.6666666666666666

@griff4692

I am also having this issue - is there a simple way to fix it?

@lucienwang1009

lucienwang1009 commented Jun 17, 2022

Hi, I encountered a similar issue when using the Precision metric in a MetricCollection. However, the output was always zero rather than consistent with the other metrics.
Setting compute_groups to False fixed my problem.
Hope this helps.

@JamesLYC88
Contributor

I encountered the same issue. As @lucienwang1009 said, initializing MetricCollection with compute_groups=False works. For example:

from torchmetrics import MetricCollection, Precision

# compute_groups=False keeps the two Precision instances from being grouped
# together and sharing state.
MetricCollection(
    {'P@8': Precision(num_classes=8), 'P@15': Precision(num_classes=15)},
    compute_groups=False
)

Some detailed observations:

  • My results for P@8 and P@15 are correct on the validation data, but the values of P@8 and P@15 are exactly the same when evaluating on the test set. I think the bug might be related to the data.
  • When I only use one of [P@8, P@15] when running inference on the test data, the values are correct.
  • compute_groups might group [P@8, P@15] together and perform some incorrect operation that causes the problem.

The bug is not easy to observe, and I spent hours checking other places like data pre-processing, training and testing scripts, package versions, and so on.
I think this is a critical bug that needs to be solved as soon as possible, or, more simply, the default value of compute_groups should be False.

@SkafteNicki
Member

This issue will be fixed by the classification refactor: see issue #1001 and PR #1195 for all the changes.

Small recap: this issue describes that the F1, Accuracy, Precision, and Recall metrics all return the same value in the binary setting, which is wrong. The problem with the current implementation is that the metrics are calculated as an average over the 0 and 1 classes, which essentially makes all the scores collapse into the same value.

Using the new binary_* versions of all the metrics:

from torch import tensor
from torchmetrics.functional import binary_accuracy, binary_precision, binary_recall, binary_f1_score

preds = tensor([0.4225, 0.5042, 0.1142, 0.4134, 0.0978, 0.1402, 0.9422, 0.4846, 0.1639, 0.6613])
target = tensor([1, 1, 1, 1, 1, 1, 1, 0, 1, 1])
binary_accuracy(preds, target)   # tensor(0.4000)
binary_recall(preds, target)     # tensor(0.3333)
binary_precision(preds, target)  # tensor(1.)
binary_f1_score(preds, target)   # tensor(0.5000)

which also corresponds to what sklearn gives. Sorry for the confusion this has caused.
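
For reference, the same numbers from sklearn (a sketch; it assumes the probabilities are thresholded at 0.5, which is the torchmetrics default):

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

preds = np.array([0.4225, 0.5042, 0.1142, 0.4134, 0.0978, 0.1402, 0.9422, 0.4846, 0.1639, 0.6613])
target = np.array([1, 1, 1, 1, 1, 1, 1, 0, 1, 1])
hard_preds = (preds >= 0.5).astype(int)  # apply the 0.5 threshold to the probabilities

accuracy_score(target, hard_preds)                     # 0.4
recall_score(target, hard_preds, average='binary')     # 0.333...
precision_score(target, hard_preds, average='binary')  # 1.0
f1_score(target, hard_preds, average='binary')         # 0.5
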
The issue will be closed when #1195 is merged.
