
F1, Accuracy, Precision and Recall all output the same value consistently in a binary classification setting. #746

Closed
FeryET opened this issue Jan 12, 2022 · 12 comments · Fixed by #1195
Labels: question (Further information is requested), working as intended
Comments

@FeryET

FeryET commented Jan 12, 2022

🐛 Bug

I am trying to report F1, Accuracy, Precision and Recall for a binary classification task. I have collected these metrics in a MetricCollection module and run them for my train, val, and test stages. Upon inspecting the results, I can see that all of these metrics report the exact same value.

To Reproduce

Create a random binary classification task and add these metrics together in a metric collection.

Code sample

I have uploaded a very minimal example in this notebook. As you can see, the values reported by torchmetrics don't align with sklearn's classification_report.
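
Roughly, the setup looks like this (a minimal sketch, assuming a torchmetrics version around 0.7 with its default averaging; on older versions the class is named F1 instead of F1Score; see the notebook for the exact code):

import torch
from torchmetrics import Accuracy, F1Score, MetricCollection, Precision, Recall

torch.manual_seed(0)
preds = torch.rand(100)                # predicted probabilities for the positive class
target = torch.randint(0, 2, (100,))   # random binary labels

metrics = MetricCollection(
    {"acc": Accuracy(), "f1": F1Score(), "prec": Precision(), "rec": Recall()}
)
print(metrics(preds, target))  # all four values come out identical with these defaults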

Expected behavior

F1, Precision, Recall and Accuracy should usually differ; it should be very unlikely for all of them to match exactly.

Environment

  • PyTorch Version (e.g., 1.0): 1.10.0
  • OS (e.g., Linux): Ubuntu 20.04
  • How you installed PyTorch: conda
  • Build command you used (if compiling from source):
  • Python version: 3.9
  • CUDA/cuDNN version: None
  • GPU models and configuration: None
  • Any other relevant information: None

Additional context

I also asked this question in a discussion thread yesterday, thinking it was a problem on my part, but after looking into the situation, I think this might be a bug.

#743

@FeryET FeryET added the bug / fix (Something isn't working) and help wanted (Extra attention is needed) labels Jan 12, 2022
@github-actions

Hi! Thanks for your contribution, great first issue!

@Borda
Member

Borda commented Jan 12, 2022

Seems to be a duplicate of #543. If you still find this in need, feel free to reopen 🐰

@Borda Borda closed this as completed Jan 12, 2022
@Borda Borda added the question (Further information is requested) and working as intended labels and removed the bug / fix (Something isn't working) and help wanted (Extra attention is needed) labels Jan 12, 2022
@FeryET
Author

FeryET commented Jan 12, 2022

Seems to be a duplicate of #543. If you still find this in need, feel free to reopen 🐰

Sorry if I'm reopening the issue, but I think this is at the very least a documentation issue. The way the micro option is computed is confusing. Can you explain how exactly these options work in torchmetrics? I read the documentation before using these metrics and was sure micro was the flag I should use. :-/ I'll write out what I first understood from the documentation, so you can see why I was confused.

From what I know, regardless of micro or macro, F1 and Accuracy should yield different results. With micro, Accuracy should look at all samples and compute the fraction of matching samples, while with macro it should compute class-wise accuracy and then average it (weighted or not). F1, on the other hand, is predominantly a binary classification metric and should combine precision and recall; in a multiclass setting this should be done one-vs-rest and then averaged (or weighted-averaged). So I don't understand micro F1 vs macro F1 at all. Same thing with Precision and Recall.
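
To make the confusion concrete, here is what the averaging modes do in sklearn (a sketch for illustration only, not torchmetrics code): with micro averaging over both classes, precision, recall, and F1 all collapse into accuracy, while macro and binary averaging do not.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

target = np.array([1, 0, 1, 1, 0, 0, 1, 0])
preds = np.array([1, 1, 0, 1, 0, 0, 1, 1])

# Micro averaging over both classes: every false positive for one class is a
# false negative for the other, so precision == recall == F1 == accuracy.
accuracy_score(target, preds)                    # 0.625
precision_score(target, preds, average='micro')  # 0.625
recall_score(target, preds, average='micro')     # 0.625
f1_score(target, preds, average='micro')         # 0.625

# Macro and binary averaging do not collapse.
f1_score(target, preds, average='macro')   # ~0.619 (mean of the per-class F1 scores)
f1_score(target, preds, average='binary')  # ~0.667 (F1 of the positive class only)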

@Borda Borda reopened this Jan 12, 2022
@Borda
Member

Borda commented Jan 12, 2022

cc: @SkafteNicki @aribornstein

@dangne

dangne commented Mar 22, 2022

Hi, I'm having the same issue on:

  • torch 1.9.1
  • torchmetrics 0.7.2

@SkafteNicki SkafteNicki added this to the v0.9 milestone Mar 23, 2022
@Waterkin

Waterkin commented Apr 6, 2022

Hi, I'm having the same issue with Accuracy.

@ma7dev

ma7dev commented Apr 21, 2022

I have encountered this issue as well; here is a Colab notebook that replicates the issue and the solution.

I agree with @FeryET: the setup is confusing, and it would be great if there were a warning or a better example to showcase the difference.

@SkafteNicki SkafteNicki modified the milestones: v0.9, v0.10 May 12, 2022
@rasbt

rasbt commented Jun 3, 2022

Also, can we add a "binary" option for average so that we can compute the original recall score for binary classes?

Like:

import sklearn.metrics as metrics
import numpy as np

a = np.array([1, 1, 0, 0, 0, 1])  # ground-truth labels
b = np.array([0, 1, 1, 1, 0, 1])  # predicted labels

# 'binary' scores only the positive class (label 1): 2 of the 3 positives are recovered
metrics.recall_score(a, b, average='binary')  # 0.6666666666666666

@griff4692

I am also having this issue - is there a simple way to fix it?

@lucienwang1009

lucienwang1009 commented Jun 17, 2022

Hi, I encountered a similar issue when using the Precision metric in a MetricCollection. However, the output was always zero rather than consistent with the other metrics.
Setting compute_groups to False fixed my problem.
Hope this helps.

@JamesLYC88
Contributor

I encountered the same issue. As @lucienwang1009 said, initializing MetricCollection with compute_groups=False works. For example:

from torchmetrics import MetricCollection, Precision

# compute_groups=False keeps the two Precision instances from being grouped
# together and sharing state.
MetricCollection(
    {'P@8': Precision(num_classes=8), 'P@15': Precision(num_classes=15)},
    compute_groups=False
)

Some detailed observations:

  • My results for P@8 and P@15 are correct on the validation data, but the values of P@8 and P@15 are exactly the same when evaluating on the test set. I think the bug might be related to the data.
  • When I only use one of [P@8, P@15] when running inference on the test data, the values are correct.
  • compute_groups might group [P@8, P@15] together and perform some incorrect operation that causes the problem.

The bug is not easy to observe, and I spent hours checking other places like data pre-processing, training and testing scripts, package versions, and so on.
I think this is a critical bug that needs to be solved as soon as possible, or, more simply, the default value of compute_groups should be False.

@SkafteNicki
Member

This issue will be fixed by the classification refactor: see issue #1001 and PR #1195 for all the changes.

Small recap: this issue describes that the F1, Accuracy, Precision, and Recall metrics all return the same value in the binary setting, which is wrong. The problem with the current implementation is that the metrics are calculated as an average over the 0 and 1 classes, which essentially makes all the scores collapse into the same value.

Using the new binary_* versions of all the metrics:

from torch import tensor
from torchmetrics.functional import binary_accuracy, binary_precision, binary_recall, binary_f1_score

preds = tensor([0.4225, 0.5042, 0.1142, 0.4134, 0.0978, 0.1402, 0.9422, 0.4846, 0.1639, 0.6613])
target = tensor([1, 1, 1, 1, 1, 1, 1, 0, 1, 1])
binary_accuracy(preds, target)   # tensor(0.4000)
binary_recall(preds, target)     # tensor(0.3333)
binary_precision(preds, target)  # tensor(1.)
binary_f1_score(preds, target)   # tensor(0.5000)

which also corresponds to what sklearn gives. Sorry for the confusion this has caused.
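
For reference, the same numbers from sklearn (a sketch; it assumes the probabilities are thresholded at 0.5, which is the torchmetrics default):

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

preds = np.array([0.4225, 0.5042, 0.1142, 0.4134, 0.0978, 0.1402, 0.9422, 0.4846, 0.1639, 0.6613])
target = np.array([1, 1, 1, 1, 1, 1, 1, 0, 1, 1])
hard_preds = (preds >= 0.5).astype(int)  # apply the 0.5 threshold to the probabilities

accuracy_score(target, hard_preds)                     # 0.4
recall_score(target, hard_preds, average='binary')     # 0.333...
precision_score(target, hard_preds, average='binary')  # 1.0
f1_score(target, hard_preds, average='binary')         # 0.5
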
The issue will be closed when #1195 is merged.
