Incorrect Precision/Recall/F1 score compared to sklearn #3035
Comments
By the way, Precision/Recall/F1 scores are also off in PyTorch Lightning 0.8.5.
I thought we tested against sklearn?
@justusschock @SkafteNicki mind having a look, please? 🐰
It's because we calculate the …
At some point we should probably support the different averaging methods that sklearn also has, since one averaging method may be more meaningful than another in some cases (like very unbalanced datasets).
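For context, a minimal sketch (using sklearn only, not Lightning code; the labels and the always-majority classifier are made up for illustration) of how much the choice of averaging can matter on an unbalanced dataset:

```python
from sklearn.metrics import f1_score

# Heavily unbalanced labels and a classifier that always predicts the majority class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

# Micro averaging is dominated by the majority class ...
print(f1_score(y_true, y_pred, average="micro"))   # 0.95
# ... while macro averaging weights both classes equally and punishes
# the completely missed minority class (its F1 is 0).
print(f1_score(y_true, y_pred, average="macro"))   # ~0.49
```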
I figured out the reason for the discrepancy: for binary classification, to recover sklearn's numbers, precision/recall/F1 should be done something like below (where `reduction` by default is …). We can close the issue for now, but it would be really good to update the docs to reflect these subtle differences. For multi-class problems, I assume there will be more nuances between Lightning and sklearn, given the different ways of averaging (macro, micro, etc.).
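A hedged sketch of that idea (the original snippet is not reproduced above; this assumes the class-based `F1` metric used elsewhere in this thread also accepts `reduction="none"` to return per-class scores):

```python
import torch
from sklearn.metrics import f1_score as sklearn_f1
from pytorch_lightning.metrics import F1

pred = torch.tensor([1, 0, 1, 1, 0, 1])
target = torch.tensor([1, 0, 0, 1, 0, 1])

# Per-class F1 instead of averaging over both classes ...
per_class_f1 = F1(2, reduction="none")(pred, target)

# ... then keep only the positive class, which is what sklearn's binary
# default (pos_label=1) reports.
print(per_class_f1[1])
print(sklearn_f1(target.numpy(), pred.numpy()))  # average='binary' by default
```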
@junwen-austin mind updating the docs so we avoid similar questions in the future?
@Borda Yes, I plan to do more testing on the metrics, if you do not mind, and then update the docs so that we have more examples. Does this sound good to you?
That would be perfect!
🐛 Bug
Steps to reproduce the behavior:
```python
from sklearn.metrics import f1_score as sklearn_f1
from pytorch_lightning.metrics import F1
import torch

# create sample labels (used as both target and "perfect" prediction)
y = torch.randint(high=199, size=(210,))
print("dummy label/prediction")
print(y)

sk_macro_f1 = sklearn_f1(y.numpy(), y.numpy(), labels=list(range(200)), average='macro')
sk_macro_f1_tiny_batch = sklearn_f1(y[:10].numpy(), y[:10].numpy(),
                                    labels=list(range(200)), average='macro')
sk_micro_f1 = sklearn_f1(y.numpy(), y.numpy(), labels=list(range(200)), average='micro')

pl_f1 = F1(200, reduction="elementwise_mean")
pl_ele_f1 = pl_f1(y, y)

print(f"""sklearn macro f1:\t{sk_macro_f1}
sklearn macro f1 (tiny batch):\t{sk_macro_f1_tiny_batch}
sklearn micro f1:\t{sk_micro_f1}
pl_elementwise f1:\t{pl_ele_f1}
""")
```

Running this, PL produces an F1 of 0.625, while the sklearn macro F1 on the tiny batch is much worse even though the predictions are perfect.
@raynardj We are already tracking it in this issue and it will be part of our new aggregation system. However this may take a while to lay out. |
I'm also on Slack under the same username; is there anything I can contribute to the matter?
@raynardj if you want to help, please write to me on slack (username Nicki Skafte), as I already have some code ready that you could help finish :] |
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
Code
Expected behavior
Precision/Recall/F1 results are expected to be consistent with those from sklearn.
Environment
Please copy and paste the output from our environment collection script (or fill out the checklist below manually). You can get the script and run it with:

- How you installed PyTorch (`conda`, `pip`, source): Pip

Additional context