Classification metrics overhaul: stat scores (3/n) #4839
Conversation
Hello @tadejsv! Thanks for updating this PR.
Comment last updated at 2020-12-30 18:58:06 UTC
@tadejsv @justusschock @SkafteNicki how is it going here? :]
@Borda @SkafteNicki @justusschock @teddykoker @rohitgr7 This is ready for (re)review :)
I really like it!
still reading...
LGTM... Great work!!!!
I'd recommend waiting for other reviewers before merging.
Great job as always :]
This PR is a spin-off from #4835, based on new input formatting from #4837
This will provide a basis for future PRs for recall, precision, fbeta and iou metrics.
What does this PR do?
`top_k` parameter for input formatting now also works with multi-label inputs

This was done so that StatScores can also provide a basis for Recall@K and Precision@K later, because these two metrics always take multi-label inputs and count the top K highest-probability predictions as True. For multi-class inputs this parameter works as before.

This addition was done in the input formatting function. This means that multi-label inputs can now be binarized in two ways: through the `threshold` parameter, or through the `top_k` parameter. I have decided to give the `top_k` parameter preference if both are set.

For Top-K Accuracy, multi-label inputs don't make sense (or at least I have not seen any use of it), so I have updated the Accuracy metric so that an error is raised if `top_k` is used with multi-label inputs.
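To make the two binarization routes concrete, here is a minimal, illustrative sketch (not the PR's actual implementation, and `binarize_top_k` is a hypothetical helper name) of how multi-label probabilities could be turned into 0/1 predictions, with `top_k` taking preference over `threshold` when both are set:

```python
def binarize_top_k(probs, top_k=None, threshold=0.5):
    """Turn a list of per-label probabilities into 0/1 predictions.

    If top_k is set, it takes preference over threshold (as described
    above): the top_k highest-probability labels are predicted as 1.
    """
    if top_k is not None:
        # Indices of the k largest probabilities
        top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
        return [1 if i in top else 0 for i in range(len(probs))]
    # Otherwise fall back to plain thresholding
    return [1 if p >= threshold else 0 for p in probs]

probs = [0.1, 0.8, 0.6, 0.3]
print(binarize_top_k(probs, threshold=0.5))  # [0, 1, 1, 0]
print(binarize_top_k(probs, top_k=1))        # [0, 1, 0, 0]
```

Note that with `top_k` exactly K labels are predicted positive per sample, while thresholding can yield any number of positives.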
New StatScores metric (and updated functional counterpart)
Computes stat score, i.e. true positives, false positives, true negatives, false negatives. It is used as a base for many other metrics (recall, precision, fbeta, iou). It is made to work with all types of inputs, and is very configurable. There are two main parameters here:
- `reduce`: This determines how the statistics should be counted: globally (summing across all labels), by classes, or by samples. The possible values (`micro`, `macro`, `samples`) correspond to the averaging names used for metrics such as precision. This is "inspired" by sklearn's averaging argument in such metrics.
- `mdmc_reduce`: In case of multi-dimensional multi-class (mdmc) inputs, how should the statistics be reduced? This applies on top of the `reduce` argument. The possible values are `global` (i.e. the extra dimensions are treated as sample dimensions) and `samplewise` (compute statistics for each sample, taking the extra dimensions as a sample-within-sample dimension).

Why? The reason for these two options (right now PL metrics implements the `global` option by default) is that in some "downstream" metrics, such as iou, it is, in my opinion, much more natural to compute the metric per sample and then average across samples, rather than join everything into one "blob" and compute the averages for this blob. For example, if you are doing image segmentation, it makes more sense to compute the metrics per image, as the model is trained on images, not blobs :) Also, aggregating everything may disguise some unwanted behavior (such as an inability to predict a minority class), which would be evident if averaging was done per sample (`samplewise`).

Also, this class metric (and the functional equivalent) now returns the stat scores concatenated in a single tensor, instead of returning a tuple. I did this because the standard metrics testing framework in PL does not support non-tensor returns, and the change should be minor for users.
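To illustrate what `reduce` does, here is a small pure-Python sketch (not the library implementation; `stat_scores` here is a toy stand-in) that counts tp/fp/tn/fn per class one-vs-rest and then either sums them (`micro`) or keeps them per class (`macro`):

```python
def stat_scores(preds, target, num_classes, reduce="micro"):
    """preds/target: lists of integer class labels.

    Returns [tp, fp, tn, fn] summed over classes ("micro"),
    or one [tp, fp, tn, fn] row per class ("macro").
    """
    per_class = []
    for c in range(num_classes):
        # One-vs-rest counts for class c
        tp = sum(1 for p, t in zip(preds, target) if p == c and t == c)
        fp = sum(1 for p, t in zip(preds, target) if p == c and t != c)
        fn = sum(1 for p, t in zip(preds, target) if p != c and t == c)
        tn = len(preds) - tp - fp - fn
        per_class.append([tp, fp, tn, fn])
    if reduce == "micro":
        # Sum the counts across all classes into a single set of scores
        return [sum(col) for col in zip(*per_class)]
    return per_class

preds  = [0, 1, 1, 2]
target = [0, 1, 2, 2]
print(stat_scores(preds, target, num_classes=3, reduce="micro"))  # [3, 1, 7, 1]
```

In the same spirit, `samplewise` mdmc reduction would run this counting separately for each sample (over the extra dimensions) and stack the results, rather than pooling everything globally first.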
I have deprecated the `stat_scores_multiple_classes` metric, as `stat_scores` is now perfectly capable of handling multiple classes itself.

Documentation
A second part of the "Input types" section is added, with examples of using the `is_multiclass` parameter with `StatScores`.