[REVIEW] Test FIL probabilities with absolute error thresholds in python #3582
Conversation
Below are the absolute error distributions collected in the process.
I think this is a clear win for clarity of the testing guarantees. The old relative thresholds were hard to reason about. Here, we are clarifying that we will test proba to a 3e-7 threshold or better in all cases. So I like it ;) Thanks!
I will wait for @canonizer's comment before merging though, in case he has additional thoughts.
Thanks John! Could you also review #2894? Just don't merge it yet, so that we have proper documentation of where the changes came from.
rerun tests
Codecov Report
@@              Coverage Diff               @@
##           branch-0.19      #3582        +/-   ##
================================================
+ Coverage        35.24%     80.83%     +45.58%
================================================
  Files              378        227        -151
  Lines            26910      17737       -9173
================================================
+ Hits              9485      14337       +4852
+ Misses           17425       3400      -14025
@gpucibot merge
Probabilities are bounded to [0.0, 1.0], and we generally care more about large probabilities, which are on the order of O(1/n_classes). The largest relative probability errors are usually caused by a small ground-truth probability (e.g. 1e-3) rather than by a large absolute error. Hence, relative probability error is not the best metric; absolute probability error is more relevant.

Moreover, absolute probability error is more stable, since relative errors have a long tail. When training, or even just inferring, on many rows, the chance of seeing a ground-truth probability around 1e-3 or 1e-4 grows, and in some cases there is no reasonable and reliable relative threshold. Finally, as the number of predicted probabilities (clipped values) per input row grows, so does the long tail of relative probability errors, due to less undersampling. This makes the comparison unfair: binary classification against regression, and multiclass classification against binary classification.
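As a rough illustration of how a small ground-truth probability inflates relative error (the numbers below are made up for this sketch, not taken from the collected distributions):

```python
import numpy as np

# Hypothetical ground-truth and predicted probabilities for one 3-class row.
# Values are illustrative only; they are not from the FIL test runs.
truth = np.array([1e-3, 0.499, 0.500])
pred = np.array([2e-3, 0.4995, 0.4985])

abs_err = np.abs(pred - truth)   # [1.0e-3, 5.0e-4, 1.5e-3]
rel_err = abs_err / truth        # [1.0,    ~1e-3,  ~3e-3]

# The tiny class probability dominates the relative error (100% off),
# even though every absolute error is already below 2e-3.
print(abs_err.max(), rel_err.max())
```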
The changes below are based on collecting absolute errors under --run_unit, --run_quality and --run_stress. These thresholds are violated at most a couple of times per million samples, and in most cases never.