Confidence intervals for classification performance metrics. Requires numpy and scipy.stats.
The function normal_approximation_binomial_confidence_interval should work for any metric computed as a ratio of successes to observations, such as TPR, TNR, FPR, FNR, precision, or negative predictive value. It uses the normal approximation to the binomial confidence interval.
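For orientation, here is a minimal sketch of that approximation, p_estimate ± z * sqrt(p * (1 - p) / n), where z is the two-sided critical value of the standard normal. It is only an illustration (binomial_ci_sketch is a placeholder name), not necessarily the exact code in the script:

import numpy as np
from scipy import stats

def binomial_ci_sketch(s, n, confidence_level=0.95):
    # Point estimate of the proportion of successes.
    p = s / n
    # Two-sided critical value of the standard normal (about 1.96 for 95%).
    z = stats.norm.ppf(1 - (1 - confidence_level) / 2)
    # Half-width of the interval under the normal approximation.
    dp = z * np.sqrt(p * (1 - p) / n)
    return [p, dp, p - dp, p + dp]

With s = 42 and n = 63 this gives roughly 0.667 ± 0.116, in line with the session below.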
The function f1_score_confidence_interval does the same for the F1-score, propagating the uncertainties of recall and precision through an expression that I derived myself (confirmation appreciated!).
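To make the idea concrete, here is one way such a propagation can be written: treat the numerator 2rp and the denominator r + p of f(r, p) = 2rp / (r + p) as independent quantities and combine their relative uncertainties in quadrature. This is only my reading of the approach (f1_ci_sketch is a placeholder name), not necessarily the exact expression in the script, although it reproduces the numbers in the session below:

import numpy as np

def f1_ci_sketch(r, p, dr, dp):
    # Point estimate of the F1-score.
    f1 = 2 * r * p / (r + p)
    # Relative uncertainty of the product r * p (quadrature of relative terms).
    rel_num = np.sqrt((dr / r) ** 2 + (dp / p) ** 2)
    # Relative uncertainty of the sum r + p (quadrature of absolute terms).
    rel_den = np.sqrt(dr ** 2 + dp ** 2) / (r + p)
    # Propagated half-width of the F1-score interval.
    df1 = f1 * np.sqrt(rel_num ** 2 + rel_den ** 2)
    return f1, df1, f1 - df1, f1 + df1

For comparison, a strict first-order propagation through the partial derivatives 2*p^2 / (r + p)^2 and 2*r^2 / (r + p)^2 would give a narrower half-width here (roughly 0.08 instead of 0.18), because r and p enter both the numerator and the denominator.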
Here's an example with true positives = 42, real positives = 63, and predicted positives = 69.
ipython
Python 3.7.2 (default, Dec 29 2018, 06:19:36)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.5.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import propagation_confidence_interval as prop
In [2]: prop.normal_approximation_binomial_confidence_interval?
Signature: prop.normal_approximation_binomial_confidence_interval(s, n, confidence_level=0.95)
Docstring:
Computes the binomial confidence interval of the probability of a success s,
based on the sample of n observations. The normal approximation is used,
appropriate when n is equal to or greater than 30 observations. The confidence
level is between 0 and 1, with default 0.95.
Returns [p_estimate, interval_range, lower_bound, upper_bound].
For reference, see Section 5.2 of Tom Mitchell's "Machine Learning" book.
File:      ~/stluc-experiments/scripts/propagation_confidence_interval.py
Type:      function
In [3]: recall_successes = 42
In [4]: recall_obs = 63
In [5]: [r, dr, r_lower_bound, r_upper_bound] = prop.normal_approximation_binomial_confidence_interval(recall_successes, recall_obs)
In [6]: r
Out[6]: 0.6666666666666666

In [7]: dr
Out[7]: 0.1164049796915108
In [8]: precision_successes = 42
In [9]: precision_obs = 69
In [10]: [p, dp, p_lower_bound, p_upper_bound] = prop.normal_approximation_binomial_confidence_interval(precision_successes, precision_obs)
In [11]: p
Out[11]: 0.6086956521739131

In [12]: dp
Out[12]: 0.1151545180912738
In [13]: prop.f1_score_confidence_interval?
Signature: prop.f1_score_confidence_interval(r, p, dr, dp)
Docstring:
Computes the confidence interval for the F1-score measure of classification
performance based on the values of recall (r), precision (p), and their
respective confidence interval ranges, or absolute uncertainty, about the
recall (dr) and the precision (dp).
Disclaimer: I derived the formula myself based on f(r, p) = 2rp / (r + p).
Nobody has revised my computation. Feedback appreciated!
File:      ~/stluc-experiments/scripts/propagation_confidence_interval.py
Type:      function

In [14]: prop.f1_score_confidence_interval(r, p, dr, dp)
Out[14]: (0.6363636363636365, 0.18307035592733717, 0.4532932804362993, 0.8194339922909737)
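As a quick sanity check on the point estimates: recall = 42/63 ≈ 0.667, precision = 42/69 ≈ 0.609, and F1 = 2*42 / (63 + 69) = 84/132 = 7/11 ≈ 0.636, which matches Out[14]; the estimated 95% interval for the F1-score then spans roughly 0.45 to 0.82.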