Commit 426772a

update fairness episode

annapmeyer committed Dec 21, 2023 (1 parent: 31bee32)
Showing 1 changed file with 59 additions and 6 deletions.

65 changes: 59 additions & 6 deletions episodes/3-model-eval.md
@@ -23,7 +23,34 @@ exercises: 0

Stakeholders often want to know the accuracy of a machine learning model -- what percent of predictions are correct? Accuracy can be complemented by finer-grained metrics: e.g., in a binary prediction setting, recall (the fraction of positive samples that are classified correctly) and precision (the fraction of samples classified as positive that actually are positive) are commonly used metrics.

Suppose we have a model that performs binary classification (+, -) on a test dataset of 1000 samples (let $n$ = 1000). A *confusion matrix* counts how many predictions fall into each of four quadrants: actual + with a positive prediction (++, a true positive), actual + with a negative prediction (+-, a false negative), actual - with a positive prediction (-+, a false positive), and actual - with a negative prediction (--, a true negative).

|             | True + | True - |
| ----------- | ------ | ------ |
| Predicted + | 300    | 80     |
| Predicted - | 25     | 595    |

So, for instance, 80 samples have a true class of - but get predicted as members of +.

We can compute the following metrics:
* Accuracy: What fraction of predictions are correct?
  * (300 + 595) / 1000 = 0.895
  * Accuracy is 89.5%
* Precision: What fraction of predicted positives are true positives?
  * 300 / (300 + 80) = 0.789
  * Precision is 78.9%
* Recall: What fraction of true positives are classified as positive?
  * 300 / (300 + 25) = 0.923
  * Recall is 92.3%
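
To sanity-check these numbers, here is a minimal Python sketch (the variable names are ours) that recomputes all three metrics directly from the confusion-matrix entries above:

```python
# Confusion-matrix entries from the table above
tp, fp = 300, 80   # predicted +: truly +, truly -
fn, tn = 25, 595   # predicted -: truly +, truly -

n = tp + fp + fn + tn        # 1000 samples in total
accuracy = (tp + tn) / n     # 0.895
precision = tp / (tp + fp)   # ~0.789
recall = tp / (tp + fn)      # ~0.923
print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}")
```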

:::::::::::::::::::::::::::::::::::::::::: callout

TODO we've discussed binary classification, but other types of tasks call for different metrics: e.g., top-k accuracy for multi-class problems, ROC curves and AUC for comparing classification thresholds, and mean squared error or $R^2$ for regression tasks

TODO also discuss F1 score (the harmonic mean of precision and recall) here?

::::::::::::::::::::::::::::::::::::::::::::::::::



:::::::::::::::::::::::::::::::::::::: challenge

@@ -54,19 +81,19 @@ Different accuracy metrics may be more relevant in different situations. Discuss

What does it mean for a machine learning model to be fair? There is no single definition of fairness, and fairness extends beyond the data, model internals, and model outputs to how a model is deployed in practice. But aggregate model outputs can be used to gain an overall understanding of how models behave with respect to different demographic groups -- an approach called group fairness.

In general, if there are no differences between groups, achieving fairness is easy. But, in practice, in many social settings where prediction tools are used, there are differences between groups, e.g., due to historical and current discrimination.

For instance, in a loan prediction setting in the United States, the average white applicant may be better positioned to repay a loan than the average Black applicant due to differences in generational wealth, education opportunities, and other factors stemming from anti-Black racism. Say 50% of white applicants are granted a loan, with a precision of 90% and a recall of 70% -- in other words, 90% of white people granted loans end up repaying them, and 70% of all people who would have repaid the loan, if given the opportunity, get the loan. Consider the following scenarios:

* (Demographic parity) We give loans to 50% of Black applicants in a way that maximizes overall accuracy.
* (Equalized odds) We give loans to X% of Black applicants, where X is chosen to maximize accuracy subject to keeping recall equal to 70%.
* (Group-level calibration) We give loans to X% of Black applicants, where X is chosen to maximize accuracy while keeping precision equal to 90%.

There are *many* notions of statistical group fairness, but most boil down to one of the three options above: demographic parity, equalized odds, and group-level calibration.
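
To make these notions concrete, here is a minimal Python sketch, assuming NumPy arrays of true labels, predictions, and group membership (the function name and toy data are ours, not from the lesson). Demographic parity compares selection rates across groups, equalized odds compares recall (the full definition also compares false positive rates), and group-level calibration compares precision:

```python
import numpy as np

def group_fairness_report(y_true, y_pred, group):
    """Print selection rate, recall, and precision separately per group."""
    for g in np.unique(group):
        mask = group == g
        yt, yp = y_true[mask], y_pred[mask]
        tp = np.sum((yp == 1) & (yt == 1))
        fp = np.sum((yp == 1) & (yt == 0))
        fn = np.sum((yp == 0) & (yt == 1))
        print(f"group {g}: selection rate={yp.mean():.2f} (demographic parity), "
              f"recall={tp / (tp + fn):.2f} (equalized odds), "
              f"precision={tp / (tp + fp):.2f} (calibration)")

# Hypothetical toy data: equal selection rates, but unequal error rates
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])
group_fairness_report(y_true, y_pred, group)
```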

**TODO** need example here, case study

**TODO** need discussion of individual fairness (especially if we keep the challenge below)

:::::::::::::::::::::::::::::::::::::: challenge

@@ -95,8 +122,34 @@ A - 3, B - 2, C - 4, D - 1

## Fairness in generative AI

Generative models learn from statistical patterns in real-world data, so they can reproduce -- and even amplify -- the biases present in that data.

### Natural language
TODO example machine translation (one well-documented case: translating gender-neutral pronouns from languages such as Turkish into English, where systems have defaulted to "he is a doctor" but "she is a nurse")

### Image generation
TODO (one possible example: text-to-image models that, when prompted for high-status occupations such as "a CEO", overwhelmingly generate images of white men)

## Improving fairness of models
Reweighting TODO talk through with a simple example?
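
As one possible starting point for that example, here is a minimal sketch of the reweighing scheme of Kamiran and Calders, which assigns each (group, label) cell the weight P(group) * P(label) / P(group, label) so that group membership and outcome look statistically independent in the weighted training data (the helper name and toy arrays are ours):

```python
import numpy as np

def reweighing_weights(group, y):
    """Sample weights that make group membership and the label
    statistically independent in the weighted training data."""
    n = len(y)
    weights = np.ones(n)
    for g in np.unique(group):
        for label in np.unique(y):
            cell = (group == g) & (y == label)
            if cell.any():
                expected = np.mean(group == g) * np.mean(y == label)
                observed = cell.sum() / n
                weights[cell] = expected / observed
    return weights

# Toy example: group "b" rarely sees the positive label, so its positive
# sample gets up-weighted; the result can be passed to fit(..., sample_weight=...)
group = np.array(["a", "a", "a", "b", "b", "b"])
y = np.array([1, 1, 0, 1, 0, 0])
print(reweighing_weights(group, y))  # [0.75 0.75 1.5  1.5  0.75 0.75]
```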

Post-processing change cutoffs TODO
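
And as one possible sketch of post-processing, assuming a trained model that outputs scores in [0, 1] (the helper name, cutoff values, and toy data are ours): in practice, the per-group cutoffs would be tuned on held-out data to equalize the chosen metric across groups.

```python
import numpy as np

def predict_with_group_cutoffs(scores, group, cutoffs):
    """Post-processing: apply a different decision threshold per group."""
    thresholds = np.array([cutoffs[g] for g in group])
    return (scores >= thresholds).astype(int)

scores = np.array([0.81, 0.40, 0.65, 0.55])
group = np.array(["a", "a", "b", "b"])
# A lower cutoff for group "b" raises its selection rate
print(predict_with_group_cutoffs(scores, group, {"a": 0.7, "b": 0.5}))  # [1 0 1 1]
```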

:::::::::::::::::::::::::::::::::::::: challenge

### Exploring fairness interventions

Computer scientists have proposed many interventions to improve the fairness of machine learning models. A partial list is available (TODO FIND RESOURCE). Visit the list and read about one of the methods. When would using that method be beneficial for fairness? How does it compare to the techniques we talked about above?

::::::::::::::::::::::::::::::::::::::::::::::::::

:::::::::::::: solution

### Solution

TODO (discuss one or two?)

:::::::::::::::::::::::::


::::::::::::::::::::::::::::::::::::: keypoints
