Merge pull request #537 from carpentries-incubator/bias_evaluation

Bias evaluation
carpentries-incubator · Jan 14, 2025 · 69d0940
2 parents cf5996c + ce7d8cc
commit 69d0940

Showing 3 changed files with 75 additions and 2 deletions.
diff --git a/episodes/6-outlook.Rmd b/episodes/6-outlook.Rmd
@@ -39,8 +39,8 @@ In short, the deep learning problem is that of finding out how similar two molec
 based on their mass spectrum.
 You can compare this to comparing two pictures of animals, and predicting how similar they are.
 
-A siamese neural network is used to solve the problem.
-In a siamese neural network you have two input vectors, let's say two images of animals or two mass spectra.
+A Siamese neural network is used to solve the problem.
+In a Siamese neural network you have two input vectors, let's say two images of animals or two mass spectra.
 They pass through a base network. Instead of outputting a class or number with one or a few output neurons, the output layer
 of the base network is a whole vector of for example 100 neurons. After passing through the base network, you end up with two of these
 vectors representing the two inputs. The goal of the base network is to output a meaningful representation of the input (this is called an embedding).
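
The Siamese setup described in this hunk can be sketched in a few lines. This is a minimal illustration in plain Python, not the Keras model used in the lesson: the helper names (`make_base_network`, `siamese_similarity`), the single dense layer with random, untrained weights, and the use of cosine similarity to compare the two embeddings are all assumptions made for the example. Only the idea of one shared base network producing a 100-dimensional embedding per input comes from the text.

```python
import math
import random


def make_base_network(n_inputs, n_embedding, seed=0):
    """Return a function that maps an input vector to an embedding vector.

    A single dense layer with tanh activation stands in for the base network.
    The weights are random and untrained (assumption for illustration only).
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
               for _ in range(n_embedding)]

    def embed(x):
        return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]

    return embed


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (assumed comparison measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def siamese_similarity(embed, x1, x2):
    """Pass both inputs through the same base network, then compare embeddings."""
    return cosine_similarity(embed(x1), embed(x2))


# Hypothetical toy inputs standing in for two mass spectra.
embed = make_base_network(n_inputs=4, n_embedding=100)
spectrum_a = [0.1, 0.9, 0.0, 0.3]
spectrum_b = [0.2, 0.8, 0.1, 0.3]

print(len(embed(spectrum_a)))  # size of one embedding
print(siamese_similarity(embed, spectrum_a, spectrum_b))
```

The key design point is that `embed` is the same function, with the same weights, for both inputs. In a real model the weights would be trained so that similar inputs end up with similar embeddings.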
@@ -87,6 +87,33 @@ in this course. This is quite common for applied deep learning projects. It is s
 deep learning problem is spent on data preparation, and only 10% on modeling!
 :::
 
+::: discussion
+## Bias and Evaluation
+
+Bias is discussed frequently, and on various levels, in the context of machine learning, deep learning and artificial intelligence.
+That is because there are many aspects to bias.
+On the one hand, bias is very technical: a model can be biased towards certain classes or certain features.
+On the other hand, this can have a very practical and severe impact on the users of such a model,
+for instance when it comes to misclassification in relation to skin color or geographical location.
+
+If such biases are reflected in a dataset that is used for model validation and testing, you might not be able to see them.
+To get an evaluation that is representative of the diversity found in the real world, it is therefore important to use a test set that reflects this diversity as much as possible.
+
+The need for such a dataset, as opposed to existing datasets that mostly presume Western standards, was one of the motivations for creating the Dollar Street Dataset, and it is why we have used it in this lesson.
+The creators [have shown](https://papers.nips.cc/paper_files/paper/2022/hash/5474d9d43c0519aa176276ff2c1ca528-Abstract-Datasets_and_Benchmarks.html) that more diversity in a training dataset can contribute to significant model improvements.
+A model trained on a more diverse dataset is more robust against unexpected occurrences.
+
+Therefore, it is important to fully understand the quantitative evaluation of a new model:
+it reflects the model's performance on the test set, but it does not say anything about how well that dataset represents the real world.
+Also be aware that such matters can be related to racism and other forms of discrimination.
+Depending on the use case, diversity can also refer to imbalance on other, more subtle and less sensitive dimensions.
+
+**Discuss the following questions with your neighbors:**
+
+- What forms of bias and data imbalance can you think of?
+- How would they affect the performance of a deep learning model?
+:::
+
 ::: discussion
 ## Discussion: Large Language Models and prompt engineering
 Large Language Models (LLMs) are deep learning models that are able to perform general-purpose language generation.
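
The point made in the "Bias and Evaluation" discussion box above, that a single test-set score can hide differences between subgroups, can be illustrated with a small sketch. The function names, the labels and the group tags `"A"` and `"B"` are hypothetical toy values chosen for the example; they are not taken from the lesson or from the Dollar Street Dataset.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def accuracy_by_group(y_true, y_pred, groups):
    """Return a dict mapping each group tag to the accuracy within that group."""
    by_group = defaultdict(lambda: ([], []))
    for t, p, g in zip(y_true, y_pred, groups):
        by_group[g][0].append(t)
        by_group[g][1].append(p)
    return {g: accuracy(ts, ps) for g, (ts, ps) in by_group.items()}


# Hypothetical toy test set: group "B" is under-represented (2 of 8 samples).
y_true = ["stove", "stove", "bed", "bed", "stove", "bed", "stove", "bed"]
y_pred = ["stove", "stove", "bed", "bed", "stove", "bed", "bed", "stove"]
groups = ["A", "A", "A", "A", "A", "A", "B", "B"]

overall = accuracy(y_true, y_pred)
per_group = accuracy_by_group(y_true, y_pred, groups)

print(overall)    # 0.75
print(per_group)  # {'A': 1.0, 'B': 0.0}
```

In this toy case the overall accuracy of 0.75 looks acceptable, while every sample from group "B" is misclassified. Reporting the metric per group is one simple way to make such an imbalance visible, provided the test set contains group information at all.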

diff --git a/paper.bib b/paper.bib
@@ -150,3 +150,43 @@ @software{Pollard_Introduction_to_artificial_2022
     version = {0.1.0},
     year = {2022}
 }
+
+@misc{horst_allisonhorstpalmerpenguins_2020,
+  title     = {allisonhorst/palmerpenguins: v0.1.0},
+  url       = {https://doi.org/10.5281/zenodo.3960218},
+  publisher = {Zenodo},
+  author    = {Horst, Allison M. and Hill, Alison Presmanes and Gorman, Kristen B.},
+  month     = jul,
+  year      = {2020},
+  doi       = {10.5281/zenodo.3960218}
+}
+
+@misc{huber_weather_2022,
+	title = {Weather prediction dataset},
+	copyright = {Creative Commons Attribution 4.0 International, Open Access},
+	url = {https://zenodo.org/record/4770936},
+	doi = {10.5281/ZENODO.4770936},
+	abstract = {Dataset created for machine learning and deep learning training and teaching purposes. It can, for instance, be used for classification, regression, and forecasting tasks. Complex enough to demonstrate realistic issues such as overfitting and unbalanced data, while still remaining intuitively accessible. Description and units of weather features: Data includes the following features/variables for several European cities (feature, column name, description and physical unit): mean temperature (\_temp\_mean): mean daily temperature in 1 °C; max temperature (\_temp\_max): max daily temperature in 1 °C; min temperature (\_temp\_min): min daily temperature in 1 °C; cloud\_cover (\_cloud\_cover): cloud cover in oktas; global\_radiation (\_global\_radiation): global radiation in 100 W/m2; humidity (\_humidity): humidity in 1 \%; pressure (\_pressure): pressure in 1000 hPa; precipitation (\_precipitation): daily precipitation in 10 mm; sunshine (\_sunshine): sunshine hours in 0.1 hours; wind\_speed (\_wind\_gust): wind gust in 1 m/s; wind\_gust (\_wind\_speed): wind speed in 1 m/s. File descriptions: weather\_prediction\_dataset.csv - Main data file, tabular data, comma-separated CSV. Contains the data for different weather features (daily observations, see below for more details) for 18 European cities or places through the years 2000 to 2010. weather\_prediction\_picnic\_labels.csv - Optional data to be used as potential labels for classification tasks. Contains booleans to characterize the daily weather conditions as suitable for a picnic (True) or not (False) for all 18 locations in the dataset. weather\_prediction\_dataset\_map.png - Simple map showing all 18 locations in Europe. metadata.txt - Further information on the dataset, the data processing, and conversion, as well as the description and units of all weather features. ORIGINAL DATA TAKEN FROM: EUROPEAN CLIMATE ASSESSMENT \& DATASET (ECA\&D), file created on 22-04-2021. THESE DATA CAN BE USED FREELY PROVIDED THAT THE FOLLOWING SOURCE IS ACKNOWLEDGED: Klein Tank, A.M.G. and Coauthors, 2002. Daily dataset of 20th-century surface air temperature and precipitation series for the European Climate Assessment. Int. J. of Climatol., 22, 1441-1453. Data and metadata available at http://www.ecad.eu For more information see metadata.txt file. The dataset has also been presented at the Teaching Machine Learning Workshop at ECML 2022: https://teaching-ml.github.io/2022/. The Python code used to create the weather prediction dataset from the ECA\&D data can be found on GitHub: https://github.com/florian-huber/weather\_prediction\_dataset (this repository also contains Jupyter notebooks with teaching examples) Versions: v5: updated metadata.txt file. v4: to be more future proof in times of climate change/crisis --> "BBQ weather" prediction is now "picnic weather" prediction. Data itself remains unchanged. v3: added "light" version of the dataset with less features (only 11 locations and fewer variables, reduction from 163 to 89 features) --> This is meant to be used if training times for hands-on session is becoming an issues v2: now also contains additional `BBQ\_weather` labels, the dataset itself has not changed between versions v1 and v2},
+	language = {en},
+	urldate = {2025-01-14},
+	publisher = {Zenodo},
+	author = {Huber, Florian and van Kuppevelt, Dafne and Steinbach, Peter and Sauze, Colin and Liu, Yang and Weel, Berend},
+	month = sep,
+	year = {2022},
+	keywords = {machine learning, deep learning, training data, teaching material},
+}
+
+@article{gaviria_rojas_dollar_2022,
+  title      = {The {Dollar} {Street} {Dataset}: {Images} {Representing} the {Geographic} and {Socioeconomic} {Diversity} of the {World}},
+  volume     = {35},
+  shorttitle = {The {Dollar} {Street} {Dataset}},
+  url        = {https://papers.nips.cc/paper_files/paper/2022/hash/5474d9d43c0519aa176276ff2c1ca528-Abstract-Datasets_and_Benchmarks.html},
+  language   = {en},
+  urldate    = {2025-01-14},
+  journal    = {Advances in Neural Information Processing Systems},
+  author     = {Gaviria Rojas, William and Diamos, Sudnya and Kini, Keertan and Kanter, David and Janapa Reddi, Vijay and Coleman, Cody},
+  month      = dec,
+  year       = {2022},
+  pages      = {12979--12990},
+  file       = {Full Text PDF:/Users/carstenschnober/Zotero/storage/PJZDNZTV/Gaviria Rojas et al. - 2022 - The Dollar Street Dataset Images Representing the.pdf:application/pdf}
+}
diff --git a/paper.md b/paper.md
@@ -87,6 +87,12 @@ implement a basic deep learning model in Python with Keras,
 monitor and troubleshoot the training process, and implement different layer types, 
 such as convolutional layers.
 
+We use datasets that have permissive licenses and are designed for real-world use cases:
+
+- The Penguin dataset [@horst_allisonhorstpalmerpenguins_2020]
+- The Weather prediction dataset [@huber_weather_2022]
+- The Dollar Street Dataset [@gaviria_rojas_dollar_2022], which is representative and contains accurate demographic information that helps ensure the robustness and fairness of models, especially for smaller subpopulations.
+
 # Statement of Need
 There are many free online course materials on deep learning, 
 see for example: @noauthor_fastai_nodate; @noauthor_udemy_nodate; @noauthor_udemy_nodate-1; @noauthor_udemy_nodate-2; @noauthor_coursera_nodate; @noauthor_freecodecamporg_2022.