This document summarizes the datasets (uploaded to the Stanford Digital Repository) and the code (located in this repository) needed to replicate and reproduce the tables in Hwang et al.'s (2023) paper "Curating Training Data for Reliable Large-Scale Visual Data Analysis: Lessons from Identifying Trash in Street View Imagery". These materials are intended to help users adapt a similar pipeline to other outcomes or to data from other places, or to use our data in other research.
Data are stored in the Stanford Digital Repository.
single_image.csv: This dataset includes every image that received a single-image rating from Mechanical Turk or from coding sessions, plus all images with TrueSkill ratings or ML predictions. Each row corresponds to one image, with columns indicating the answer given in each of these surveys and/or the relevant TrueSkill rating and ML prediction.
pairs.csv: This dataset includes all images that were rated as pairs in Mechanical Turk or coding sessions. Each row corresponds to one pair of images.
For both datasets, the accompanying codebook (SMR_Data_Codebook.xlsx) details each column’s values and their meanings.
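As a quick orientation, a minimal Python sketch for loading both files and listing their columns might look like the following; the file paths are assumptions about where the downloaded data is unpacked, and SMR_Data_Codebook.xlsx remains the authoritative reference for each column's meaning:

```python
# Minimal sketch: load both datasets and inspect their columns.
# File paths are assumptions about where the SDR download is unpacked;
# consult SMR_Data_Codebook.xlsx for the meaning of each column.
import pandas as pd

single_image = pd.read_csv("single_image.csv")  # one row per image
pairs = pd.read_csv("pairs.csv")                # one row per image pair

print(single_image.shape, list(single_image.columns))
print(pairs.shape, list(pairs.columns))
```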
Note: image names do not indicate the location, but users can request the associated location from the authors.
results-input.Rmd uses single_image.csv and pairs.csv to generate all of the tables and figures in the paper.
To apply the pipeline to other input data structured like these datasets, results-anydata.Rmd generates the generic reliability measures.
PCA results and cosine similarities for images with rating discrepancies, as described in the paper, are generated using pca_image_feature_analysis.ipynb.
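The notebook contains the authors' implementation; purely as a general illustration of the two techniques, PCA and pairwise cosine similarity over an image-feature matrix can be computed with scikit-learn as below, where the random matrix X is a stand-in for whatever features the notebook actually extracts:

```python
# Illustrative sketch only -- not the notebook's code. Assumes a feature
# matrix X of shape (n_images, n_features), e.g., CNN embeddings.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 512))          # placeholder image features

pca = PCA(n_components=10)
components = pca.fit_transform(X)        # low-dimensional projection
print(pca.explained_variance_ratio_)     # variance captured per component

sims = cosine_similarity(X)              # pairwise cosine similarity
print(sims.shape)                        # (100, 100)
```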
Class Activation Maps (CAMs) are generated using cams.py.
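cams.py holds the authors' implementation. For readers unfamiliar with the technique, a self-contained sketch of classic CAM (Zhou et al., 2016: a class-specific weighted sum of the final convolutional feature maps, using the classifier weights) on a stock torchvision ResNet might look like this; the model, weights, and image path are placeholders, not the paper's setup:

```python
# Illustrative sketch of the classic CAM technique, not the code in
# cams.py. Model choice and "example.jpg" are placeholder assumptions.
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

feats = {}
model.layer4.register_forward_hook(
    lambda m, i, o: feats.update(maps=o)  # capture last conv feature maps
)

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
x = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    logits = model(x)
cls = logits.argmax(dim=1).item()

# CAM = classifier weights for the predicted class applied to the
# final feature maps, then upsampled to the input resolution.
w = model.fc.weight[cls]                               # (512,)
cam = torch.einsum("c,chw->hw", w, feats["maps"][0])   # (7, 7)
cam = F.relu(cam)
cam = F.interpolate(cam[None, None], size=(224, 224), mode="bilinear")[0, 0]
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # scale to [0, 1]
```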
The discrepancies input for the PCA, cosine similarity, and CAM analyses can be generated by the user from single_image.csv.
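For example, if two rating columns are compared to flag disagreements, a discrepancy list could be derived as sketched below; the column names mturk_rating and ml_prediction are hypothetical stand-ins, so substitute the actual columns documented in SMR_Data_Codebook.xlsx:

```python
# Hedged sketch: flag images where two ratings disagree. The columns
# 'mturk_rating' and 'ml_prediction' are hypothetical placeholders;
# use the actual column names from SMR_Data_Codebook.xlsx.
import pandas as pd

single_image = pd.read_csv("single_image.csv")
discrepancies = single_image[
    single_image["mturk_rating"] != single_image["ml_prediction"]
]
discrepancies.to_csv("discrepancies.csv", index=False)
```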