Authors: Victor Ojewale, Ryan Steed, Briana Vecchione, Abeba Birhane, Deb Raji
Developers: Ryan Steed, Victor Ojewale
This tool powers the landscape visualization and data analysis for the Open Source Audit Tooling project, including results in our paper, "Towards AI Accountability Infrastructure: Gaps and Opportunities in AI Audit Tooling."
make venv
All code used to analyze our database and produce the results in our paper can be found in analysis.Rmd, drawing on code from utils.R. All analysis requires a cleaned and pre-processed version of our Airtable database, produced using the instructions below. data/airtable.csv contains the most recent copy of our Airtable database.
- [Optional] Download the most recent Airtable CSV to data/airtable.csv.
- [Optional] Obtain access to Crunchbase and Github data using the instructions below. If you do not have access to Crunchbase data or a Github personal access token, you can skip this data with the options --no-crunchbase or --no-github. You will also need to skip code chunks requiring variables beginning with gh_ or cb_ in analysis.Rmd.
- Run python clean.py pivot data/airtable.csv. This subcommand explodes the Taxonomy field into three levels, cleans the data, and joins in Github and Crunchbase data. Cleaned output is stored in output/airtable_for_pivot.csv.
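To illustrate what "exploding" the Taxonomy field into three levels could look like, here is a minimal standard-library sketch. It is not the code in clean.py. The separators ("," between taxonomy paths, ">" between levels), the output column names taxonomy_level_1 to taxonomy_level_3, and the sample rows are all assumptions made for illustration.

```python
import csv
import io

# ASSUMPTION: several taxonomy paths may share one cell, separated by ",",
# and the levels within a path are separated by ">". The real export may differ.
PATH_SEP = ","
LEVEL_SEP = ">"


def explode_taxonomy(rows, field="Taxonomy", n_levels=3):
    """Yield one row per taxonomy path, with the path split into n_levels columns."""
    for row in rows:
        raw = row.get(field) or ""
        paths = [p.strip() for p in raw.split(PATH_SEP) if p.strip()]
        # Keep rows with an empty taxonomy as a single row with blank levels.
        for path in paths or [""]:
            levels = [part.strip() for part in path.split(LEVEL_SEP)] if path else []
            # Pad short paths and truncate long ones to exactly n_levels entries.
            levels = (levels + [""] * n_levels)[:n_levels]
            out = {k: v for k, v in row.items() if k != field}
            for i, level in enumerate(levels, start=1):
                out[f"taxonomy_level_{i}"] = level  # hypothetical column name
            yield out


# Made-up sample data, not taken from data/airtable.csv.
sample = io.StringIO(
    "Name,Taxonomy\n"
    'Tool A,"L1a > L2a > L3a, L1b > L2b"\n'
    "Tool B,\n"
)
exploded = list(explode_taxonomy(csv.DictReader(sample)))
for row in exploded:
    print(row)
```

Each input row yields one output row per taxonomy path, so a tool tagged with two paths appears twice in the exploded data.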
Add your Github personal access token, with permission to read public repositories, to secrets.json. The file should look like this:
{
"github_token": "YOUR_TOKEN"
}
clean.py pivot will scrape from Github using this token.
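As a rough sketch of how a script might consume this file, the helper below reads secrets.json and builds request headers for the Github REST API. The function names load_github_token and github_headers are hypothetical and are not taken from clean.py; only the github_token key comes from the example above.

```python
import json
from pathlib import Path


def load_github_token(path="secrets.json"):
    """Read the personal access token from a secrets file shaped like the example above."""
    secrets = json.loads(Path(path).read_text(encoding="utf-8"))
    token = secrets.get("github_token", "")
    # Fail early if the key is missing or still holds the placeholder value.
    if not token or token == "YOUR_TOKEN":
        raise ValueError(f"No usable github_token found in {path}")
    return token


def github_headers(token):
    """Build headers for authenticated requests to the Github REST API."""
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
```

Calling github_headers(load_github_token()) from the repository root would then give a dictionary that can be passed to any HTTP client.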
Crunchbase data
Crunchbase does not allow us to redistribute the data we used for our analysis. To obtain a copy of your own, request access to the Crunchbase Research Access program or buy a subscription to Crunchbase Pro.
Note that you may need to change variable names in analysis.Rmd if the Crunchbase schema has changed.
- Create a file of organization names and domains needed, based on data/airtable.csv (e.g., crunchbase/cb_query.csv).
- Use Crunchbase's import function to create a List of those organizations.
- Download all available columns to crunchbase, creating separate files (crunchbase/cb-query_*.csv) for companies, investors, and schools.
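The first step above could be scripted along these lines. This is only a sketch: the input column names Organization and Website and the output column names name and domain are hypothetical, so adjust them to match the actual Airtable export and whatever Crunchbase's import function expects.

```python
import csv
from urllib.parse import urlparse


def domain_of(url):
    """Return the bare host name of a URL, without scheme, port, or leading 'www.'."""
    url = (url or "").strip()
    if not url:
        return ""
    if "://" not in url:
        url = "http://" + url  # urlparse needs a scheme to locate the host
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def build_query_rows(rows, name_col="Organization", url_col="Website"):
    """Collect unique (name, domain) pairs; the column names are ASSUMED."""
    seen = set()
    out = []
    for row in rows:
        name = (row.get(name_col) or "").strip()
        domain = domain_of(row.get(url_col))
        if name and (name, domain) not in seen:
            seen.add((name, domain))
            out.append({"name": name, "domain": domain})
    return out


def write_query_file(in_path, out_path):
    """Read the Airtable CSV and write the organization query file."""
    with open(in_path, newline="", encoding="utf-8") as f:
        rows = build_query_rows(csv.DictReader(f))
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "domain"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

For example, write_query_file("data/airtable.csv", "crunchbase/cb_query.csv") would write one row per distinct organization, provided the assumed columns exist.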
With research access, download the Daily CSV Export. Using sqlite3, import organizations.csv:
.mode csv
.import PATH_TO_CSV organizations
.save organizations.db
Then, use the flag --from-sql in calls to clean.py pivot.
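If the sqlite3 command-line shell is not available, a roughly equivalent import can be done with Python's built-in sqlite3 module. The sketch below is an assumption-laden alternative rather than part of this repository: import_csv is a hypothetical helper, columns are created without types, and every row is assumed to have the same number of fields as the header.

```python
import csv
import sqlite3


def import_csv(csv_path, db_path, table="organizations"):
    """Load a CSV with a header row into a SQLite table, like .import in CSV mode."""

    def quote(identifier):
        # Quote identifiers so unusual column names do not break the SQL.
        return '"' + identifier.replace('"', '""') + '"'

    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)  # first row supplies the column names
        columns = ", ".join(quote(c) for c in header)
        placeholders = ", ".join("?" for _ in header)
        conn = sqlite3.connect(db_path)
        try:
            with conn:  # commit on success, roll back on error
                conn.execute(f"CREATE TABLE IF NOT EXISTS {quote(table)} ({columns})")
                conn.executemany(
                    f"INSERT INTO {quote(table)} VALUES ({placeholders})", reader
                )
        finally:
            conn.close()
    return header
```

For example, import_csv("PATH_TO_CSV", "organizations.db") would create the organizations table in organizations.db.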
To generate YAML for the landscape visualization:
- [Optional] Download the most recent Airtable CSV to data/airtable.csv.
- Run python clean.py yaml data/airtable.csv ../landscape.yml.
On a Mac, you may also need to install graphviz:
brew install graphviz
pip install --global-option=build_ext --global-option="-I/usr/local/Cellar/graphviz/8.0.5/include/" --global-option="-L/usr/local/Cellar/graphviz/8.0.5/lib/" pygraphviz
- clean.py: Script for cleaning and joining data from Airtable, Github, and Crunchbase.
  - clean.py pivot: Clean and join data for analysis.
  - clean.py yaml: Clean and join data used to generate landscape visualization.
- analysis.Rmd: R Markdown file for generating plots and results used in our paper.
- utils.R: R utility functions for generating plots and results used in our paper.
- data/
  - airtable.csv: Most recent copy of our Airtable database.
  - taxonomy.json: JSON copy of our taxonomy tree.
- output/
  - airtable_for_pivot.csv: Cleaned and joined data for analysis.Rmd.
  - landscape.yml: YAML for landscape visualization.
- crunchbase/: Directory for storing Crunchbase data and (optionally) Crunchbase query file.