Replication package for the paper "Governing the Commons: Code Ownership and Code Clones in Large Scale Software Development"
-
Root directory Contains a Dockerfile that can be used to build a reproducible Docker image, including the full, anonymized, data set used in the paper, and the complete Renv file specifying the complete R environment.
-
analysis/ownership
Contains a number of RMarkdown files (named 01_xxx, 02_xxx, and so on), that deals with different aspects (EDA, modeling) of the analysis. All figures in the paper are generated from the file99_article_illustrations.html
-
analysis/.cache
Contains precomputed R data objects (models, reloo results) that is used to calculate results, unless you want to redo the analysis (see below in the CACHE section). -
analysis/output
Contains HTML and PDF files that were generated from the RMarkdown in the docker image. These files are not included in the docker image - to generate your own versions of them, you bind a directory inside the docker image to a directory on the machine where you run the docker image (see below).
The image has been built on an Ubuntu 22.04 system (x64 Linux), with 64 GiB internal memory and 21 GiB swap. Other steps might be needed on other architectures or OSes.
-
Build the docker image via:
docker build -t ownership-commons .
-
Output files are created in
/home/app/ownership/output
, and models are cached in/home/app/ownership/.cache
. These directories should be mapped to some directory on your local system. First create the needed local directories, and make them available for the docker image:mkdir -p ${PWD}/ownership/.cache && mkdir -p ${PWD}/ownership/output && chmod -R a+rwx ${PWD}/ownership
-
Run the image, and mount these directories to the container:
docker run --mount type=bind,source=${PWD}/ownership,target=/home/app/ownership ownership-commons
In case you only want to run some RMarkdown files, a prefix can be specified via a docker environment variable:
docker run --mount type=bind,source=${PWD}/ownership,target=/home/app/ownership -e PREFIX=02 ownership-commons
Standard shell globbing rules can be used (but beware to escape them from the regular shell used to start docker.
In case you want to run with a separately stored cache directory, use the CACHE environment variable, and make sure that it is mounted into the container. For instance:
docker run --mount type=bind,source=${PWD}/ownership,target=/home/app/ownership -e PREFIX=02 -e CACHE="../ownership/.cache" ownership-commons
Setting the RELOO
environment variable to TRUE
will run the reloo function,
which performs exact Leave-One-Out Cross-Validation once per datapoint signaled as potentially influential by the regular loo
function.
The result of the reloo will be stored in the output directory, where it can be further analysed (e.g. plotted).
docker run --mount type=bind,source=${PWD}/ownership,target=/home/app/ownership -e RELOO=TRUE -e PREFIX=02 -e CACHE="../ownership/.cache" ownership-commons
Beware that running reloo
might take several days to complete, so best done over long weekends, or on a machine which you do not interact with on a daily basis.
The output of the docker run
command will tell if it is performing the reloo
step, and the state of the models can be inspected via the docker top
command.
user@host:> docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
6ee210bac03e ownership-commons "/home/app/render_al…" 46 minutes ago Up 46 minutes vigorous_chebyshev
user@host:> docker top 6ee210bac03e
UID PID PPID C STIME TTY TIME CMD
2000 56576 56554 0 11:49 ? 00:00:00 /bin/sh /home/app/render_all.sh
2000 56618 56576 17 11:49 ? 00:08:18 /usr/local/lib/R/bin/exec/R --no-save --no-restore -e rmarkdown::render("analysis/02_intercept_only_model.Rmd",~+~params=list(cache="../ownership/.cache",~+~reloo=TRUE),~+~output_dir="ownership/output")
2000 63704 56618 99 12:22 ? 00:49:00 ./model_87faeed5fafcb73fb4487a98a2060430 id=3 random seed=189062706 data file=/tmp/RtmpyTRE1T/standata-969aeb6b8.json output file=/tmp/RtmpyTRE1T/model_87faeed5fafcb73fb4487a98a2060430-202403251058-3-56a7b2.csv refresh=0 profile_file=/tmp/RtmpyTRE1T/model_87faeed5fafcb73fb4487a98a2060430-profile-202403251058-3-67b236.csv method=sample num_samples=3000 num_warmup=1000 save_warmup=0 thin=1 algorithm=hmc engine=nuts adapt delta=0.95 engaged=1
2000 63746 56618 99 12:22 ? 00:48:25 ./model_87faeed5fafcb73fb4487a98a2060430 id=4 random seed=189062706 data file=/tmp/RtmpyTRE1T/standata-969aeb6b8.json output file=/tmp/RtmpyTRE1T/model_87faeed5fafcb73fb4487a98a2060430-202403251058-4-56a7b2.csv refresh=0 profile_file=/tmp/RtmpyTRE1T/model_87faeed5fafcb73fb4487a98a2060430-profile-202403251058-4-67b236.csv method=sample num_samples=3000 num_warmup=1000 save_warmup=0 thin=1 algorithm=hmc engine=nuts adapt delta=0.95 engaged=1
The number of parallelly executing models is controlled via the CORES
environment, which defaults to 2.
docker run --mount type=bind,source=${PWD}/ownership,target=/home/app/ownership -e CORES=2 -e PREFIX=02 -e CACHE="../ownership/.cache" ownership-commons
The number of threads in each model is controlled via the THREADS
environment, which defaults to 4.
docker run --mount type=bind,source=${PWD}/ownership,target=/home/app/ownership -e THREADS=4 -e PREFIX=02 -e CACHE="../ownership/.cache" ownership-commons
The number of physical CPUs (i.e. ignoring hyperthreads and fake CPUs) should be equal or less than CORES*THREADS
, as Stan and brms are both quite CPU-intensive.