\documentclass[10pt,letterpaper]{article}
\usepackage[top=0.85in,left=2.75in,footskip=0.75in]{geometry}
% amsmath and amssymb packages, useful for mathematical formulas and symbols
\usepackage{amsmath,amssymb}
% Use adjustwidth environment to exceed column width (see example table in text)
\usepackage{changepage}
% Use Unicode characters when possible
\usepackage[utf8x]{inputenc}
% textcomp package and marvosym package for additional characters
\usepackage{textcomp,marvosym}
% cite package, to clean up citations in the main text. Do not remove.
\usepackage{cite}
% Use nameref to cite supporting information files (see Supporting Information section for more info)
\usepackage{nameref,hyperref}
% line numbers
\usepackage[right]{lineno}
% ligatures disabled
\usepackage{microtype}
\DisableLigatures[f]{encoding = *, family = * }
% color can be used to apply background shading to table cells only
\usepackage[table]{xcolor}
% array package and thick rules for tables
\usepackage{array}
% enumerate package lets us use letters instead of numbers
\usepackage{enumerate}
% create "+" rule type for thick vertical lines
\newcolumntype{+}{!{\vrule width 2pt}}
% create \thickcline for thick horizontal lines of variable length
\newlength\savedwidth
\newcommand\thickcline[1]{%
\noalign{\global\savedwidth\arrayrulewidth\global\arrayrulewidth 2pt}%
\cline{#1}%
\noalign{\vskip\arrayrulewidth}%
\noalign{\global\arrayrulewidth\savedwidth}%
}
% \thickhline command for thick horizontal lines that span the table
\newcommand\thickhline{\noalign{\global\savedwidth\arrayrulewidth\global\arrayrulewidth 2pt}%
\hline
\noalign{\global\arrayrulewidth\savedwidth}}
% Remove comment for double spacing
%\usepackage{setspace}
%\doublespacing
% Text layout
\raggedright
\setlength{\parindent}{0.5cm}
\textwidth 5.25in
\textheight 8.75in
% Bold the 'Figure #' in the caption and separate it from the title/caption with a period
% Captions will be left justified
\usepackage[aboveskip=1pt,labelfont=bf,labelsep=period,justification=raggedright,singlelinecheck=off]{caption}
\renewcommand{\figurename}{Fig}
% Use the PLoS provided BiBTeX style
\bibliographystyle{plos2015}
% Remove brackets from numbering in List of References
\makeatletter
\renewcommand{\@biblabel}[1]{\quad#1.}
\makeatother
% Leave date blank
\date{}
% Header and Footer with logo
\usepackage{lastpage,fancyhdr,graphicx}
\usepackage{epstopdf}
\pagestyle{myheadings}
\pagestyle{fancy}
\fancyhf{}
\setlength{\headheight}{27.023pt}
\lhead{\includegraphics[width=2.0in]{PLOS-submission.eps}}
\rfoot{\thepage/\pageref{LastPage}}
\renewcommand{\footrule}{\hrule height 2pt \vspace{2mm}}
\fancyheadoffset[L]{2.25in}
\fancyfootoffset[L]{2.25in}
\lfoot{\sf PLOS}
%% Include all macros below
\newcommand{\withurl}[2]{{#1}}
\newcommand{\practicesection}[2]{\section{#1}\label{#2}}
\newcommand{\practice}[2]{\textbf{\emph{{#2}~({#1})}}}
\begin{document}
\vspace*{0.2in}
\begin{flushleft}
{\Large
\textbf\newline{Good Enough Practices in Scientific Computing}
}
\newline
\\
{Greg~Wilson}\textsuperscript{1,\ddag *},
{Jennifer~Bryan}\textsuperscript{2,\ddag},
{Karen~Cranston}\textsuperscript{3,\ddag},
{Justin~Kitzes}\textsuperscript{4,\ddag},
{Lex~Nederbragt}\textsuperscript{5,\ddag},
{Tracy~K.~Teal}\textsuperscript{6,\ddag}
\\
\textbf{1} Software Carpentry Foundation / [email protected]
\\
\textbf{2} RStudio, University of British Columbia / [email protected]
\\
\textbf{3} Duke University / [email protected]
\\
\textbf{4} University of California, Berkeley / [email protected]
\\
\textbf{5} University of Oslo / [email protected]
\\
\textbf{6} Data Carpentry / [email protected]
\\
\bigskip
{\ddag} These authors contributed equally to this work.
\\
* E-mail: Corresponding [email protected]
\end{flushleft}
\title{Good Enough Practices for Scientific Computing}
\section*{Abstract}
We present a set of computing tools and techniques that every
researcher can and should consider adopting. These recommendations
synthesize inspiration from our own work, from the experiences of the
thousands of people who have taken part in Software Carpentry and Data
Carpentry workshops over the past six years, and from a variety of
other guides. Our recommendations are aimed specifically at people
who are new to research computing.
\section*{Author Summary}
Computers are now essential in all branches of science, but most
researchers are never taught the equivalent of basic lab skills for
research computing. As a result, data can get lost, analyses can take
much longer than necessary, and researchers are limited in how
effectively they can work with software and data. Computing
workflows need to follow the same practices as lab projects and
notebooks, with organized data, documented steps and the project
structured for reproducibility, but researchers new to computing often
don't know where to start.
This paper presents a set of good computing practices that every
researcher can adopt regardless of their current level of
computational skill. These practices, which encompass data
management, programming, collaborating with colleagues, organizing
projects, tracking work, and writing manuscripts, are drawn from a
wide variety of published sources, from our daily lives, and from our
work with volunteer organizations that have delivered workshops to
over 11,000 people since 2010.
\linenumbers
\section*{Introduction}\label{sec:introduction}
Three years ago a group of researchers involved in \withurl{Software
Carpentry}{http://software-carpentry.org/} and \withurl{Data
Carpentry}{http://datacarpentry.org/} wrote a paper called ``Best
Practices for Scientific Computing'' \cite{wilson2014}. That paper
provided recommendations for people who were already doing significant
amounts of computation in their research. However, as computing has
become an essential part of science for all researchers, there is a
larger group of people newer to scientific computing, and the question
then becomes ``where to start?''.
This paper focuses on these first, accessible skills and perspectives
(the ``good enough'' practices) for scientific computing: a minimum
set of tools and techniques that we believe every researcher can and
should consider adopting. It draws inspiration from many sources
\cite{gentzkow2014,noble2009,brown2015,wickham2014,kitzes2016,sandve2013,hart2016},
from our personal experience, and from the experiences of the
thousands of people who have taken part in Software Carpentry and Data
Carpentry workshops over the past six years.
Our intended audience is researchers who are working alone or with a
handful of collaborators on projects lasting a few days to several
months. A practice is included in our list if large numbers of
researchers use it, and large numbers of people are \emph{still} using
it months after first trying it out. We include the second criterion
because there is no point recommending something that people won't
actually adopt.
Many of our recommendations are for the benefit of the collaborator
every researcher cares about most: their future self (as the
joke goes, yourself from three months ago doesn't answer
email{\ldots}). Change is hard and if researchers don't see those
benefits quickly enough to justify the pain, they will almost
certainly switch back to their old way of doing things. This rules
out many practices, such as code review, that we feel are essential
for larger-scale development (Section~\ref{sec:omitted}).
We organize our recommendations into the following topics (Box~1):
\begin{itemize}
\item Data Management:
saving both raw and intermediate forms; documenting all steps;
creating tidy data amenable to analysis.
\item Software:
writing, organizing, and sharing scripts and programs used in an
analysis.
\item Collaboration:
making it easy for existing and new collaborators to understand and
contribute to a project.
\item Project Organization:
organizing the digital artifacts of a project to ease discovery and
understanding.
\item Tracking Changes:
recording how various components of your project change over time.
\item Manuscripts:
writing manuscripts in a way that leaves an audit trail and
minimizes manual merging of conflicts.
\end{itemize}
\subsection*{Acknowledgments}
We are grateful to Arjun Raj (University of Pennsylvania), Steven
Haddock (Monterey Bay Aquarium Research Institute), Stephen Turner
(University of Virginia), Elizabeth Wickes (University of Illinois),
and Garrett Grolemund (RStudio) for their feedback on early versions
of this paper, to those who contributed during the outlining of the
manuscript, and to everyone involved in Data Carpentry and Software
Carpentry for everything they have taught us.
\practicesection{Data Management}{sec:data}
Data within a project may need to exist in various forms, ranging from
what first arrives to what is actually used for the primary analyses.
Our recommendations have two main themes. One is to work towards
ready-to-analyze data incrementally, documenting both the intermediate
data and the process. We also describe the key features of ``tidy
data'', which can be a powerful accelerator for analysis
\cite{wickham2014,hart2016}.
\begin{enumerate}
\item
\practice{1a}{Save the raw data}. Where possible, save data as
  originally generated (e.g., by an instrument or from a survey). It
is tempting to overwrite raw data files with cleaned-up versions,
but faithful retention is essential for re-running analyses from
start to finish; for recovery from analytical mishaps; and for
experimenting without fear. Consider changing file permissions to
read-only or using spreadsheet protection features, so it is harder
to damage raw data by accident or to hand edit it in a moment of
weakness.
Some data will be impractical to manage in this way. For example,
you should avoid making local copies of large, stable databases. In
that case, record the exact procedure used to obtain the raw data,
as well as any other pertinent information, such as an official
version number or the date of download.
\practice{1b}{Ensure that raw data is backed up in more than one
  location}. If external hard drives are used, store them off-site,
  away from the original location. Universities often have their own data
  storage solutions, so it is worthwhile to consult with your local
  Information Technology (IT) group or library. Alternatively, cloud
computing resources, like \withurl{Amazon Simple Storage Service
(Amazon S3)}{https://aws.amazon.com/s3/},
\withurl{Google Cloud Storage}{https://cloud.google.com/storage/}
or \withurl{Microsoft Azure}
{https://azure.microsoft.com/en-us/services/storage/}
are reasonably priced and reliable.
For large data sets, where storage and transfer can be
expensive and time-consuming, you may need to use incremental backup
or specialized storage systems, and people in your local IT group or
library can often provide advice and assistance on options at your
university or organization as well.
\item
\practice{1b}{Create the data you wish to see in the world}. Create
the dataset you \emph{wish} you had received. The goal here is to
improve machine and human readability, but \emph{not} to do vigorous
data filtering or add external information. Machine readability
allows automatic processing using computer programs, which is
important when others want to re-use your data. Specific examples of
non-destructive transformations that we recommend at the beginning
of analysis:
\emph{File formats}: Convert data from closed, proprietary formats
to open, non-proprietary formats that ensure machine readability
across time and computing setups \cite{ffIllinois}. Good options
include CSV for tabular data, JSON, YAML, or XML for non-tabular
data such as graphs (the node-and-arc kind), and HDF5 for
certain kinds of structured data.
\emph{Variable names}: Replace inscrutable variable names and
artificial data codes with self-explaining alternatives, e.g.,
rename variables called \texttt{name1} and \texttt{name2} to
\texttt{personal\_name} and \texttt{family\_name}, recode the
treatment variable from \texttt{1} vs. \texttt{2} to
\texttt{untreated} vs. \texttt{treated}, and replace artificial
codes for missing data, such as ``-99'', with \texttt{NA}, a code
used in most programming languages to indicate that data is ``Not
Available'' \cite{white2013}.
\emph{Filenames}: Store especially useful metadata as part of the
filename itself, while keeping the filename regular enough for easy
pattern matching. For example, a filename like
\texttt{2016-05-alaska-b.csv} makes it easy for both people and
programs to select by year or by location.
\item
\practice{1c}{Create analysis-friendly data}. Analysis can be much
easier if you are working with so-called ``tidy'' data
\cite{wickham2014}. Two key principles are:
\emph{Make each column a variable}: Don't cram two variables into
one, e.g., ``male\_treated'' should be split into separate variables
for sex and treatment status. Store units in their own variable or
in metadata, e.g., ``3.4'' instead of ``3.4kg''.
\emph{Make each row an observation}: Data often comes in a wide
format, because that facilitated data entry or human
inspection. Imagine one row per field site and then columns for
measurements made at each of several time points. Be prepared to
gather such columns into a variable of measurements, plus a new
variable for time point. Fig~\ref{fig:tidy} presents an example
of such a transformation.
\begin{figure}
\includegraphics[width = 5in]{tidy-data.png}
\caption{\textbf{Example of gathering columns to create tidy data.}}
\label{fig:tidy}
\end{figure}
\item
\practice{1d}{Record all the steps used to process data}. Data
manipulation is as integral to your analysis as statistical modeling
and inference. If you do not document this step thoroughly, it is
impossible for you, or anyone else, to repeat the analysis.
The best way to do this is to write scripts for \emph{every} stage
of data processing. This might feel frustratingly slow, but you
will get faster with practice. The immediate payoff will be the ease
with which you can re-do data preparation when new data arrives. You
can also re-use data preparation steps in the future for related
projects. For very large data sets, data preparation may also
include writing and saving scripts to obtain the data or subsets of
the data from remote storage.
Some data cleaning tools, such as
\withurl{OpenRefine}{http://www.openrefine.org}, provide a graphical
user interface, but also automatically keep track of each step in
  the process. When neither such tools nor scripting is feasible,
it's important to clearly document every manual action (what menu
was used, what column was copied and pasted, what link was clicked,
etc.). Often you can at least capture \emph{what} action was taken,
if not the complete \emph{why}. For example, choosing a region of
interest in an image is inherently interactive, but you can save the
region chosen as a set of boundary coordinates.
\item
\practice{1e}{Anticipate the need to use multiple tables, and use a
unique identifier for every record}. Raw data, even if tidy, is
not necessarily complete. For example, the primary data table might
hold the heart rate for individual subjects at rest and after a
physical challenge, identified via a subject ID. Demographic
variables, such as subject age and sex, are stored in a second table
and will need to be brought in via merging or lookup. This will go
more smoothly if subject ID is represented in a common format in
  both tables, e.g., always as ``14025'' rather than as ``14,025'' in one
  table and ``014025'' in another. It is generally wise to give each
record or unit a unique, persistent key and to use the same names
and codes when variables in two datasets refer to the same thing.
\item
\practice{1f}{Submit data to a reputable DOI-issuing repository so
that others can access and cite it}. Your data is as much a
product of your research as the papers you write, and just as likely
to be useful to others (if not more so). Sites such as
\withurl{Figshare}{https://figshare.com/},
\withurl{Dryad}{http://datadryad.org/}, and
\withurl{Zenodo}{https://zenodo.org/} allow others to find your
work, use it, and cite it; we discuss licensing in
Section~\ref{sec:collaboration} below. Follow your research
community's standards for how to provide metadata. Note that there
are two types of metadata: metadata about the dataset as a whole and
metadata about the content within the dataset. If the audience is
humans, write the metadata (the README file) for humans. If the
audience includes automatic metadata harvesters, fill out the formal
metadata and write a good README file for the humans
\cite{wickes2015}.
\end{enumerate}
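
The gathering step described under ``Make each row an observation''
can be sketched in a few lines of Python. This is an illustration
under stated assumptions rather than a recommended implementation: the
table contents, the column names \texttt{site}, \texttt{t1}, and
\texttt{t2}, and the helper \texttt{gather} are all hypothetical, and
\texttt{-99} is assumed to be the code for missing data. A real
project would more likely rely on a data-frame library.

{\small
\begin{verbatim}
```python
import csv
import io

# Hypothetical wide table: one row per field site, one column per time point.
WIDE = """site,t1,t2
alaska,3.4,-99
yukon,2.9,3.1
"""

def gather(rows, id_column, value_columns, missing="-99"):
    """Turn each (row, value column) pair into one tidy observation."""
    tidy = []
    for row in rows:
        for column in value_columns:
            value = row[column]
            tidy.append({
                id_column: row[id_column],
                "time_point": column,
                # Recode the artificial missing-data code as NA.
                "measurement": "NA" if value == missing else value,
            })
    return tidy

rows = list(csv.DictReader(io.StringIO(WIDE)))
tidy_rows = gather(rows, "site", ["t1", "t2"])
for record in tidy_rows:
    print(record)
```
\end{verbatim}
}

Each output record holds one site, one time point, and one
measurement, so adding a third time point adds rows rather than
columns.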
Taken in order, the recommendations above will produce intermediate
data files with increasing levels of cleanliness and
task-specificity. An alternative approach to data management would be
to fold all data management tasks into a monolithic procedure for data
analysis, so that intermediate data products are created ``on the
fly'' and stored only in memory, not saved as distinct files.
While the latter approach may be appropriate for projects in which
very little data cleaning or processing is needed, we recommend the
explicit creation and retention of intermediate products. Saving
intermediate files makes it easy to re-run \emph{parts} of a data
analysis pipeline, which in turn makes it less onerous to revisit and
improve specific data processing tasks. Breaking a lengthy workflow
into pieces makes it easier to understand, share, describe, and
modify. This is particularly true when working with large data sets,
where storage and transfer of the entire data set is not trivial or
inexpensive.
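
A staged workflow of this kind might look like the following sketch,
in which the first stage writes a cleaned intermediate file and the
second stage starts from that file. The file names, column names, and
the two functions are hypothetical, and the ID clean-up rule (dropping
thousands separators and leading zeros) is an assumption chosen only
to echo the subject ID example above.

{\small
\begin{verbatim}
```python
import csv
import tempfile
from pathlib import Path

FIELDS = ["subject_id", "heart_rate"]

def clean(raw_path, clean_path):
    """Stage 1: read the raw file and save a cleaned intermediate file."""
    with open(raw_path, newline="") as src, \
         open(clean_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=FIELDS)
        writer.writeheader()
        for row in csv.DictReader(src):
            # Assumed rule: one common ID format, e.g. "14,025" -> "14025".
            subject_id = row["subject_id"].replace(",", "").lstrip("0")
            writer.writerow({"subject_id": subject_id,
                             "heart_rate": row["heart_rate"]})

def summarize(clean_path):
    """Stage 2: start from the intermediate file, not from the raw data."""
    with open(clean_path, newline="") as src:
        rates = [float(row["heart_rate"]) for row in csv.DictReader(src)]
    return sum(rates) / len(rates)

# A temporary directory stands in for the project's data directory.
workdir = Path(tempfile.mkdtemp())
raw_file = workdir / "raw.csv"
clean_file = workdir / "clean.csv"
raw_file.write_text('subject_id,heart_rate\n"14,025",62\n014026,70\n')

clean(raw_file, clean_file)        # re-run only when the raw data changes
mean_rate = summarize(clean_file)  # can be re-run on its own
print(clean_file.read_text())
print(mean_rate)
```
\end{verbatim}
}

Because the cleaned file is saved, the summary stage can be revised
and re-run without touching the raw data or repeating the cleaning.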
\practicesection{Software}{sec:software}
If you or your group are creating tens of thousands of lines of
software for use by hundreds of people you have never met, you are
doing software engineering. If you're writing a few dozen lines now
and again, and are probably going to be its only user, you may not be
doing engineering, but you can still make things easier on yourself by
adopting a few key engineering practices. What's more, adopting these
practices will make it easier for people to understand and (re)use
your code.
The core realization in these practices is that \emph{readable},
\emph{reusable}, and \emph{testable} are all side effects of writing
\emph{modular} code, i.e., of building programs out of short,
single-purpose functions with clearly-defined inputs and
outputs~\cite{hunt1999}. Much has been written on this
topic~\cite{hunt1999,mcconnell2004,martin2008},
and this section focuses on practices that best balance ease of use
with benefit for you and collaborators.
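
As a small illustration of what modular code can look like, the Python
sketch below builds a summary out of two short, single-purpose
functions. The function names and the one-reading-per-line input
format are our own assumptions for the example, not taken from any
particular project.

{\small
\begin{verbatim}
```python
def parse_reading(line):
    """Convert one 'name,value' text line into a (name, float) pair."""
    name, value = line.strip().split(",")
    return name, float(value)

def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)

def summarize(lines):
    """Combine the two functions above: parse every line, then average."""
    readings = [parse_reading(line) for line in lines]
    return mean([value for _, value in readings])

# Each function can be read, re-used, and tested on its own.
assert parse_reading("a,1.5\n") == ("a", 1.5)
assert mean([1.0, 2.0, 3.0]) == 2.0
print(summarize(["a,1.5", "b,2.5"]))
```
\end{verbatim}
}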
\begin{enumerate}
\item
\practice{2a}{Place a brief explanatory comment at the start of
every program}, no matter how short it is. That comment should
include at least one example of how the program is used: remember, a
good example is worth a thousand words. Where possible, the comment
should also indicate reasonable values for parameters. An example
  of such a comment is shown below.
{\small
\begin{verbatim}
Synthesize image files for testing circularity estimation algorithm.
Usage: make_images.py -f fuzzing -n flaws -o output -s seed -v -w size
where:
-f fuzzing = fuzzing range of blobs (typically 0.0-0.2)
-n flaws = p(success) for geometric distribution of # flaws/sample (e.g. 0.5-0.8)
-o output = name of output file
-s seed = random number generator seed (large integer)
-v = verbose
-w size = image width/height in pixels (typically 480-800)
-h = show help message
\end{verbatim}
}
\item
\practice{2b}{Decompose programs into functions} that are no more
than one page (about 60 lines) long. A function is a reusable
section of software that can be treated as a black box by the rest
of the program. The syntax for creating functions depends on
programming language, but generally you name the function, list its
input parameters, and describe what information it produces.
Functions should take no more than five or six input parameters and
should not reference outside information.
The key motivation here is to fit the program into the most limited
memory of all: ours. Human short-term memory is famously incapable
of holding more than about seven items at once \cite{miller1956}. If
we are to understand what our software is doing, we must break it
into chunks that obey this limit, then create programs by combining
these chunks. Putting code into functions also makes it easier to
test and troubleshoot when things go wrong.
\item
\practice{2c}{Be ruthless about eliminating duplication}. Write and
re-use functions instead of copying and pasting code, and use data
structures like lists instead of creating many closely-related
variables, e.g. create \texttt{score = (1, 2, 3)} rather than
\texttt{score1}, \texttt{score2}, and \texttt{score3}.
Also look for well-maintained libraries that already do what you're
trying to do. All programming languages have libraries that you can
  import and use in your code. A library is code that people have already
  written and made available for distribution to serve a particular
  purpose. For instance, there are libraries for statistics,
  modeling, mapping, and many more. Many languages catalog the
libraries in a centralized source, for instance R has
\withurl{CRAN}{https://cran.r-project.org/}, Python has
\withurl{PyPI}{https://pypi.python.org/}, and so on. So
\practice{2d}{always search for well-maintained software libraries
that do what you need} before writing new code yourself, but
\practice{2e}{test libraries before relying on them}.
\item
\practice{2f}{Give functions and variables meaningful names}, both
to document their purpose and to make the program easier to read. As
a rule of thumb, the greater the scope of a variable, the more
informative its name should be: while it's acceptable to call the
counter variable in a loop \texttt{i} or \texttt{j}, things that are
  re-used often, such as the major data structures in a program, should
\emph{not} have one-letter names. Remember to follow each
language's conventions for names, such as \texttt{net\_charge} for
  Python and \texttt{netCharge} for Java.
\begin{quote}
\noindent \textbf{Tab Completion}
\\
Almost all modern text editors provide \emph{tab completion}, so
that typing the first part of a variable name and then pressing
the tab key inserts the completed name of the variable. Employing
this means that meaningful longer variable names are no harder to
type than terse abbreviations.
\end{quote}
\item
\practice{2g}{Make dependencies and requirements explicit}. This is
usually done on a per-project rather than per-program basis, i.e.,
by adding a file called something like \texttt{requirements.txt} to
the root directory of the project, or by adding a ``Getting
Started'' section to the \texttt{README} file.
\item
\practice{2h}{Do not comment and uncomment sections of code to
control a program's behavior}, since this is error prone and makes
it difficult or impossible to automate analyses. Instead, put
if/else statements in the program to control what it does.
\item
\practice{2i}{Provide a simple example or test data set} that users
(including yourself) can run to determine whether the program is
working and whether it gives a known correct output for a simple
known input. Such a ``build and smoke test'' is particularly helpful
when supposedly-innocent changes are being made to the program, or
when it has to run on several different machines, e.g., the
developer's laptop and the department's cluster.
\item
\practice{2j}{Submit code to a reputable DOI-issuing repository}
upon submission of paper, just as you do with data. Your software is
as much a product of your research as your papers, and should be as
easy for people to credit. DOIs for software are provided by
\withurl{Figshare}{https://figshare.com/} and
\withurl{Zenodo}{https://zenodo.org/}. Zenodo integrates directly
with GitHub.
\end{enumerate}
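
Several of the practices above (an explanatory comment with a usage
example, a flag rather than commented-out code, and a simple smoke
test) can be combined in one short script. The sketch below is
hypothetical: the program name \texttt{rescale.py}, its arguments, and
its behavior are invented for illustration, and the command line is
passed in as a list so that the example is self-contained.

{\small
\begin{verbatim}
```python
"""Rescale measurements, optionally dropping missing values.

Usage: rescale.py [--drop-missing] factor
where:
    factor         = multiplier for every value (e.g. 0.001 for g to kg)
    --drop-missing = skip missing values instead of keeping them as None
"""
import argparse

def rescale(values, factor, drop_missing=False):
    """Multiply each value by factor; None marks a missing value."""
    result = []
    for value in values:
        if value is None:
            # An if/else driven by a parameter replaces (un)commenting code.
            if drop_missing:
                continue
            result.append(None)
        else:
            result.append(value * factor)
    return result

def parse_args(argv):
    """Read the factor and the optional flag from a list of arguments."""
    parser = argparse.ArgumentParser(description="Rescale measurements.")
    parser.add_argument("factor", type=float)
    parser.add_argument("--drop-missing", action="store_true")
    return parser.parse_args(argv)

def smoke_test():
    """Known input, known correct output: run after any change or install."""
    assert rescale([1, None, 3], 2.0) == [2.0, None, 6.0]
    assert rescale([1, None, 3], 2.0, drop_missing=True) == [2.0, 6.0]

smoke_test()
args = parse_args(["--drop-missing", "0.5"])
print(rescale([4, None, 10], args.factor, args.drop_missing))
```
\end{verbatim}
}

Switching behavior now means changing a command-line argument, which
can be recorded and automated, rather than editing the source.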
\practicesection{Collaboration}{sec:collaboration}
You may start working on projects by yourself or with a small group of
collaborators you already know, but you should design it to make it
easy for new collaborators to join. These collaborators might be new
grad students or postdocs in the lab, or they might be \emph{you}
returning to a project that has been idle for some time. As summarized
in \cite{steinmacher2015}, you want to make it easy for people to set
up a local workspace so that they \emph{can} contribute, help them
find tasks so that they know \emph{what} to contribute, and make the
contribution process clear so that they know \emph{how} to contribute.
You also want to make it easy for people to give you credit for your
work.
\begin{enumerate}
\item
\practice{3a}{Create an overview of your project}. Have a short
file in the project's home directory that explains the purpose of
the project. This file (generally called \texttt{README},
\texttt{README.txt}, or something similar) should contain the
project's title, a brief description, up-to-date contact
information, and an example or two of how to run various cleaning or
analysis tasks. It is often the first thing users and collaborators
on your project will look at, so make it explicit how you want
people to engage with the project. If you are looking for more
contributors, make it clear that you welcome contributors and
point them to the license (more below) and ways they can help.
You should also create a \texttt{CONTRIBUTING} file that describes
what people need to do in order to get the project going and use or
contribute to it, i.e., dependencies that need to be installed,
tests that can be run to ensure that software has been installed
correctly, and guidelines or checklists that your project adheres
to.
\item
\practice{3b}{Create a shared ``to-do'' list}. This can be a plain
text file called something like \texttt{notes.txt} or
\texttt{todo.txt}, or you can use sites such as GitHub or Bitbucket
to create a new \emph{issue} for each to-do item. (You can even add
labels such as ``low hanging fruit'' to point newcomers at issues
that are good starting points.) Whatever you choose, describe the
items clearly so that they make sense to newcomers.
\item
\practice{3c}{Decide on communication strategies}. Make explicit
decisions about (and publicize where appropriate) how members of the
  project will communicate with each other and with external users /
collaborators. This includes the location and technology for email
lists, chat channels, voice / video conferencing, documentation, and
meeting notes, as well as which of these channels will be public or
private.
\item
\practice{3d}{Make the license explicit}. Have a \texttt{LICENSE}
file in the project's home directory that clearly states what
license(s) apply to the project's software, data, and
manuscripts. Lack of an explicit license does not mean there isn't
one; rather, it implies the author is keeping all rights and others
  are not allowed to re-use or modify the material.
We recommend Creative Commons licenses for data and text, either
\withurl{CC-0}{https://creativecommons.org/about/cc0/} (the ``No
Rights Reserved'' license) or
\withurl{CC-BY}{https://creativecommons.org/licenses/by/4.0/} (the
``Attribution'' license, which permits sharing and re-use but
requires people to give appropriate credit to the creators). For
software, we recommend a permissive open source license such as the
MIT, BSD, or Apache license \cite{laurent2004}.
\begin{quote}
\noindent \textbf{What Not To Do}
\\
We recommend \emph{against} the ``no commercial use'' variations
of the Creative Commons licenses because they may impede some
forms of re-use. For example, if a researcher in a developing
country is being paid by her government to compile a public health
report, she will be unable to include your data if the license
says ``non-commercial''. We recommend permissive software licenses
rather than the GNU General Public License (GPL) because it is
easier to integrate permissively-licensed software into other
projects; see chapter three in \cite{laurent2004}.
\end{quote}
\item
\practice{3e}{Make the project citable} by including a
\texttt{CITATION} file in the project's home directory that
describes how to cite this project as a whole, and where to find
(and how to cite) any data sets, code, figures, and other artifacts
that have their own DOIs. The example below shows the
\texttt{CITATION} file for the EcoData
Retriever (https://github.com/weecology/retriever); for an example
of a more detailed \texttt{CITATION} file, see the one for the khmer
project (https://github.com/dib-lab/khmer).
{\small
\begin{verbatim}
Please cite this work as:
Morris, B.D. and E.P. White. 2013. "The EcoData Retriever:
improving access to existing ecological data." PLOS ONE 8:e65848.
http://doi.org/doi:10.1371/journal.pone.0065848
\end{verbatim}
}
\end{enumerate}
\practicesection{Project Organization}{sec:project}
Organizing the files that make up a project in a logical and
consistent directory structure will help you and others keep track of
them. Our recommendations for doing this are drawn primarily from
\cite{noble2009,gentzkow2014}.
\begin{enumerate}
\item
\practice{4a}{Put each project in its own directory, which is named
after the project}. Like deciding when a chunk of code should be
made a function, the ultimate goal of dividing research into
distinct projects is to help you and others best understand your
work. Some researchers create a separate project for each manuscript
they are working on, while others group all research on a common
theme, data set, or algorithm into a single project.
As a rule of thumb, divide work into projects based on the overlap
in data and code files. If two research efforts share no data or
code, they will probably be easiest to manage independently. If they
share more than half of their data and code, they are probably best
managed together, while if you are building tools that are used in
several projects, the common code should probably be in a project of
its own. Projects do often require their own organizational model,
but below are general recommendations on how you can structure data,
code, analysis outputs and other files. The important concept is
that it is useful to organize the project by the types of files and
that consistency helps you effectively find and use things later.
\item
\practice{4b}{Put text documents associated with the project in the
\texttt{doc} directory}. This includes files for manuscripts,
documentation for source code, and/or an electronic lab notebook
recording your experiments. Subdirectories may be created for these
different classes of files in large projects.
\item
\practice{4c}{Put raw data and metadata in a \texttt{data}
directory, and files generated during cleanup and analysis in a
\texttt{results} directory}, where ``generated files'' includes
intermediate results, such as cleaned data sets or simulated data,
as well as final results such as figures and tables.
The \texttt{results} directory will \emph{usually} require
additional subdirectories for all but the simplest
projects. Intermediate files such as cleaned data, statistical
tables, and final publication-ready figures or tables should be
separated clearly by file naming conventions or placed into
different subdirectories; those belonging to different papers or
other publications should be grouped together. Similarly, the
\texttt{data} directory might require subdirectories to organize raw
data based on time, method of collection, or other metadata most
relevant to your analysis.
\item
\practice{4d}{Put project source code in the \texttt{src}
directory}. \texttt{src} contains all of the code written for the
project. This includes programs written in interpreted languages
such as R or Python; those written in compiled languages like Fortran,
C++, or Java; as well as shell scripts, snippets of SQL used to pull
information from databases, and other code needed to regenerate the
results.
This directory may contain two conceptually distinct types of files
that should be distinguished either by clear file names or by
additional subdirectories. The first type are files or groups of
files that perform the core analysis of the research, such as data
cleaning or statistical analyses. These files can be thought of as
the ``scientific guts'' of the project.
The second type of file in \texttt{src} is a controller or driver
script that contains all the analysis steps for the entire
project from start to finish, with particular parameters and data
input/output commands. A controller script for a simple project, for
example, may read a raw data table, import and apply several cleanup
and analysis functions from the other files in this directory, and
create and save a numeric result. For a small project with one main
output, a single controller script should be placed in the main
\texttt{src} directory and distinguished clearly by a name such as
``runall''. The short example in Fig~\ref{fig:script} is typical
of scripts of this kind; note how it uses one variable,
\texttt{TEMP\_DIR}, to avoid repeating the name of a particular
directory four times.
\begin{figure}
{\small
\begin{verbatim}
TEMP_DIR=./temp_zip_files
echo "Packaging zip files required by analysis tool..."
mkdir "$TEMP_DIR"
./src/make-zip-files.py "$TEMP_DIR" *.dat
echo "Analyzing..."
./bin/sqr_mean_analyze -i "$TEMP_DIR" -b "temp"
echo "Cleaning up..."
rm -rf "$TEMP_DIR"
\end{verbatim}
}
\caption{\textbf{Example of a ``runall'' script.}}
\label{fig:script}
\end{figure}
\item
\pagebreak
\practice{4e}{Put compiled programs in the \texttt{bin} directory}.
\texttt{bin} contains executable programs compiled from code in the
\texttt{src} directory (the name \texttt{bin} is an old Unix
convention, and comes from the term ``binary''). Projects that
have no compiled programs do not need a \texttt{bin} directory.
\begin{quote}
\noindent \textbf{Scripts vs.\ Programs}
\\
We use the term ``script'' to mean ``something that is executed
directly as-is'', and ``program'' to mean ``something that is
explicitly compiled before being used''. The distinction is more
one of degree than kind---libraries written in Python are actually
compiled to bytecode as they are loaded, for example---so one
other way to think of it is ``things that are edited directly''
and ``things that are not''.
\end{quote}
\begin{quote}
\noindent \textbf{External Scripts}
\\
If \texttt{src} is for human-readable source code, and
\texttt{bin} is for compiled binaries, where should projects put
scripts that are executed directly---particularly ones that are
brought in from outside the project? On the one hand, these are
written in the same languages as the project-specific scripts in
\texttt{src}; on the other, they are executable, like the programs
in \texttt{bin}. The answer is that it doesn't matter, as long as
each team's projects follow the same rule. As with many of our
other recommendations, consistency and predictability are more
important than hair-splitting.
\end{quote}
\item
\practice{4f}{Name all files to reflect their content or function}.
For example, use names such as \texttt{bird\_count\_table.csv},
\texttt{manuscript.md}, or \texttt{sightings\_analysis.py}. Do
\emph{not} use sequential numbers (e.g., \texttt{result1.csv},
\texttt{result2.csv}) or a location in a final manuscript (e.g.,
\texttt{fig\_3\_a.png}), since those numbers will almost certainly
change as the project evolves.
\end{enumerate}
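As a minimal sketch of practices 4b through 4e, the Python function
below creates an empty skeleton for a new project. The function name
\texttt{create\_project}, the \texttt{with\_bin} flag, and the choice
to create empty \texttt{README}, \texttt{LICENSE}, and
\texttt{CITATION} placeholders are illustrative assumptions rather
than part of any standard tool; adapt them to your own conventions.

```python
from pathlib import Path

# Directory names follow practices 4b-4e. The placeholder files are an
# illustrative assumption: they are created empty and must be filled in.
SUBDIRECTORIES = ["data", "doc", "results", "src"]
PLACEHOLDER_FILES = ["README", "LICENSE", "CITATION"]


def create_project(parent, name, with_bin=False):
    """Create an empty project skeleton and return its root directory."""
    root = Path(parent) / name
    subdirectories = SUBDIRECTORIES + (["bin"] if with_bin else [])
    for subdirectory in subdirectories:
        # exist_ok=True makes the function safe to re-run on a project
        # that already exists.
        (root / subdirectory).mkdir(parents=True, exist_ok=True)
    for filename in PLACEHOLDER_FILES:
        # touch() creates the file if it is missing and leaves the
        # content of an existing file unchanged.
        (root / filename).touch()
    return root
```

For example, \texttt{create\_project(".", "bird\_counts", with\_bin=True)}
would also create a \texttt{bin} directory, which is only needed by
projects that build compiled programs; the project name here is made up.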
The diagram in Fig~\ref{fig:project} provides a concrete example of
how a simple project might be organized following these
recommendations:
\begin{figure}
{\small
\begin{verbatim}
.
|-- CITATION
|-- README
|-- LICENSE
|-- requirements.txt
|-- data
|   |-- bird_count_table.csv
|-- doc
|   |-- notebook.md
|   |-- manuscript.md
|   |-- changelog.txt
|-- results
|   |-- summarized_results.csv
|-- src
|   |-- sightings_analysis.py
|   |-- runall.py
\end{verbatim}
}
\caption{\textbf{Project layout.}}
\label{fig:project}
\end{figure}
The root directory contains a \texttt{README} file that provides an
overview of the project as a whole, a \texttt{CITATION} file that
explains how to reference it, and a \texttt{LICENSE} file that states the
licensing. The \texttt{data} directory contains a
single CSV file with tabular data on bird counts (machine-readable
metadata could also be included here). The \texttt{src} directory
contains \texttt{sightings\_analysis.py}, a Python file containing
functions to summarize the tabular data, and a controller script
\texttt{runall.py} that loads the data table, applies functions
imported from \texttt{sightings\_analysis.py}, and saves a table of
summarized results in the \texttt{results} directory.
This project doesn't have a \texttt{bin} directory, since it does not
rely on any compiled software. The \texttt{doc} directory contains two
text files written in Markdown, one containing a running lab notebook
describing various ideas for the project and how these were
implemented and the other containing a running draft of a manuscript
describing the project findings.
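To make the role of the controller script concrete, here is a minimal
sketch of what \texttt{runall.py} might contain. The column names
\texttt{species} and \texttt{count}, the function names
\texttt{total\_by\_species} and \texttt{run\_all}, and the output
column \texttt{total\_count} are hypothetical. In a real project the
summary function would live in \texttt{sightings\_analysis.py} and be
imported; it is defined inline here only so that the example is
self-contained.

```python
import csv
from collections import defaultdict
from pathlib import Path


def total_by_species(rows):
    """Sum the 'count' column for each value of the 'species' column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["species"]] += int(row["count"])
    return dict(totals)


def run_all(data_file, results_file):
    """Load the raw table, summarize it, and save the summary as CSV."""
    with open(data_file, newline="") as handle:
        totals = total_by_species(csv.DictReader(handle))
    # Create the results directory on demand so a fresh copy of the
    # project can be run without manual setup.
    Path(results_file).parent.mkdir(parents=True, exist_ok=True)
    with open(results_file, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["species", "total_count"])
        # Sort so that repeated runs produce identical output files.
        for species in sorted(totals):
            writer.writerow([species, totals[species]])
    return totals
```

Run from the project root, a call such as
\texttt{run\_all("data/bird\_count\_table.csv", "results/summarized\_results.csv")}
would regenerate the summary table from the raw data in one step,
leaving the files in \texttt{data} untouched.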
\practicesection{Keeping Track of Changes}{sec:versioning}
Keeping track of changes that you or your collaborators make to data
and software is a critical part of research. Being able to reference or
retrieve a specific version of the entire project aids reproducibility
in the lead-up to publication, when responding to reviewer comments,
and when providing supporting information for reviewers, editors,
and readers.
We believe that the best tools for tracking changes are the version
control systems that are used in software development, such as Git,
Mercurial, and Subversion. They keep track of what was changed in a
file, when, and by whom, and synchronize changes to a central server so
that multiple contributors can manage changes to the same set of files.
While these version control tools make tracking changes easier, they
can have a steep learning curve. We therefore provide two sets of
recommendations: (1) a systematic manual approach for managing
changes, and (2) version control in its full glory. You can
use the first while working towards the second, or jump straight into
version control.
Whatever system you choose, we recommend that you:
\begin{enumerate}
\item
\practice{5a}{Back up (almost) everything created by a human being
as soon as it is created}. This includes scripts and programs of
all kinds, software packages that your project depends on, and
documentation. A few exceptions to this rule are discussed below.
\item
\practice{5b}{Keep changes small}. Each change should not be so
large as to make the change tracking irrelevant. For example, a
single change such as ``Revise script file'' that adds or changes
several hundred lines is likely too large, as it will not allow
changes to different components of an analysis to be investigated
separately. Similarly, changes should not be broken up into pieces
that are too small. As a rule of thumb, a good size for a single
change is a group of edits that you could imagine wanting to undo in
one step at some point in the future.
\item
\practice{5c}{Share changes frequently}. Everyone working on the
project should share and incorporate changes from others on a
regular basis. Do not allow individual investigators' versions of
the project repository to drift apart, as the effort required to
merge differences goes up faster than the size of the
difference. This is particularly important for the manual versioning
procedure described below, which does not provide any assistance for
merging simultaneous, possibly conflicting, changes.
\item
\practice{5d}{Create, maintain, and use a checklist for saving and
sharing changes to the project}. The list should include writing
log messages that clearly explain any changes, the size and content
of individual changes, style guidelines for code, updating to-do
lists, and bans on committing half-done work or broken code. See
\cite{gawande2011} for more on the proven value of checklists.
\item
\practice{5e}{Store each project in a folder that is mirrored off
the researcher's working machine} using a system such as
\withurl{Dropbox}{http://dropbox.com} or a remote version control
repository such as \withurl{GitHub}{http://github.com}. Synchronize
that folder at least daily. It may take a few minutes, but that time
is repaid the moment a laptop is stolen or its hard drive fails.
\end{enumerate}
\subsection*{Manual Versioning}
Our first suggested approach, in which everything is done by hand, has
two additional parts:
\begin{enumerate}
\item
\practice{5f}{Add a file called \texttt{CHANGELOG.txt} to the
project's \texttt{doc} subfolder}, and make dated notes about
changes to the project in this file in reverse chronological order
(i.e., most recent first). This file is the equivalent of a lab
notebook, and should contain entries like those shown below.
{\small
\begin{verbatim}
## 2016-04-08
* Switched to cubic interpolation as default.
* Moved question about family's TB history to end of questionnaire.
## 2016-04-06