---
title: "Reproducible workflow, Metadata, and R Markdown"
author: "Lina Tran"
---
## Lesson preamble
> ### Lesson objectives:
>
> - Learn why reproducibility in science is important
> - Discuss how we can improve reproducibility of our research
> - Learn R Markdown formatting: images, citations, footnotes, and output formats
>
> ### Lesson outline:
>
> - Reproducible Science (50 min)
>     - Introduction to the problem and discussion (20 min)
>     - How we can improve reproducibility (15 min)
>     - What are the barriers to reproducibility? (15 min)
> - Metadata (20 min)
>     - What is metadata and why do we need it? (10 min)
>     - What are best practices in generating metadata? (10 min)
> - Intermediate topics in R Markdown (40 min)
>     - Online Tutorial (10 min)
>     - Lists, Tables (5 min)
>     - Images, Figures (5 min)
>     - In-Line Citations & Bibliography (10 min)
>     - Footnotes (5 min)
>     - Output Formatting Options (5 min)
-----
## Reproducible Science
What does reproducibility mean to you? Let's discuss what each of these may entail:
- Computational reproducibility
- Scientific reproducibility
- Statistical reproducibility
### Why do we care about reproducibility?
Take 5-10 minutes to read over the following blog post section on [reproducibility
in science](https://bioconnector.org/r-rmarkdown.html#who_cares_about_reproducible_research)
and [what's in it for you](https://bioconnector.org/r-rmarkdown.html#what’s_in_it_for_you).
Now, let's discuss the following questions:
1. Why does reproducibility matter in science?
2. What do you think about when you hear the term "open science"?
3. How does open science translate back to the issue of reproducibility?
    - How does it affect collaboration, and the progress of science?
Events of interest supporting Open Science:
1. [Open Access Week](http://www.openaccessweek.org/)
2. [OpenCon 2017](https://www.opencon2017.org/) sponsored by eLife this year
    - [Authorea Blog Competition for OpenCon 2017](https://www.authorea.com/users/111970/articles/206557-announcing-opencon2017-london-blog-competition)
### What are ways we can make research more reproducible?
- Learn from software engineers, who for years have used continuous integration
to keep track of their projects in an automated fashion (which helps prevent
human error). This includes the following:
    - analyses/models are run as scripts that can be replicated on other machines
    and environments
    - code is version-controlled (e.g. Git/GitHub)
    - code is well tested
    - for a take on "Continuous Analysis", read this [article from eLife](https://elifesciences.org/labs/e623676c/reproducibility-automated)
    or [this article](https://www.biorxiv.org/content/early/2016/08/11/056473) in
    Nature Biotechnology with everything also [on GitHub!](https://github.com/greenelab/continuous_analysis)
- For more recommendations on reproducible research, especially as it pertains
to coding practices, read more [here](https://bioconnector.org/r-rmarkdown.html#some_recommendations_for_reproducible_research)
### What are the barriers to reproducibility?
- Brainstorm reasons why scientists are not readily embracing reproducible science.
```{r, echo=FALSE}
# Some ideas:
# - workflows are hard to document at every step
# - we already have systems in place that work, and it would be hard/time consuming to upend them
# - people and skills: adoption is currently low and it is hard to force collaborators to change
# - PIs are not willing to adopt new practices and say they do not have time to relearn
# - Many students doing this might be in a lab where the PI is not computational and does not fully
#   appreciate the benefits. It is also hard to take the first steps as a student not knowing what
#   the state of the art is.
# - Few, if any, courses on this at the university, especially outside of computer science.
```
### Further Reading
- The book [The Practice of Reproducible Research](https://www.practicereproducibleresearch.org/)
has many high-level examples of how reproducible research looks in practice, plus further reading on the topic.
- Nature has a series of articles, editorials, and research on the [challenges in reproducible research](https://www.nature.com/news/reproducibility-1.17552).
## Metadata
### Why is Metadata Important?
From the [Mozilla Science WOW Data Reuse Template](https://github.com/mozillascience/working-open-workshop/blob/gh-pages/handouts/data_reuse_plan_template.md):
> Standard Metadata: Increasingly, scientific fields are moving towards standard
> metadata formats (data.json, data.xml, etc) to pull all the information in the
> Data Reuse plan together in a machine readable format. Machine readable metadata
> enables cataloging of datasets on sites like Data.gov and allows others to ask
> questions and access your datasets using code. For example, open US government
> data online is required to expose a data.json in the landing page html to be
> listed on Data.gov, thereby facilitating data discovery. Because not all
> researchers are mandated to actually include data.json files, Data.gov is
> incomplete, and simple questions like "what is the total volume of data
> generated by US federally funded scientists?" are unanswerable.
Let's go over a few of these questions:
1. What do you think metadata is?
2. When you were choosing your dataset, did you encounter problems understanding the data?
3. If so, what would have helped you to understand?
### What Are Best Practices in Generating Metadata?
Take some time to read over the first page of the [Center for Government
Excellence's](https://govex.jhu.edu/) ["Open Data Metadata Guide"](https://www.gitbook.com/book/centerforgov/open-data-metadata-guide/details)
to get a better idea.
Now let's browse through the sections for more specific best practices.
To review, metadata includes information about how the data was collected.
Succinct descriptions of the data, as well as information about when it was last
updated, are very important for letting people know whether this is the right dataset for them.
Once people decide to use your dataset, things like licensing and column
metadata become very important. Column metadata are sometimes called data dictionaries.
Here is an example: <https://liberalarts.utexas.edu/redcap/_files/data_dictionary_example.jpg>
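
As a rough sketch of what a data dictionary can look like in R, the block below builds one for three columns of the built-in `mtcars` dataset. The object name `mtcars_dictionary` and the choice of fields (`column`, `type`, `description`) are assumptions made for illustration, not a standard; the descriptions are taken from `?mtcars`.

```r
# A minimal, illustrative data dictionary for three columns of mtcars.
# The layout (column / type / description) is an assumption for this sketch.
columns <- c("mpg", "cyl", "wt")

mtcars_dictionary <- data.frame(
  column = columns,
  # Look up the storage type of each column directly from the data
  type = vapply(mtcars[columns], function(x) class(x)[1], character(1)),
  # Descriptions as given in the mtcars help page (?mtcars)
  description = c("Miles/(US) gallon", "Number of cylinders", "Weight (1000 lbs)"),
  row.names = NULL,
  stringsAsFactors = FALSE
)

mtcars_dictionary
```

Saving a table like this next to the data (for example with `write.csv()`) gives other people the column metadata they need to decide how to use it.
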
## R Markdown & Knitr
R Markdown makes use of Pandoc's markdown formatting. We've seen a lot of the
basic components to format our text so far, but to see the complete list,
please visit the [official documentation](https://pandoc.org/MANUAL.html#pandocs-markdown).
Before we start, everybody can do this 10-minute tutorial on Markdown:
<https://commonmark.org/help/tutorial/>
### Lists, Tables, In-Line Code
#### Lists
**Unordered Lists (i.e. bullets)**
```
- Unordered list item
    - Unordered list sub-item
- Unordered list item
```

- Unordered list item
    - Unordered list sub-item
- Unordered list item

**Ordered Lists (i.e. numbers)**
```
1. Ordered list item
    1. Ordered list sub-item
    2. Ordered list sub-item
2. Ordered list item
```

1. Ordered list item
    1. Ordered list sub-item
    2. Ordered list sub-item
2. Ordered list item

#### Tables
The `knitr` package has a function called `kable` that helps to display tables
from an `r` code chunk nicely. It is best to use `echo=FALSE` and `results='asis'`.
<pre class="markdown"><code>```{r, echo=FALSE, results='asis'}
library(knitr)
kable(head(mtcars))
```
</code></pre>
```{r, echo=FALSE, results='asis'}
library(knitr)
kable(head(mtcars))
```
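
`kable` also takes a few arguments for tidying up the table. The sketch below, which assumes only the built-in `mtcars` dataset, rounds the numbers, renames the columns, and adds a caption; the column labels and caption text are made up for this example.

```r
library(knitr)

# Keep a few columns and the first six rows
cars_subset <- head(mtcars[, c("mpg", "cyl", "wt")])

cars_table <- kable(
  cars_subset,
  digits = 1,  # round numeric columns to one decimal place
  col.names = c("Miles per gallon", "Cylinders", "Weight (1000 lbs)"),
  caption = "First six rows of the mtcars dataset"
)

cars_table
```
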
You can also set the default data frame printing via the `df_print` option in
your YAML metadata under `output` to do this automatically.
```
---
title: Document
output:
  html_document:
    df_print: kable
---
```
#### In-line Code
If you want to state a value from your data in your text, it is best to refer
to the actual variable or code that produces the value rather than writing it
out by hand, so the text stays correct when the data changes. Here is an example:
<code> There are `r nrow(df)` samples in this experiment. </code>
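
One way to keep such sentences in sync with the data is to compute the values in a chunk first and refer to the resulting variables inline. A small sketch, using the built-in `mtcars` dataset as a stand-in for your own data (the variable names `n_samples` and `mean_mpg` are made up for this example):

```r
# Compute the values once, in a code chunk near the top of the document
n_samples <- nrow(mtcars)                # number of rows (samples)
mean_mpg  <- round(mean(mtcars$mpg), 1)  # mean fuel economy, one decimal

# These variables can then be used as inline code in the text
# instead of typing the numbers by hand.
n_samples
mean_mpg
```

If the data changes, the sentence updates the next time the document is knit.
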
### Images and Figures
To include an image, use the following syntax, which specifies the caption of
your image and the path to the image file to display:
```
![caption for my image](path/to/image.png)
```
If you would like the caption you wrote to be included underneath your image,
put the following in your YAML metadata:
```
---
title: Document
output:
  html_document:
    fig_caption: yes
---
```
As you've seen in your assignments, both the code and the output (such as
figures) of an `r` code chunk show up in your output document by default. With
the chunk options `echo` and `eval`, you can hide the code underlying a graph
and show only the resulting plot (`echo=FALSE`), or show the code without
running it (`eval=FALSE`). For figures you usually want the former, which looks like this:
<pre class="markdown"><code>```{r, eval=TRUE, echo=FALSE}
library(ggplot2)
qplot(mpg, wt, data=mtcars)
```
</code></pre>
```{r, eval=TRUE, echo=FALSE}
library(ggplot2)
qplot(mpg, wt, data = mtcars)
```
You can also set the figure height and width (in inches) using the `fig.height`
and `fig.width` chunk options.
<pre class="markdown"><code>```{r, fig.width=7, fig.height=7, eval=TRUE, echo=FALSE}
library(ggplot2)
qplot(mpg, wt, data=mtcars)
```
</code></pre>
```{r, fig.width=7, fig.height=7, eval=TRUE, echo=FALSE}
library(ggplot2)
qplot(mpg, wt, data = mtcars)
```
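
Rather than repeating the same options in every chunk, you can set them once for the whole document. A minimal sketch, assuming you place this in a setup chunk near the top of your .Rmd file; the specific values are only examples:

```r
library(knitr)

# Set default chunk options for all chunks that follow.
# Individual chunks can still override these in their own header.
opts_chunk$set(
  echo = FALSE,    # hide the code, show only the output
  fig.width = 7,   # default figure width in inches
  fig.height = 7   # default figure height in inches
)

# Check what the current default is
opts_chunk$get("fig.width")
```
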
### In-Line Citations & Bibliography
Pandoc can automatically generate citations and a bibliography in a number of
styles. In order to use this feature, you will need to specify a bibliography
file using the bibliography metadata field in a YAML metadata section.
For example:
```
---
title: "Sample Document"
output: html_document
bibliography: bibliography.bib
---
```
Many bibliography formats are accepted (see the [R Markdown guides](https://rmarkdown.rstudio.com/authoring_bibliographies_and_citations.html)),
such as a .bib file which many citation managers can generate for you and will
hold all the references you need for your document.
These are some great open source reference managers you can use to manage your
references, all of which make it pretty easy to export to a BibTeX (.bib) file:
- [Zotero](https://www.zotero.org/)
- [JabRef](https://www.jabref.org/)
- [Docear](https://www.docear.org/)
Every entry in your bibliography file should have a short citation key which,
when preceded by `@`, lets you cite that entry in-line. Citations usually
go within square brackets and are separated by semicolons.
```
One Citation:
Some fact [@Smith2014]
Multiple Citations:
Statement [@Smith2014; @Logan1997].
```
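
If you do not have a .bib file yet, one way to get something to experiment with is `write_bib()` from `knitr`, which writes BibTeX entries for R packages. This is only a sketch: the file name matches the YAML example above, and the `R-` prefix on the keys is the function's default rather than anything Pandoc requires.

```r
library(knitr)

# Write BibTeX entries for base R and knitr to a .bib file.
# By default, each key is the package name prefixed with "R-".
write_bib(c("base", "knitr"), file = "bibliography.bib")

# Inspect the entry headers (and therefore the keys) that were generated
bib_keys <- grep("^@", readLines("bibliography.bib"), value = TRUE)
bib_keys
```

The packages can then be cited in the text as `[@R-base; @R-knitr]`.
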
To format your bibliography, you may also want to specify a citation style
(in the form of a .csl file). The official repository of recognized citation
styles is available [here](https://github.com/citation-style-language/styles).
Because of the permissive licensing, you can customize styles or make your own
too! [This visual editor](https://editor.citationstyles.org/visualEditor/) is a
great tool for modifying styles to your liking. Download the .csl file and put it
somewhere you will remember and can access later.
I like to keep it in the same folder as my project.
Once you have the correct .csl file, you can specify it in your YAML metadata as follows:
```
---
title: "Sample Document"
output: html_document
bibliography: bibliography.bib
csl: nature.csl
---
```
### Footnotes
```
This is what a footnote looks like.[^1] Here is another.[^2]

[^1]: My first footnote.
[^2]: My second footnote.
```
This will produce the following:

This is what a footnote looks like.[^1] Here is another.[^2]

[^1]: My first footnote.
[^2]: My second footnote.

These are great because you can reference your footnotes by name and don't have to
re-number them if things get reordered. The numbers in the rendered document will
be reordered for you, by order of occurrence! From the [Pandoc documentation](https://pandoc.org/MANUAL.html#footnotes):
> The identifiers in footnote references may not contain spaces, tabs, or newlines.
> These identifiers are used only to correlate the footnote reference with the
> note itself; in the output, footnotes will be numbered sequentially.
### Output Formatting Options
#### Changing the output file format
To change the output format of your .Rmd file, try changing the `output`
metadata in the YAML header from "html_document" to "word_document".
In order to output to "pdf_document", you need to have a LaTeX engine
installed.
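
The output format can also be chosen from the R console without editing the YAML header, via the `output_format` argument of `rmarkdown::render()`. The sketch below writes a tiny throwaway .Rmd file to a temporary directory so that it is self-contained; the file contents are made up for illustration, and rendering requires Pandoc (bundled with RStudio).

```r
library(rmarkdown)

# Create a tiny throwaway .Rmd file in a temporary directory
rmd_file <- file.path(tempdir(), "example.Rmd")
writeLines(c(
  "---",
  "title: \"Example\"",
  "---",
  "",
  "Some text."
), rmd_file)

# Render the same source to two different formats.
# render() returns the path of the file it created.
if (pandoc_available()) {
  html_file <- render(rmd_file, output_format = "html_document", quiet = TRUE)
  word_file <- render(rmd_file, output_format = "word_document", quiet = TRUE)
  print(basename(c(html_file, word_file)))
}
```
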
#### Table of Contents
To add a table of contents generated from the headers of your document, set
the `toc` option to `true` and specify the depth of headers to list via
`toc_depth` (the default is 3 for `html_document`). The sections can also be
numbered by using the `number_sections` option.
```
---
title: "Making TOCs"
output:
  pdf_document:
    toc: true
    toc_depth: 2
    number_sections: true
---
```
#### Figure Options
Some figure options can be set in the YAML header.
- `fig_width` and `fig_height` can be used to control the default figure width
and height in inches (for `html_document`, 7x5 is used by default)
- `fig_caption` controls whether figures are rendered with captions
### Exercise
1. Open up RStudio and make a new R Markdown file.
2. Set up the YAML metadata:
    - title is "Lecture 17 Exercise"
    - author (you)
    - output to html
    - make a table of contents
    - figure height and width should be set to 10
3. Make a header called "Beavers Plot" and below it create any simple plot using
the `beaver1` dataset (see `?beavers`) in an R chunk where the code is
suppressed, but the plot is shown.
4. Make a header called "R Markdown" and below it, recreate the following:

- Fruits
    1. Apple
    2. **Orange**
    3. Banana
    4. Tomato[^3]
- Vegetables
    1. *Brussel Sprouts*
    2. Carrots

[^3]: Often confused for a vegetable

- First 6 rows of Iris Dataset

```{r, results='asis', echo=FALSE}
library(knitr)
kable(head(iris))
```
5. Now Knit the file to html and compare with your neighbour to see if you got
the same output and work together to fix any issues.
See the code chunk below for the **solution**.
<pre class="markdown"><code>
---
title: "Lecture 17 Exercise"
author: "Lina Tran"
output:
  html_document:
    fig_width: 10
    fig_height: 10
    toc: TRUE
---
# Beavers Plot
```{r eval=TRUE, echo=FALSE}
library(ggplot2)
qplot(time, temp, data=beaver1)
```
# R Markdown
- Fruits
    1. Apple
    2. **Orange**
    3. Banana
    4. Tomato[^3]
- Vegetables
    1. *Brussel Sprouts*
    2. Carrots

[^3]: Often confused for a vegetable

- First 6 rows of Iris Dataset

```{r, results='asis', echo=FALSE}
library(knitr)
kable(head(iris))
```
</code></pre>
## Resources
- The [R Markdown documentation](https://rmarkdown.rstudio.com/lesson-1.html)
from RStudio can be a very helpful guide.