---
title-block-style: default
---
# Fundamentals {#sec-fundamentals}
## Project-oriented workflow
:::{.callout-note}
This subsection is largely based on Jenny Bryan's [What They Forgot (WTF) to Teach You About R](https://rstats.wtf/index.html).
:::
1 - Organize your work into a project.
This means storing all your data, code, figures, notes, and related material in the same folder and its subfolders.
2 - RStudio **P**rojects enforce this approach.
When you create a new project in RStudio, it creates a folder with some metadata and user options for that specific project (stored in the .Rproj file inside the folder it created for the project).
3 - RStudio Projects establish a working directory and use relative file paths by default. Usually this is what you want so when you share a project or move it from one computer to the next, it. just. works. This is also why it is critical to store your data and scripts within the project.
A typical project might have a file and folder structure like this:
<div class="cell">
<div class="sourceCode">
<pre class="sourceCode r cell-code">
<code class="sourceCode">
<i class="bi bi-folder-minus"></i> Project Folder
|
|--<i class="bi bi-folder-minus"></i> Data
| |
| |--<i class="bi bi-file-spreadsheet"></i> Raw Data
| |--<i class="bi bi-file-spreadsheet"></i> More Raw Data
|
|--<i class="bi bi-folder-minus"></i> Scripts
| |
| |--<i class="bi bi-file-text"></i> Analysis.R file
| |--<i class="bi bi-file-text"></i> Figures.R file
|
|--<i class="bi bi-folder-minus"></i> Figures
| |
| |--<i class="bi bi-file-earmark-image"></i> Plot
| |--<i class="bi bi-file-earmark-image"></i> Another figure
|
|--<i class="bi bi-folder-minus"></i> Exported Data
| |
| |--<i class="bi bi-file-spreadsheet"></i> Results
|
|--<i class="bi bi-folder-minus"></i> Reports
| |
| |--<i class="bi bi-file-text"></i> Report
|
|--<i class="bi bi-file-text"></i> Readme file (usually .Rmd, .md, or .txt file)
</code>
</pre>
</div>
</div>
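As a minimal sketch of why relative paths travel well, the chunk below builds a path from folder and file names with `file.path()` and reads a file through it. The folder `example_project` and the file `raw_data.csv` are made-up names, and a temporary folder stands in for a real project folder; `setwd()` is used here only to imitate what opening an RStudio Project does for you.

```{r}
## hypothetical project folder created in a temporary location
project_dir <- file.path(tempdir(), "example_project")
dir.create(file.path(project_dir, "Data"), recursive = TRUE, showWarnings = FALSE)

## imitate opening the project: the project folder becomes the working directory
old_wd <- setwd(project_dir)

## write and read a small file using a path relative to the project folder
write.csv(data.frame(site = c("a", "b"), flow = c(1.2, 3.4)),
          file = file.path("Data", "raw_data.csv"),
          row.names = FALSE)
raw_data <- read.csv(file.path("Data", "raw_data.csv"))
raw_data

## restore the previous working directory
setwd(old_wd)
```

Because the path starts at the project folder rather than at a drive or user folder, the same line of code keeps working when the whole folder is copied to another computer.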
### Your RStudio Project
Start a new project! Open RStudio and, in the upper left, click "File" -> "New Project." We generally want to start a project in a New Directory, so click that. On the next window click New Project. Now you can choose the subdirectory name of your project (folder name) followed by where you want that subdirectory to be stored. Click "Create Project" and RStudio creates the subdirectory and puts a .Rproj file with specific project info in there for you.
![Example RStudio workspace](images/example-1.png){fig-alt="A screenshot of the RStudio program with different areas of the workspace noted."}
The RStudio workspace includes 3 major components. In the upper left, the script area shows the content of open R scripts (or any text based file that you open will show up here). You can edit, save, and run lines of code from this window.
At the bottom left is the R console. This is where R operates. Code you run from a script is sent to the console, where R evaluates it. Output, messages, and warnings from your R code will usually show up here.
At the bottom right are several tabs. This is where graphical outputs are displayed. There are also tabs for files, packages, and help. The files tab lets you navigate, create, delete, and open files and folders. It defaults to your project's working directory. The packages tab is for exploring the packages you have installed, more on that below. Technically you can load and unload packages from here by clicking boxes next to each package. Don't do that. The help tab is just that: it lets you search functions in each package and displays the documentation for packages and functions. Learn to use this tab, it will help you just like it says!
### Running Code
You should generally write your code in the script window and execute it from there. This will save you from retyping code again and again.
If you have your cursor on an expression in your R script, use the keyboard shortcut Ctrl+Enter (Cmd+Return on macOS) to execute that expression. The code will run in the console and the cursor will automatically move to the next statement. If you want to execute the entire script at once, use the keyboard shortcut Ctrl+Shift+S (Cmd+Shift+S on macOS).
### Basic coding
Boxes with the grey background and blue vertical bar indicate chunks of R code. If there is an output when that code chunk is run by R, the output (text output, tables or figures) will follow directly below the chunk. For example, here is a code chunk:
```{r}
#| echo: true
#| eval: false
10*100
```
And this is the output:
```{r}
#| echo: false
#| eval: true
10*100
```
:::{.callout-note}
Much of this subsection is from [R for Data Science](https://r4ds.had.co.nz/workflow-basics.html) which you are encouraged to explore.
:::
This: `<-`, is called an assignment operator in R. We use it to create objects by assigning a value to a variable name. We name objects so we can easily refer to whatever we assigned later on in the script:
```{r}
x <- 10
y <- 100
x * y
```
You don't have to assign numbers:
```{r}
x <- "Hello"
print(x)
```
Assignment operators go in either direction. You might find it useful to use the left-to-right assignment operator in some situations:
```{r}
"Hello" -> x
print(x)
```
However, for the most part, standard practice is to assign right to left so you can easily find the variable name receiving the value. Whatever you choose, use the same direction throughout your project.
As your scripts get more complicated, it is important to use descriptive object names. Object names should start with a letter and can only contain letters, numbers, `_`, and `.`, so we recommend using "snake_case" to create object names:
```
streamflow
streamflow_site_a
```
Object names are case sensitive, `streamflow_site_a` is not the same as `streamflow_site_A`.
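A quick way to see case sensitivity in action is to create one of the two names and then ask R whether each exists. This is only an illustrative sketch; the value assigned is arbitrary.

```{r}
streamflow_site_a <- 12.5
## the lower case name exists
exists("streamflow_site_a")
## the name ending in a capital A was never created
exists("streamflow_site_A")
```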
Anything that follows a `#` symbol on a line is ignored by R, so it is used to include human-readable comments in your script. Use comments liberally in your code.
```{r}
#| code-overflow: wrap
#| eval: false
## I can write what I want
## and R does not evaluate
## this
a <- 1
a
```
## More About Objects
Before diving too much into R programming, it is worth becoming familiar with some of the basic types of objects in R. Function arguments might only accept specific types of objects and return specific types of objects. Being able to identify object types is fundamental for troubleshooting errors. This is not an exhaustive list, but objects are commonly one of:
- A single *variable* (stored by R as a vector of length one) that can be character, double, integer, raw, logical, or complex, e.g. `x <- 1`.
- *Vectors* are a one-dimensional set of homogeneous data types. Typically constructed using the `c()` function: `x <- c(1,2,3)`.
- *Lists* are similar to a vector but group together R objects. Lists can include different types of objects and are constructed using the `list()` function: `list_1 <- list(c(1,2,3,4,5), c("A", "B", "C"), list(TRUE, FALSE))`.
- *Data Frames* are two-dimensional lists or more easily thought of as tabular data similar to a spreadsheet. The data frame represents a list of variables which have the same number of rows. Columns can be of differing types as long as they have the same number of rows. The data frame and its cousin the *tibble* are extensively used in data analysis workflows. Data frames are constructed with the `data.frame()` function: `df <- data.frame(x = c(1,2,3), y = c("a", "b", "c"))`.
There are a number of other object types that I won't go into detail on. Matrices and arrays are multi-dimensional objects similar to matrices from linear algebra.
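The short sketch below pokes at the object types listed above. The object names are made up for illustration; `typeof()` reports how R stores a value, and mixing types inside `c()` quietly converts everything to a single type, which is a common source of surprises.

```{r}
## a "single variable" is really a vector of length one
x <- 1
typeof(x)
length(x)

## vectors hold one type; mixing types coerces to the most flexible one
mixed <- c(1, "a", TRUE)
class(mixed)

## lists keep each element's own type
list_1 <- list(1, "a", TRUE)
sapply(list_1, class)

## a matrix is a vector with row and column dimensions
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)
```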
Most R objects have something called a *class* attribute. You can use the `class()` function to inspect any object:
```{r}
class("Hello")
df <- data.frame(x = c(1,2,3), y = c("a", "b", "c"))
class(df)
```
### Factors
Factors are a way of storing categorical information in R. Factors often appear to be character objects, but R stores the information as levels and numbers, which facilitates statistical modeling of categorical variables. Even if you won't be modeling categorical variables, factors are valuable for storing the order in which to display information if that variable has some inherent order to it (something like drought category, for example). Factors are created with the `factor()` function. When you use `factor()`, R will recode the data as integers and store an additional attribute with the data called *levels* that stores the labels associated with each integer:
```{r}
difficulty <- factor(c("Hard", "Easy", "Easy", "Medium", "Medium"))
class(difficulty)
```
R orders these alphabetically by default. This makes sense if there is no other inherent ordering, but in this case we would probably want to order it something like: Easy, Medium, Hard. The use of this makes more sense once you start plotting or summarizing data into tables:
```{r}
## shows the class and levels
attributes(difficulty)
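## to reorder the levels, rebuild the factor with the levels in the
## order you want (assigning to levels() would rename the values instead)
difficulty <- factor(difficulty, levels = c("Easy", "Medium", "Hard"))
## or tell R the order when you make the factor
factor(c("Hard", "Easy", "Easy", "Medium", "Medium"),
       levels = c("Easy", "Medium", "Hard"))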
```
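To see what the level order actually changes, the sketch below builds the factor with explicit levels and then looks at the labels, the integer codes stored behind them, and a summary table.

```{r}
difficulty <- factor(c("Hard", "Easy", "Easy", "Medium", "Medium"),
                     levels = c("Easy", "Medium", "Hard"))
## the labels, in the order given
levels(difficulty)
## the integer codes R stores behind the labels
as.integer(difficulty)
## summaries follow the level order rather than alphabetical order
table(difficulty)
```

Plots and tables built from this factor will list Easy first and Hard last, which is usually what a reader expects.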
### Dates
There are specific classes for date and time objects in R. Use the *Date* class to represent dates in R using the `as.Date()` function.
The character representation of Dates always defaults to `"yyyy-mm-dd"` format:
```{r}
as.Date("2021-01-01")
```
If your date string is in a different format, you need to specify it in the `format` argument. The help documentation for the `strptime()` function provides details on how to specify the format. An example is shown for dates entered as `"mm/dd/yyyy"`:
```{r}
as.Date("01/01/2021",
        format = "%m/%d/%Y")
```
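The same format codes work in both directions. As a small sketch, the chunk below parses a month/day/year string, writes the date back out in a different layout with `format()`, and shows what happens when the format does not match the string.

```{r}
## parse a month/day/year string
d <- as.Date("01/15/2021", format = "%m/%d/%Y")
d
## write the same date back out in another layout
format(d, "%d.%m.%Y")
## a format that does not match the string returns NA, not an error
as.Date("15/01/2021", format = "%m/%d/%Y")
```

Since a mismatched format fails quietly, it is worth checking for `NA` values after converting a column of date strings.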
We can subtract two dates and get a *difftime* object that tells us the difference in days:
```{r}
n <- Sys.Date() - as.Date("2021-01-01")
n
## or specify units
difftime(Sys.Date(), as.Date("2021-01-01"), units = "weeks")
```
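A *difftime* object carries its units along with it. If you need a plain number for later calculations, one option is `as.numeric()`, sketched here with two fixed dates so the result does not depend on today's date.

```{r}
elapsed <- difftime(as.Date("2021-03-01"), as.Date("2021-01-01"), units = "days")
elapsed
## drop the units to get a plain number for further calculations
as.numeric(elapsed)
## or ask for the number in other units
as.numeric(elapsed, units = "weeks")
```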
If you have date-time strings, the *POSIXct* class is your friend. The default format is shown below, and the `strptime()` documentation provides the details on how to specify different formats. The time recorded in your data probably depends on a time zone and might shift with daylight saving time (or might not, depending on the device collecting the data). Use the `tz` argument to specify the time zone. If you are setting up a device to collect data that will be read into R, I highly recommend setting up the device to record times with a UTC offset.^[See: https://en.wikipedia.org/wiki/List_of_UTC_offsets]
```{r}
as.POSIXct("2021-06-01 12:00:00", tz = "UTC")
## example of UTC offset format
as.POSIXct("2021-06-01 06:00:00 -0600",
           format = "%Y-%m-%d %H:%M:%S %z",
           tz = "UTC")
```
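Storing times in UTC does not stop you from reporting them in local time. As a sketch, `format()` can print a *POSIXct* value in another time zone; `"America/Chicago"` is just an assumed example zone, and the zone names available depend on the time zone database on your system.

```{r}
## a noon UTC timestamp
t_utc <- as.POSIXct("2021-06-01 12:00:00", tz = "UTC")
## print it in US Central time (an assumed example zone)
format(t_utc, tz = "America/Chicago", usetz = TRUE)
## the underlying value is seconds since 1970-01-01 UTC, so it does not change
as.numeric(t_utc)
```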
## Functions
Functions are essentially tools that take input *arguments* and output some kind of value. Functions are the basis for most everything you do in R. For example, `seq()` is a function to generate a regular sequence of numbers. You can get to the help documentation by entering `?seq` in the console. It takes the arguments `from`, `to`, `by`, `length.out`, and `along.with`. Use `=` to set argument values:
```{r}
#| code-overflow: wrap
#| eval: true
seq(from = 0, to = 10, by = 2)
```
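You rarely need every argument at once. The sketch below uses `length.out` in place of `by`, and then shows that arguments can be matched by position instead of by name.

```{r}
## ask for a number of values instead of a step size
seq(from = 0, to = 1, length.out = 5)
## arguments can also be matched by position: from, to, by
seq(0, 10, 2)
```

Naming arguments takes more typing but makes the code easier to read later, especially for functions with many arguments.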
Writing your own functions is one of the reasons for using R. Here is a simplistic function that generates a message in the console screen depending on the condition of the first argument.
```{r}
print_hello <- function(x) {
  if (x < 1) {
    message("Hello!")
  } else {
    message("Bye")
  }
}
print_hello(x = -1)
print_hello(x = 1)
```
Why write a function in the first place? Sometimes you might need to repeatedly run the same set of functions on different data or subsets of data. You will find yourself copying and pasting code and changing some values within it. If the output depends on some values you forgot to change when you copied and pasted, instant problems! Functions let you skip that copy and paste action and just update the arguments. Here is an example of some code to calculate the confidence interval around the mean for a vector of numbers:
```{r}
min <- 0
max <- 10
n <- 1000
ci <- 0.95
x <- runif(n = n, min = min, max = max)
se <- sd(x)/sqrt(length(x))
alpha <- 1 - ci
mean(x) + se * qnorm(c(alpha/2, 1-alpha/2))
```
If we need to recalculate the confidence interval for different values or combinations of values of `min`, `max`, `n`, and `ci`, we would have to copy and paste the chunk each time, with the potential for data entry errors if the wrong values are entered. Instead, create a function and change the arguments as needed.
```{r}
ci <- function(min, max, n, ci) {
  x <- runif(n = n, min = min, max = max)
  se <- sd(x)/sqrt(length(x))
  alpha <- 1 - ci
  mean(x) + se * qnorm(c(alpha/2, 1-alpha/2))
}
ci(min = 0, max = 10, n = 1000, ci = 0.95)
ci(min = 10, max = 100, n = 1000, ci = 0.90)
ci(min = 10, max = 1000, n = 1000, ci = 0.80)
```
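One possible refinement, offered as a sketch and not as the only way to do it, is to separate generating the data from calculating the interval. The hypothetical function `mean_ci()` below takes any numeric vector, gives the confidence level a default value, and checks its inputs before doing any work.

```{r}
## mean_ci() is a made-up name for illustration
mean_ci <- function(x, level = 0.95) {
  ## stop early with an error if the inputs are not usable
  stopifnot(is.numeric(x), length(x) > 1, level > 0, level < 1)
  se <- sd(x) / sqrt(length(x))
  alpha <- 1 - level
  mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}

## set a seed so the random numbers are reproducible
set.seed(42)
x <- runif(n = 1000, min = 0, max = 10)
## uses the default level of 0.95
mean_ci(x)
## override the default
mean_ci(x, level = 0.80)
```

Written this way, the same function works on measured data as well as simulated data, and the default argument saves typing for the most common case.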
## Packages
Packages might be considered the toolboxes of R. They are generally a collection of functions and classes that expand the capabilities of the base R functions. Many packages have *dependencies* on other packages. This means when you install one package, you may end up automatically installing multiple other packages that it requires to work. Normally this works without a hiccup. However, before installing packages, I suggest restarting your R session and making sure no packages are currently loaded to prevent issues.
Most packages can and should be installed from CRAN, a network of repositories that host the official, up-to-date, and approved packages for R. These packages are pre-built, meaning you are unlikely to run into issues on installation. To install packages from CRAN, you typically do something like the following:
```{r}
#| code-overflow: wrap
#| eval: false
## install one package
install.packages("ggplot2")
## install multiple packages
install.packages(c("dplyr", "tidyr"))
```
Sometimes you need a package or package version that is not currently available on CRAN. There are various justifiable reasons the packages might not be available on CRAN; however, one of the benefits of using CRAN packages is that they are all reviewed by a person before acceptance. This provides a safety mechanism for not only the quality of the package but potential security issues.
:::{.callout-note}
If you are installing a package from GitHub or other source, please review it for safety and quality before installation.
:::
There are two primary ways to install non-CRAN packages. The preferred method is to install pre-built packages from an alternative repository like r-universe. The readme file associated with the package will generally inform you if the package is available on a different repository and how to install it from that repository.
An example of this is shown below for the *adc* package:
```{r}
#| code-overflow: wrap
#| eval: false
install.packages('adc', repos = c(txwri = 'https://txwri.r-universe.dev'))
```
An alternative option is to download and build the packages from the source, such as GitHub. For those on Windows, you will need to install the [RTools toolchain](https://cran.r-project.org/bin/windows/Rtools/). Then, we can use the *remotes* package to download, build and install a package from GitHub:
```{r}
#| code-overflow: wrap
#| eval: false
install.packages("remotes")
remotes::install_github("mps9506/rATTAINS")
```
After you install a package, you need to load the package in order to use its functions. Confusingly, you use the `library()` function to accomplish this. Standard practice is to load packages at the top of your script:
```{r}
#| code-overflow: wrap
#| eval: false
library(dplyr)
library(ggplot2)
library(readr)
```
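Loading a package with `library()` is not the only way to reach its functions. As a small sketch, the `package::function()` form (the same form used for `remotes::install_github()` above) calls a function without loading the package, and `requireNamespace()` reports whether a package is installed. The *stats* package is used here only because it ships with R.

```{r}
## call a function from a package without loading the package
stats::median(c(1, 5, 10))
## returns TRUE if the package is installed and can be loaded
requireNamespace("stats", quietly = TRUE)
```

The `::` form is handy when you need only one function from a package, or when two loaded packages have functions with the same name.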
## Other Coding Conventions
### Pipe Function
Many of the examples in this manual use something called the *pipe* operator (either `%>%` from the *magrittr* package or the native `|>` available in R 4.1.0 and newer). The pipe takes the output of a function and makes it the input of the next function without writing it to an object in the environment. The primary advantage of this is when you have many data processing steps in a row and you don't want to write or overwrite an object at every step. In this manual we will use the ` |> ` operator.
Here is a short example where we take the mean of the log of a sample of numbers:
```{r}
## generate a numeric vector of log-normally distributed random
## values whose mean on the log scale is 100
x <- rlnorm(100, meanlog = 100)
## calculate the mean of the log-transformed values
mean(log(x))
```
This is a little messy. If we had multiple steps, this could make the code difficult to interpret. The pipe separates this into steps. The code below pipes `x` into the first argument of `log()`, and the output of `log()` is piped into the first argument of `mean()`. This delineates each step in the data workflow.
```{r}
## this is the same as above
x |>
  log() |>
  mean()
```
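A smaller example with numbers you can check by hand may help. This sketch assumes R 4.1.0 or newer for the `|>` operator, and it also shows that any extra arguments still go inside the function call as usual.

```{r}
## log10 of 1, 10, and 100 is 0, 1, and 2
c(1, 10, 100) |>
  log10() |>
  mean()
## extra arguments go inside the call as usual
c(1, NA, 3) |>
  mean(na.rm = TRUE)
```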
If it isn't the first argument that needs the output from the previous function, we use the placeholder `_` (available in R 4.2.0 and newer), which must be supplied to a named argument. A good example is the function to fit a linear regression, `lm()`, which requires the regression formula as the first argument and the data as the second argument.
We generate a vector of random predictor data (`x`) from a uniform distribution using `runif()`. A vector `b` of coefficients normally distributed around a mean of 10 is generated using `rnorm()`. The response is generated using `x * b`. The linear regression should return a coefficient for `x` that is near 10.
```{r}
## generate a pretend response variable
x <- runif(100, min = 10, max = 100)
b <- rnorm(n = 100, mean = 10, sd = 2)
data.frame(x = x,
           y = x * b) |>
  ## fit a regression model, notice the "_" placeholder
  lm(y ~ x, data = _) |>
  ## return the regression summary
  summary()
```
## Suggested RStudio Settings
:::{.callout-note}
## Incomplete
I still need to add discussion on
- setting global and project options in RStudio
- using ragg graphics device (this might be better in figure section)
:::