---
layout: page
title: Programming with R
subtitle: Loops in R
minutes: 30
---
```{r, include = FALSE}
source("tools/chunk-options.R")
```
> ## Learning Objectives {.objectives}
>
> * Compare loops and vectorized operations
> * Use the apply family of functions

In R you have multiple options when repeating calculations: vectorized operations, `for` loops, and `apply` functions.
This lesson is an extension of [Analyzing Multiple Data Sets](03-loops-R.html).
In that lesson, we introduced how to run a custom function, `analyze`, over multiple data files:
```{r analyze-function}
analyze <- function(filename) {
  # Plots the average, min, and max inflammation over time.
  # Input is character string of a csv file.
  dat <- read.csv(file = filename, header = FALSE)
  avg_day_inflammation <- apply(dat, 2, mean)
  plot(avg_day_inflammation)
  max_day_inflammation <- apply(dat, 2, max)
  plot(max_day_inflammation)
  min_day_inflammation <- apply(dat, 2, min)
  plot(min_day_inflammation)
}
```
```{r files}
filenames <- list.files(path = "data", pattern = "inflammation", full.names = TRUE)
```
### Vectorized Operations
A key difference between R and many other languages is a topic known as vectorization.
When you wrote the `total` function, we mentioned that R already has `sum` to do this; `sum` is *much* faster than the interpreted `for` loop because `sum` is coded in C to work with a vector of numbers.
Many of R's functions work this way; the loop is hidden from you in C.
Learning to use vectorized operations is a key skill in R.
For example, to add pairs of numbers contained in two vectors
```{r}
a <- 1:10
b <- 1:10
```
you could loop over the pairs adding each in turn, but that would be very inefficient in R.
```{r}
res <- numeric(length = length(a))
for (i in seq_along(a)) {
  res[i] <- a[i] + b[i]
}
res
```
Instead, `+` is a *vectorized* function that can operate on entire vectors at once:
```{r}
res2 <- a + b
all.equal(res, res2)
```
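To get a rough sense of what the interpreted loop costs, the sketch below times a hand-written sum against the built-in `sum`. The function name `loop_total` and the vector size are assumptions made for illustration, and the timings will vary with your machine and R version.

```{r}
# Hypothetical loop-based sum, written only for this comparison
loop_total <- function(vec) {
  result <- 0
  for (v in vec) {
    result <- result + v
  }
  result
}

x <- as.numeric(1:1e6)  # illustrative size, chosen arbitrarily

system.time(loop_res <- loop_total(x))  # loop runs in interpreted R code
system.time(vec_res <- sum(x))          # loop runs inside compiled C code

# Both approaches should give the same answer
all.equal(loop_res, vec_res)
```

Both calls compute the same total; only where the loop runs differs.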
### Vector Recycling
When performing vector operations in R, it is important to know about recycling. If you perform an operation on two or more vectors of unequal length, R will recycle elements of the shorter vector(s) to match the longest vector. For example:
```{r}
a <- 1:10
b <- 1:5
a + b
```
The elements of `a` and `b` are added together starting from the first element of both vectors. When R reaches the end of the shorter vector `b`, it starts again at the first element of `b` and continues until it reaches the last element of the longer vector `a`. This behaviour may seem crazy at first glance, but it is very useful when you want to perform the same operation on every element of a vector. For example, say we want to multiply every element of our vector `a` by 5:
```{r}
a <- 1:10
b <- 5
a * b
```
Remember there are no scalars in R, so `b` is actually a vector of length 1; in order to multiply every element of `a` by its value, it is *recycled* to match the length of `a`.
When the length of the longer object is a multiple of the shorter object length (as in our example above), the recycling occurs silently. When the longer object length is not a multiple of the shorter object length, a warning is given:
```{r}
a <- 1:10
b <- 1:7
a + b
```
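One way to picture recycling is to write out the repetition by hand with `rep`. This is only a sketch of the behaviour described above, not how R implements it internally; the name `b_recycled` is introduced here for illustration.

```{r}
a <- 1:10
b <- 1:5

# Spell out the recycling by repeating b until it is as long as a
b_recycled <- rep(b, length.out = length(a))
b_recycled

# The explicit version gives the same result as letting R recycle b
identical(a + b, a + b_recycled)

# When the lengths do not divide evenly, the last repeat is cut short,
# which is the situation that triggers the warning
rep(1:7, length.out = 10)
```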
### `for` or `apply`?
A `for` loop is used to apply the same function calls to a collection of objects.
R has a family of functions, the `apply` family, which can be used in much the same way.
You've already used one member of the family, `apply`, in the first [lesson](01-starting-with-data.html).
The `apply` family members include
* `apply` - apply over the margins of an array (e.g. the rows or columns of a matrix)
* `lapply` - apply over an object and return a list
* `sapply` - apply over an object and return a simplified object (a vector, matrix, or array) if possible
* `vapply` - similar to `sapply` but you specify the type of object returned by the iterations
Each of these has an argument `FUN` which takes a function to apply to each element of the object.
Instead of looping over `filenames` and calling `analyze`, as you did earlier, you could `sapply` over `filenames` with `FUN = analyze`:
```{r, eval=FALSE}
sapply(filenames, FUN = analyze)
```
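To see how the return types of these functions differ, here is a small sketch on toy data. The list `samples` is made up for illustration and is not part of the inflammation data.

```{r}
# Hypothetical toy data: three numeric vectors of different lengths
samples <- list(first = 1:3, second = 4:9, third = c(2.5, 7.5))

# lapply always returns a list, one element per input
lapply(samples, FUN = mean)

# sapply simplifies the same results to a named numeric vector
sapply(samples, FUN = mean)

# vapply also checks that every result is a single number
vapply(samples, FUN = mean, FUN.VALUE = numeric(1))
```

`vapply` stops with an error if any result does not match `FUN.VALUE`, which makes it the safer choice inside functions.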
Deciding whether to use `for` or one of the `apply` family is largely a matter of personal preference.
Using an `apply` family function forces you to encapsulate your operations as a function rather than as separate calls inside a `for` loop.
`for` loops can be more natural in some circumstances; for several related operations, a `for` loop saves you from passing a lot of extra arguments to your function.
### Loops in R Are Slow
No, they are not! *If* you follow some golden rules:
1. Don't use a loop when a vectorized alternative exists
2. Don't grow objects (via `c`, `cbind`, etc.) during the loop - R has to create a new object and copy across the information just to add a new element or row/column
3. Allocate an object to hold the results and fill it in during the loop
As an example, we'll create a new version of `analyze` that will return the mean inflammation per day (column) of each file.
```{r}
analyze2 <- function(filenames) {
  for (f in seq_along(filenames)) {
    fdata <- read.csv(filenames[f], header = FALSE)
    res <- apply(fdata, 2, mean)
    if (f == 1) {
      out <- res
    } else {
      # The loop is slowed by this call to cbind that grows the object
      out <- cbind(out, res)
    }
  }
  return(out)
}
system.time(avg2 <- analyze2(filenames))
```
Note how we add a new column to `out` at each iteration.
This is a cardinal sin of writing a `for` loop in R.
Instead, we can create an empty matrix with the right dimensions (rows/columns) to hold the results.
Then we loop over the files but this time we fill in the `f`th column of our results matrix `out`.
This time there is no copying/growing for R to deal with.
```{r}
analyze3 <- function(filenames) {
  out <- matrix(ncol = length(filenames), nrow = 40) ## assuming 40 here from files
  for (f in seq_along(filenames)) {
    fdata <- read.csv(filenames[f], header = FALSE)
    out[, f] <- apply(fdata, 2, mean)
  }
  return(out)
}
system.time(avg3 <- analyze3(filenames))
```
In this simple example there is little difference in the compute time of `analyze2` and `analyze3`.
This is because we are only iterating over 12 files and hence we only incur 12 copy/grow operations.
If we were doing this over more files or the data objects we were growing were larger, the penalty for copying/growing would be much larger.
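The penalty is easier to see with more iterations. The following is a minimal sketch that does not use the inflammation files: the functions `grow` and `prealloc` and the iteration count `n` are assumptions made for illustration, and the timings will differ between machines.

```{r}
n <- 1e4  # illustrative number of iterations, chosen arbitrarily

# Grow the result with c() on every iteration
grow <- function(n) {
  out <- numeric(0)
  for (i in seq_len(n)) {
    out <- c(out, i^2)
  }
  out
}

# Allocate the result once, then fill it in
prealloc <- function(n) {
  out <- numeric(n)
  for (i in seq_len(n)) {
    out[i] <- i^2
  }
  out
}

system.time(res_grow <- grow(n))
system.time(res_pre <- prealloc(n))

# Both versions produce the same values
identical(res_grow, res_pre)
```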
Note that `apply` handles these memory allocation issues for you, but then you have to write the loop part as a function to pass to `apply`.
At its heart, `apply` is just a `for` loop with extra convenience.
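As a rough illustration of that last point, the sketch below computes column means twice: once with a hand-written `for` loop and once with `apply`. The matrix `m` is a made-up stand-in for one data file, and the loop only mimics what `apply` does for this simple case rather than reproducing its actual implementation.

```{r}
# Hypothetical stand-in for one data file: 5 rows by 4 columns
m <- matrix(1:20, nrow = 5, ncol = 4)

# Hand-written loop over columns, with the result allocated up front
col_means <- numeric(ncol(m))
for (j in seq_len(ncol(m))) {
  col_means[j] <- mean(m[, j])
}

# apply over the column margin gives the same values
all.equal(col_means, apply(m, 2, mean))
```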