# R programming
This chapter covers parts of base R that I find important and that are often overlooked by, or unknown to, many R users.
Learn more with the [Advanced R book](https://adv-r.hadley.nz/).
```{r, include=FALSE}
source("knitr-options.R")
source("spelling-check.R")
```
## Common mistakes
> If you are using R and you think you're in hell, [this is a map](http://www.burns-stat.com/pages/Tutor/R_inferno.pdf) for you.
>
> -- Patrick Burns
### Equality
```{r}
(0.1 + 0.2) == 0.3
print(c(0.1, 0.2, 0.3), digits = 20)
all.equal(0.1 + 0.2, 0.3) ## equality with some tolerance
all.equal(0.1 + 0.2, 0.3, tolerance = 0)
all.equal(0.1 + 0.2, 0.4)
isTRUE(all.equal(0.1 + 0.2, 0.4)) ## if you want a boolean, use isTRUE()
dplyr::near(0.1 + 0.2, 0.3) ## similar, from the {dplyr} package
```
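The first comparison fails because most decimal fractions, such as 0.1, have no exact binary floating-point representation. The same issue appears when looking for a value in a computed sequence. Here is a minimal sketch (the name `grid_values` and the tolerance `1e-8` are arbitrary choices for illustration):

```{r}
grid_values <- seq(0, 1, by = 0.1)
any(grid_values == 0.3)                 ## exact comparison misses the value
which(abs(grid_values - 0.3) < 1e-8)    ## comparison with a tolerance finds it
```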
### Arguments
```{r}
min(-1, 5, 118)
max(-1, 5, 118)
mean(-1, 5, 118)
median(-1, 5, 118)
```
How to explain the issue with `mean` and `median`? Let us look at the parameters of these functions:
```{r}
args(max)
args(mean)
args(median)
```
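Unlike `min()` and `max()`, whose first parameter is `...`, `mean()` and `median()` take a single data argument `x`. As a sketch of what happens with the default methods, the extra values are matched by position to the next parameters (`trim` and `na.rm`) instead of being treated as data:

```{r}
## same calls as above, with the argument matching written out
## (`trim` and `na.rm` are parameters of the default methods)
identical(mean(-1, 5, 118), mean(x = -1, trim = 5, na.rm = 118))
identical(median(-1, 5, 118), median(x = -1, na.rm = 5, 118))
c(mean(-1, 5, 118), median(-1, 5, 118))   ## only -1 is used as data
```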
One solution is to always use a vector:
```{r}
min(c(-1, 5, 118))
max(c(-1, 5, 118))
mean(c(-1, 5, 118))
median(c(-1, 5, 118))
```
### Others
```{r}
sample(1:10)
sample(10)
sample(10.1)
```
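The last two calls show that when `sample()` receives a single number `n` (at least 1), it samples from `1:n` rather than returning that number. This matters when the vector to sample from may have length one. A common workaround, sketched here with the hypothetical helper name `safe_sample()`, is to sample positions instead of values:

```{r}
safe_sample <- function(v, ...) v[sample.int(length(v), ...)]
candidates <- 10            ## a vector that happens to have length 1
safe_sample(candidates)     ## always 10, whereas sample(candidates) permutes 1:10
safe_sample(c(3, 8, 20))    ## a permutation of the three values
```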
```{r}
n <- 10
1:n-1 ## is (1:n) - 1, so 0:(n - 1)
1:(n-1)
seq_len(n - 1)
1:0
seq_len(0) ## prefer using seq_len(n) rather than 1:n (e.g. in for-loops)
seq_along(5:7) ## a shortcut for seq_len(length(.))
```
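The reason for preferring `seq_len()` shows up when the length is zero: `1:0` counts down and yields two values, so a loop over it runs twice instead of not at all. A minimal sketch (variable names are for illustration only):

```{r}
n_items <- 0
n_iter_colon <- 0
for (i in 1:n_items) n_iter_colon <- n_iter_colon + 1       ## i = 1, then i = 0
n_iter_seq <- 0
for (i in seq_len(n_items)) n_iter_seq <- n_iter_seq + 1    ## body never runs
c(n_iter_colon, n_iter_seq)
```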
## R base objects
### Types
There are several "atomic" types of data: `logical`, `integer`, `double` and `character` (in this order, see below). There are also `raw` and `complex`, but they are rarely used.
You cannot mix types in an atomic vector, but you can in a list. Coercion will automatically occur when you mix types in a vector:
```{r}
(a <- FALSE)
typeof(a)
(b <- 1:10)
typeof(b)
c(a, b) ## FALSE is coerced to an integer -> 0
(c <- 10.5)
typeof(c)
(d <- c(b, c)) ## coerced to double
c(d, "a") ## coerced to character
c(list(1), "a")
50 < "7" ## 50 is coerced to "50", then compared as strings: "50" < "7"
```
### Exercise
Use the automatic type coercion to convert this boolean matrix to a numeric one (with 0s and 1s). [What do you need to change in your code to get an integer matrix instead of a numeric one?]
```{r}
(mat <- matrix(sample(c(TRUE, FALSE), 12, replace = TRUE), nrow = 3))
```
## Base objects and accessors
### Objects
- "atomic" vector: vector of one base type (see above).
- scalar: this does not exist in R; it is simply a vector of length 1.
- matrices / arrays: **a vector** with some dimensions (attribute).
```{r}
(vec <- 1:12)
dim(vec) <- c(3, 4)
vec
class(vec)
dim(vec) <- c(3, 2, 2)
vec
class(vec)
```
- list: vector of elements with possibly different types in it.
- data.frame: **a list** whose elements all have the same length, and which is displayed somewhat like a matrix.
```{r}
head(iris)
dim(iris)
length(iris) ## a data.frame is also a list
```
### Accessors
1. The `[` accessor is used to access a subset of the data **with the same class**.
```{r}
(x <- 1:5)
x[2:3]
x[2:8] ## /!\ no warning
(y <- matrix(1:12, nrow = 3))
y[4:9] ## a matrix is also a vector
(l <- list(a = 1, b = "I love R", c = matrix(1:6, nrow = 2)))
l[2:3]
head(iris)
head(iris[3:4])
class(iris[5])
```
You can also use logical and character vectors to index these objects.
```{r}
(x <- 1:4)
x[c(FALSE, TRUE, FALSE, TRUE)]
x[c(FALSE, TRUE)] ## logical vectors are recycled
head(iris[c("Petal.Length", "Species")])
```
2. The `[[` accessor is used to access **a single element**.
```{r}
(x <- 1:10)
x[[3]]
l[[2]]
iris[["Species"]]
```
```{r, echo=FALSE, fig.cap="Indexing lists in R. [Source: https://goo.gl/8UkcHq]"}
knitr::include_graphics("https://pbs.twimg.com/media/DQ5en8XWAAICIaJ.jpg")
```
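The difference between the two accessors is clearest on a list: `[` returns a shorter list, whereas `[[` returns the element itself. A small sketch, using a throwaway list named `lst`:

```{r}
lst <- list(a = 1, b = "I love R")
class(lst[2])      ## "list": a list of length 1
class(lst[[2]])    ## "character": the element itself
```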
3. Beware of partial matching
```{r}
x <- list(aardvark = 1:5)
x$a
x[["a"]]
x[["a", exact = FALSE]]
```
4. Special use of the `[` accessor for array-like data.
```{r}
(mat <- matrix(1:12, 3))
mat[1, ]
mat[, 1:2]
mat[1, 1:2]
mat[1, 1:2, drop = FALSE]
(two_col_ind <- cbind(c(1, 3, 2), c(1, 4, 2)))
mat[two_col_ind]
mat[]
mat[] <- 2
mat
```
If you use arrays with more than two dimensions, simply add an additional comma for every new dimension.
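For instance, with a three-dimensional array (the array `arr` below is made up for illustration), each index position corresponds to one dimension:

```{r}
arr <- array(1:24, dim = c(3, 4, 2))
arr[1, , ]                       ## first row of every slice: a 4 x 2 matrix
arr[, 2, 1]                      ## second column of the first slice
arr[1, 2, 2]                     ## a single element
dim(arr[, , 1, drop = FALSE])    ## drop = FALSE keeps all three dimensions
```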
### Exercises
1. Use the dimension attribute to make a function that computes the sum of every n elements of a vector. In which order are matrix elements stored? [Which special cases should you consider?]
```{r}
advr38pkg::sum_every(1:10, 2)
```
2. Compute the mean of every numeric column of the `iris` dataset. Expected result:
```{r, echo=FALSE}
colMeans(iris[sapply(iris, is.numeric)])
```
3. Convert the following matrix to a vector by replacing (0, 0) -> 0; (0, 1) -> 1; (1, 1) -> 2; (1, 0) -> NA.
```{r}
mat <- matrix(0, 10, 2); mat[c(5, 8, 9, 12, 15, 16, 17, 19)] <- 1; mat
```
by using this matrix:
```{r}
(decode <- matrix(c(0, NA, 1, 2), 2))
```
Start by doing it for one row, then use `apply()`, and finally replace it with a special accessor. What is the benefit?
Expected result:
```{r, echo=FALSE}
decode[mat + 1]
```
## Useful R base functions
In this section, I present some useful R base functions (also see [this comprehensive list in French](https://cran.r-project.org/doc/contrib/Kauffmann_aide_memoire_R.pdf) and [this one in English](https://github.com/peterhurford/adv-r-book-solutions/blob/master/03_vocab/functions.r)):
### General
```{r, eval=FALSE}
# To get some help
?topic
# Run code from the example section
example(sum)
```
```{r}
# Structure overview
str(iris) ## skimr::skim(iris) is also very useful
# List objects in the environment
ls()
# Remove objects from the environment
rm(list = ls()) ## remove all objects in the global environment
```
```{r}
# For a particular method, list available implementations for different classes
methods(summary)
# List methods available for a particular class
methods(class = "lm")
```
```{r}
# Call a function with arguments as a list
(list_of_int <- as.list(1:5))
do.call('c', list_of_int)
```
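Named elements of the list are passed as named arguments, which is convenient when the arguments are built programmatically. A small sketch (the list `paste_args` is made up for illustration):

```{r}
paste_args <- list("a", "b", "c", sep = "-")
do.call(paste, paste_args)          ## same as paste("a", "b", "c", sep = "-")
do.call(rbind, list(1:3, 4:6))      ## bind a list of vectors into a matrix
```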
### Sequence and vector operations
```{r}
1:10 ## of type integer
seq(1, 10, by = 2) ## of type double
seq(1, 100, length.out = 10)
seq_len(5)
seq_along(21:24)
rep(1:4, 2)
rep(1:4, each = 2)
rep(1:4, 4:1)
rep_len(1:3, 8)
replicate(5, rnorm(10)) ## How to use a multiline expression?
```
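Regarding the question in the last comment, one option is to wrap several statements in braces, since `replicate()` re-evaluates its second argument at each replication. A minimal sketch (the names `sim_means` and `draws` are illustrative):

```{r}
sim_means <- replicate(5, {
  draws <- rnorm(10)   ## a new sample at each replication
  mean(draws)          ## the value of the last expression is collected
})
length(sim_means)
```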
```{r}
sort(c(1, 6, 8, 2, 2))
order(c(1, 6, 8, 2, 2), c(0, 0, 0, 2, 1))
rank(c(1, 6, 8, 2, 2))
rank(c(1, 6, 8, 2, 2), ties.method = "first")
sort(c("a1", "a2", "a10"))
gtools::mixedsort(c("a1", "a2", "a10")) ## not in base, but useful
which.max(c(1, 5, 3, 6, 2, 0))
which.min(c(1, 5, 3, 6, 2, 0))
unique(c(1, NA, 2, 3, 2, NA, 3))
table(rep(1:4, 4:1))
table(A = c(1, 1, 1, 2, 2), B = c(1, 2, 1, 2, 1))
sample(10)
sample(3:10, 5)
sample(3:10, 50, replace = TRUE)
```
```{r}
round(x <- runif(10, max = 100)) ## 10 random numbers between 0 and 100
round(x, digits = 2)
round(x, -1)
pmin(1:4, 4:1)
pmax(1:4, 4:1)
outer(1:4, 1:3, '+')
expand.grid(param1 = c(5, 50), param2 = c(1, 3, 10))
```
Also see [this nice Q/A on grouping functions and the *apply family](https://stackoverflow.com/questions/3505701/grouping-functions-tapply-by-aggregate-and-the-apply-family) and [this book chapter about looping](https://bookdown.org/rdpeng/rprogdatascience/loop-functions.html).
### Character operations
```{r}
paste("I", "am", "me")
paste0("test", 0)
paste0("PC", 1:10)
me <- "Florian"
glue::glue("I am {me}") ## not in base, but so useful
(x <- list.files(pattern = "\\.Rmd$", full.names = TRUE))
sub("\\.Rmd$", ".pdf", x)
(y <- sample(letters[1:4], 10, replace = TRUE))
match(y, letters[1:4])
y %in% letters[1:2]
split(1:12, rep(letters[1:3], 4))
intersect(letters[1:4], letters[3:5])
union(letters[1:4], letters[3:5])
setdiff(letters[1:4], letters[3:5])
```
### Logical operators
```{r, error=TRUE}
TRUE | stop("will go there")
TRUE || stop("won't go there") ## won't evaluate second condition if first one is TRUE
c(TRUE, FALSE, TRUE, TRUE) & c(FALSE, TRUE, TRUE, FALSE)
c(TRUE, FALSE, TRUE, TRUE) && c(FALSE, TRUE, TRUE, FALSE) ## /!\ an error since R 4.3.0; older versions silently used only the first elements
```
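Short-circuit evaluation makes `&&` well suited to guard conditions, where a later test is only valid if an earlier one passed. A sketch with a hypothetical helper `is_positive_number()`:

```{r}
is_positive_number <- function(v) {
  is.numeric(v) && length(v) == 1 && v > 0
}
is_positive_number(3)
is_positive_number(NULL)    ## FALSE; `v > 0` is never evaluated
is_positive_number(1:2)     ## FALSE; stops at the length check
```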
```{r}
(x <- rnorm(10))
ifelse(x > 0, x, -x) # try to find two other equivalents
```
Beware of `ifelse()` (learn more [there](https://privefl.github.io/blog/On-the-ifelse-function/)): its result always has the same length as its first argument. For example:
```{r}
ifelse(FALSE, 0, 1:5)
`if`(FALSE, 0, 1:5)
if (FALSE) 0 else 1:5
```
### Exercises
1. Use `sample()`, `rep_len()` and `split()` to make a function that randomly splits some indices into a list of `K` groups of indices (as in cross-validation). [Which special cases should you consider?]
```{r}
advr38pkg::split_ind(1:40, 3)
```
1. Use `replicate()` and `sample()` to get a 95% confidence interval (using bootstrapping) for the mean of the following vector:
```{r}
set.seed(1)
(x <- rnorm(10))
mean(x)
```
Expected output (approximately):
```{r, echo=FALSE}
quantile(replicate(1e6, mean(sample(x, replace = TRUE))), probs = c(0.025, 0.975))
```
1. Use `match()` and some special accessor to add a column "my_val" to the data frame `my_mtcars`, containing for each row the value of the column named in "my_col". [Can your solution be used for any number of column names?]
```{r}
my_mtcars <- mtcars[c("mpg", "hp")]
my_mtcars$my_col <- sample(c("mpg", "hp"), size = nrow(my_mtcars), replace = TRUE)
head(my_mtcars)
```
Expected result (head):
```{r, echo=FALSE}
ind <- cbind(seq_len(nrow(my_mtcars)),
             match(my_mtcars[["my_col"]], names(my_mtcars)))
my_mtcars$my_val <- my_mtcars[ind]
head(my_mtcars)
```
1. In the following data frame (recall that a data frame is also a list), for the first 3 columns, replace letters by corresponding numbers based on the `code`:
```{r}
df <- data.frame(
  id1 = c("a", "f", "a"),
  id2 = c("b", "e", "e"),
  id3 = c("c", "d", "f"),
  inter = c(7.343, 2.454, 3.234),
  stringsAsFactors = FALSE
)
df
(code <- setNames(1:6, letters[1:6]))
```
Expected result:
```{r, echo=FALSE}
df[-4] <- lapply(df[-4], function(col) code[col])
df
```
## Environments and scoping
Lexical scoping determines where to look for values, not when to look for them. R looks for values when the function is run, not when it’s created. This means that the output of a function can be different depending on objects outside its environment:
```{r}
h <- function() {
  x <- 10
  f <- function() {
    x + 1
  }
  f()
}
```
```{r}
x <- 100
h()
```
Variable `x` is not defined inside `f` so R will look at the environment of `f` (where `f` was defined) and then at the parent environment, and so on. Here, the first `x` that is found has value `10`.
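The other point made above, that values are looked up when the function runs rather than when it is created, can be seen in a minimal sketch (the names `add_offset` and `offset_value` are invented for illustration):

```{r}
add_offset <- function(v) v + offset_value   ## offset_value is not defined in the function
offset_value <- 1
add_offset(10)
offset_value <- 100
add_offset(10)   ## same call, different result
```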
Be aware that, for functions, package environments are searched last, so you can mask an existing function without noticing.
```{r}
c <- function(...) paste0(...)
c(1, 2, 3)
base::c(1, 2, 3) ## you need to specify the package explicitly
rm(c) ## remove the new function from the environment
c(1, 2, 3)
```
You can use the `<<-` operator to change the value of an object in an upper environment:
```{r}
count1 <- 0
count2 <- 0
f <- function(i) {
  count1 <- count1 + 1 ## creates a new local count1 at each call; the global one is unchanged
  count2 <<- count2 + 1 ## increments count2 in the enclosing (here, global) environment
  i + 1
}
sapply(1:10, f)
c(count1, count2)
```
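A typical use of `<<-` is inside a function that returns another function, so that some state is kept between calls without using a global variable. A sketch, with the hypothetical names `make_counter`, `counter_a` and `counter_b`:

```{r}
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1   ## modifies `count` in the enclosing function environment
    count
  }
}
counter_a <- make_counter()
counter_b <- make_counter()
counter_a()
counter_a()
counter_b()   ## each counter has its own `count`
```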
Finally, how does `...` work? Basically, you copy and paste what is put in `...`:
```{r}
f1 <- function(...) {
  list(...)
}
f1(a = 2, b = 3)
list(a = 2, b = 3)
```
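The same mechanism is commonly used to forward extra arguments to another function. A small sketch, where `mean_rounded()` is a made-up wrapper around `mean()`:

```{r}
mean_rounded <- function(v, digits = 1, ...) {
  round(mean(v, ...), digits)   ## `...` is forwarded to mean()
}
mean_rounded(c(1.234, NA, 3.456))                 ## NA, since na.rm is FALSE by default
mean_rounded(c(1.234, NA, 3.456), na.rm = TRUE)   ## na.rm goes through `...`
```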
Learn more about [functions](https://bookdown.org/rdpeng/rprogdatascience/functions.html) and [scoping rules of R](https://bookdown.org/rdpeng/rprogdatascience/scoping-rules-of-r.html) with the [R Programming for Data Science book](https://bookdown.org/rdpeng/rprogdatascience/).
## Attributes and classes
Attributes are metadata associated with an object. You can get/set the list of attributes with `attributes()` or one particular attribute with `attr()`.
```{r}
attributes(iris)
class(iris)
attr(iris, "row.names")
```
You can use `structure()` to create an object and add some arbitrary attributes.
```{r}
structure(1:10, my_fancy_attribute = "blabla")
```
There are also some attributes with specific accessor functions to get and set values. For example, use `names(x)`, `dim(x)` and `class(x)` instead of `attr(x, "names")`, `attr(x, "dim")` and `attr(x, "class")`.
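As a short sketch of both kinds of attributes (the vector `v` and the attribute name `"unit"` are arbitrary), note that custom attributes are generally lost when subsetting, whereas names are kept:

```{r}
v <- c(a = 1, b = 2)
identical(names(v), attr(v, "names"))
attr(v, "unit") <- "cm"    ## "unit" is an arbitrary attribute name
attributes(v)
attributes(v[1])           ## names are kept when subsetting, "unit" is dropped
```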
***
```{r}
class(mylm <- lm(Sepal.Length ~ ., data = iris))
```
I've just fitted a linear model in order to predict the sepal length variable of the `iris` dataset based on the other variables. Using `lm()` gets me an object of class `lm`. What are the methods I can use for this object?
```{r}
methods(class = class(mylm))
summary(mylm)
plot(mylm)
```
***
The simplest way to create a class in R, and to define methods for objects of this class, is the S3 system. If you want to know more about the other object systems, see the [Advanced R book](https://adv-r.hadley.nz/).
```{r}
agent007 <- list(first = "James", last = "Bond")
agent007
```
```{r}
class(agent007) <- "Person" ## "agent007" is now an object of class "Person"
# Just make a function called <method_name>.<class_name>()
print.Person <- function(x, ...) {
  print(glue::glue("My name is {x$last}, {x$first} {x$last}."))
  invisible(x)
}
agent007
```
```{r}
# Constructor of class as simple function
Person <- function(first, last) {
  structure(list(first = first, last = last), class = "Person")
}
(me <- Person("Florian", "Privé"))
```
An object can have many classes:
```{r}
Worker <- function(first, last, job) {
  obj <- Person(first, last)
  obj$job <- job
  class(obj) <- c("Worker", class(obj))
  obj
}
print.Worker <- function(x, ...) {
  print.Person(x)
  print(glue::glue("I am a {x$job}."))
  invisible(x)
}
(worker_007 <- Worker("James", "Bond", "secret agent"))
(worker_me <- Worker("Florian", "Privé", "researcher"))
```
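When an object has several classes, method dispatch tries each class in turn, from left to right, and `NextMethod()` calls the method of the next class. Here is a self-contained sketch; the generic `describe()` and the classes `dog` and `animal` are invented for illustration and are unrelated to the classes above:

```{r}
describe <- function(obj, ...) UseMethod("describe")   ## a new S3 generic
describe.default <- function(obj, ...) "something"
describe.animal <- function(obj, ...) "an animal"
describe.dog <- function(obj, ...) paste("a dog, which is", NextMethod())

rex <- structure(list(name = "Rex"), class = c("dog", "animal"))
describe(rex)             ## uses describe.dog, then describe.animal
inherits(rex, "animal")
describe(1:3)             ## no specific method, falls back to the default
```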