
# R programming

This chapter covers parts of base R that I find important and that many R users overlook or do not know about.

Learn more with the [Advanced R book](https://adv-r.hadley.nz/).

```{r, include=FALSE}
source("knitr-options.R")
source("spelling-check.R")
```


## Common mistakes

> If you are using R and you think you're in hell, [this is a map](http://www.burns-stat.com/pages/Tutor/R_inferno.pdf) for you. 
>
> -- Patrick Burns

### Equality

```{r}
(0.1 + 0.2) == 0.3
print(c(0.1, 0.2, 0.3), digits = 20)
all.equal(0.1 + 0.2, 0.3)  ## equality with some tolerance
all.equal(0.1 + 0.2, 0.3, tolerance = 0)
all.equal(0.1 + 0.2, 0.4)
isTRUE(all.equal(0.1 + 0.2, 0.4))  ## if you want a boolean, use isTRUE()
dplyr::near(0.1 + 0.2, 0.3)  ## similar, from the {dplyr} package
```

### Arguments

```{r}
min(-1, 5, 118)
max(-1, 5, 118)
mean(-1, 5, 118)
median(-1, 5, 118)
```

How to explain the issue with `mean` and `median`? Let us look at the parameters of these functions:

```{r}
args(max)
args(mean)
args(median)
```
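As a minimal sketch of what goes wrong: only the first value is matched to `x`, and the other values are passed on to the remaining arguments of the default methods. The explicit argument names below assume the usual signature of `mean.default()` (`x`, `trim`, `na.rm`), which the first line prints so that you can check it.

```{r}
args(mean.default)
## only -1 is used as data; 5 and 118 are matched to `trim` and `na.rm`
identical(mean(-1, 5, 118), mean(x = -1, trim = 5, na.rm = 118))
## for median(), 5 is matched to `na.rm` and 118 is absorbed by `...`
identical(median(-1, 5, 118), median(x = -1, na.rm = 5))
```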

One solution is to always use a vector:

```{r}
min(c(-1, 5, 118))
max(c(-1, 5, 118))
mean(c(-1, 5, 118))
median(c(-1, 5, 118))
```

### Others

```{r}
sample(1:10)
sample(10)
sample(10.1)
```

```{r}
n <- 10
1:n-1  ## is (1:n) - 1, so 0:(n - 1)
1:(n-1)
seq_len(n - 1)
1:0
seq_len(0)  ## prefer using seq_len(n) rather than 1:n (e.g. in for-loops)
seq_along(5:7)  ## a shortcut for seq_len(length(.))
```
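To illustrate why `seq_len()` is safer in for-loops, here is a small sketch where the length happens to be zero (the name `n_empty` is only used for this illustration):

```{r}
n_empty <- 0  ## e.g. the length of an empty input
for (i in 1:n_empty) print(i)         ## runs twice: prints 1, then 0
for (i in seq_len(n_empty)) print(i)  ## does not run at all
```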


## R base objects

### Types

There are several "atomic" types of data: `logical`, `integer`, `double` and `character` (in this order, see below). There are also `raw` and `complex`, but they are rarely used.

You cannot mix types in an atomic vector, but you can in a list. Coercion will automatically occur when you mix types in a vector:

```{r}
(a <- FALSE)
typeof(a)

(b <- 1:10)
typeof(b)
c(a, b)  ## FALSE is coerced to an integer -> 0

(c <- 10.5)
typeof(c)
(d <- c(b, c))  ## coerced to double

c(d, "a")  ## coerced to character

c(list(1), "a")

50 < "7"  ## 50 is coerced to "50", so this tests "50" < "7"
```

### Exercise

Use the automatic type coercion to convert this boolean matrix to a numeric one (with 0s and 1s). [What do you need to change in your code to get an integer matrix instead of a numeric one?]

```{r}
(mat <- matrix(sample(c(TRUE, FALSE), 12, replace = TRUE), nrow = 3))
```


## Base objects and accessors

### Objects

- "atomic" vector: vector of one base type (see above).

- scalar: R has no scalar type; a "scalar" is just a vector of length 1.

- matrices / arrays: **a vector** with some dimensions (attribute).

```{r}
(vec <- 1:12)
dim(vec) <- c(3, 4)
vec
class(vec)
dim(vec) <- c(3, 2, 2)
vec
class(vec)
```

- list: a vector whose elements can have different types.

- data.frame: **a list** whose elements all have the same length, and which is formatted somewhat like a matrix.

```{r}
head(iris)
dim(iris)
length(iris)  ## a data.frame is also a list
```

### Accessors

1. The `[` accessor is used to access a subset of the data **with the same class**.

```{r}
(x <- 1:5)
x[2:3]
x[2:8]  ## /!\ no warning
(y <- matrix(1:12, nrow = 3))
y[4:9]  ## a matrix is also a vector
(l <- list(a = 1, b = "I love R", c = matrix(1:6, nrow = 2)))
l[2:3]
head(iris)
head(iris[3:4])
class(iris[5])
```

You can also use logical and character vectors to index these objects.

```{r}
(x <- 1:4)
x[c(FALSE, TRUE, FALSE, TRUE)]
x[c(FALSE, TRUE)]  ## logical vectors are recycled
head(iris[c("Petal.Length", "Species")])
```

2. The `[[` accessor is used to access **a single element**.

```{r}
(x <- 1:10)
x[[3]]
l[[2]]
iris[["Species"]]
```

```{r, echo=FALSE, fig.cap="Indexing lists in R. [Source: https://goo.gl/8UkcHq]"}
knitr::include_graphics("https://pbs.twimg.com/media/DQ5en8XWAAICIaJ.jpg")
```

3. Beware of partial matching.

```{r}
x <- list(aardvark = 1:5)
x$a
x[["a"]]
x[["a", exact = FALSE]]
```

4. Special use of the `[` accessor for array-like data.

```{r}
(mat <- matrix(1:12, 3))
mat[1, ]
mat[, 1:2]
mat[1, 1:2]
mat[1, 1:2, drop = FALSE]
(two_col_ind <- cbind(c(1, 3, 2), c(1, 4, 2)))
mat[two_col_ind]
mat[]
mat[] <- 2
mat
```

If you use arrays with more than two dimensions, simply add an additional comma for every new dimension.
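For example, a small sketch with a three-dimensional array (the object `arr` is only used for this illustration):

```{r}
(arr <- array(1:24, dim = c(2, 3, 4)))
arr[1, , ]                     ## a 3 x 4 matrix (the first dimension is dropped)
arr[1, 2, 3]                   ## a single element
dim(arr[, , 1, drop = FALSE])  ## all three dimensions are kept
```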

### Exercises

1. Use the dimension attribute to make a function that computes the sum of every n elements of a vector. In which order are matrix elements stored? [What special cases should you consider?]

    ```{r}
    advr38pkg::sum_every(1:10, 2)
    ```

2. Compute the mean of every numeric column of the `iris` dataset. Expected result:

    ```{r, echo=FALSE}
    colMeans(iris[sapply(iris, is.numeric)])
    ```

3. Convert the following matrix to a vector by replacing (0, 0) -> 0; (0, 1) -> 1; (1, 1) -> 2; (1, 0) -> NA.

    ```{r}
    mat <- matrix(0, 10, 2); mat[c(5, 8, 9, 12, 15, 16, 17, 19)] <- 1; mat
    ```

    by using this matrix:
    
    ```{r}
    (decode <- matrix(c(0, NA, 1, 2), 2))
    ```
    
    Start by doing it for one row, then use `apply()`, and finally replace `apply()` with a special accessor. What is the benefit?
    
    Expected result:
    
    ```{r, echo=FALSE}
    decode[mat + 1]
    ```


## Useful R base functions

In this section, I present some useful R base functions (also see [this comprehensive list in French](https://cran.r-project.org/doc/contrib/Kauffmann_aide_memoire_R.pdf) and [this one in English](https://github.com/peterhurford/adv-r-book-solutions/blob/master/03_vocab/functions.r)):

### General

```{r, eval=FALSE}
# To get some help
?topic

# Run code from the example section
example(sum)
```

```{r}
# Structure overview
str(iris)  ## skimr::skim(iris) is also very useful

# List objects in the environment
ls()

# Remove objects from the environment
rm(list = ls())  ## remove all objects in the global environment
```

```{r}
# For a particular generic, list the methods available for different classes
methods(summary)
# List methods available for a particular class
methods(class = "lm")
```

```{r}
# Call a function with arguments as a list
(list_of_int <- as.list(1:5))
do.call('c', list_of_int)
```


### Sequence and vector operations

```{r}
1:10  ## of type integer
seq(1, 10, by = 2)  ## of type double
seq(1, 100, length.out = 10)
seq_len(5)
seq_along(21:24)
rep(1:4, 2)
rep(1:4, each = 2)
rep(1:4, 4:1)
rep_len(1:3, 8)
replicate(5, rnorm(10))  ## How to use a multiline expression?
```

```{r}
sort(c(1, 6, 8, 2, 2))
order(c(1, 6, 8, 2, 2), c(0, 0, 0, 2, 1))
rank(c(1, 6, 8, 2, 2))
rank(c(1, 6, 8, 2, 2), ties.method = "first")
sort(c("a1", "a2", "a10"))
gtools::mixedsort(c("a1", "a2", "a10"))  ## not in base, but useful
which.max(c(1, 5, 3, 6, 2, 0))
which.min(c(1, 5, 3, 6, 2, 0))
unique(c(1, NA, 2, 3, 2, NA, 3))
table(rep(1:4, 4:1))
table(A = c(1, 1, 1, 2, 2), B = c(1, 2, 1, 2, 1))
sample(10)
sample(3:10, 5)
sample(3:10, 50, replace = TRUE)
```

```{r}
round(x <- runif(10, max = 100))  ## 10 random numbers between 0 and 100
round(x, digits = 2)
round(x, -1)
pmin(1:4, 4:1)
pmax(1:4, 4:1)
outer(1:4, 1:3, '+')
expand.grid(param1 = c(5, 50), param2 = c(1, 3, 10))
```

Also see [this nice Q/A on grouping functions and the *apply family](https://stackoverflow.com/questions/3505701/grouping-functions-tapply-by-aggregate-and-the-apply-family) and [this book chapter about looping](https://bookdown.org/rdpeng/rprogdatascience/loop-functions.html).

### Character operations

```{r}
paste("I", "am", "me")
paste0("test", 0)
paste0("PC", 1:10)
me <- "Florian"
glue::glue("I am {me}")  ## not in base, but so useful
(x <- list.files(pattern = "\\.Rmd$", full.names = TRUE))
sub("\\.Rmd$", ".pdf", x)
(y <- sample(letters[1:4], 10, replace = TRUE))
match(y, letters[1:4])
y %in% letters[1:2]
split(1:12, rep(letters[1:3], 4))
intersect(letters[1:4], letters[3:5])
union(letters[1:4], letters[3:5])
setdiff(letters[1:4], letters[3:5])
```

### Logical operators

```{r, error=TRUE}
TRUE | stop("will go there")
TRUE || stop("won't go there")  ## won't evaluate second condition if first one is TRUE
c(TRUE, FALSE, TRUE, TRUE) & c(FALSE, TRUE, TRUE, FALSE) 
c(TRUE, FALSE, TRUE, TRUE) && c(FALSE, TRUE, TRUE, FALSE)  ## /!\ an error since R 4.3.0; before R 4.2.0, only the first elements were used, without any warning
```

```{r}
(x <- rnorm(10))
ifelse(x > 0, x, -x)  # try to find two other equivalents
```

Be careful with `ifelse()` (learn more [here](https://privefl.github.io/blog/On-the-ifelse-function/)); for example:

```{r}
ifelse(FALSE, 0, 1:5)
`if`(FALSE, 0, 1:5)
if (FALSE) 0 else 1:5
```

### Exercises

1. Use `sample()`, `rep_len()` and `split()` to make a function that randomly splits some indices into a list of `K` groups of indices (as for cross-validation). [What special cases should you consider?]

    ```{r}
    advr38pkg::split_ind(1:40, 3)
    ```

1. Use `replicate()` and `sample()` to get a 95% confidence interval (using bootstrapping) for the mean of the following vector:

    ```{r}
    set.seed(1)
    (x <- rnorm(10))
    mean(x)
    ```
   
    Expected output (approximately): 
    
    ```{r, echo=FALSE}
    quantile(replicate(1e6, mean(sample(x, replace = TRUE))), probs = c(0.025, 0.975))
    ```

1. Use `match()` and a special accessor to add a column `my_val` to the data frame `my_mtcars`, containing for each row the value of the column named in `my_col`. [Can your solution be used for any number of column names?]

    ```{r}
    my_mtcars <- mtcars[c("mpg", "hp")]
    my_mtcars$my_col <- sample(c("mpg", "hp"), size = nrow(my_mtcars), replace = TRUE)
    head(my_mtcars)
    ```

    Expected result (head):
    
    ```{r, echo=FALSE}
    ind <- cbind(seq_len(nrow(my_mtcars)), 
                 match(my_mtcars[["my_col"]], names(my_mtcars)))
    my_mtcars$my_val <- my_mtcars[ind]
    head(my_mtcars)
    ```

1. In the following data frame (recall that a data frame is also a list), for the first 3 columns, replace letters by corresponding numbers based on the `code`:

    ```{r}
    df <- data.frame(
      id1 = c("a", "f", "a"),
      id2 = c("b", "e", "e"), 
      id3 = c("c", "d", "f"),
      inter = c(7.343, 2.454, 3.234),
      stringsAsFactors = FALSE
    )
    df
    (code <- setNames(1:6, letters[1:6]))
    ```
    
    Expected result:
    
    ```{r, echo=FALSE}
    df[-4] <- lapply(df[-4], function(col) code[col])
    df
    ```


## Environments and scoping

Lexical scoping determines where to look for values, not when to look for them. R looks for values when the function is run, not when it’s created. This means that the output of a function can be different depending on objects outside its environment:

```{r}
h <- function() {
  x <- 10
  f <- function() {
    x + 1
  }
  f()
}
```

```{r}
x <- 100
h()
```

Variable `x` is not defined inside `f`, so R looks in the environment where `f` was defined (here, the body of `h`), then in the parent of that environment, and so on. Here, the first `x` that is found has value `10`, not `100`.
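To see that values are looked up when the function is run, not when it is created, here is a minimal sketch with a hypothetical function `g()` that relies on a variable `z` defined outside of it:

```{r}
g <- function() z + 1  ## `z` is not defined inside `g`
z <- 1
g()
z <- 2
g()  ## the current value of `z` is looked up at each call
```

This is why relying on objects defined outside a function can make its output hard to predict; passing them as arguments is usually safer.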

Be aware that, for functions, package environments are searched last, so you can mask a function from a package without noticing.

```{r}
c <- function(...) paste0(...)
c(1, 2, 3)
base::c(1, 2, 3)  ## you need to specify the package explicitly
rm(c)  ## remove the new function from the environment
c(1, 2, 3)
```

You can use the `<<-` operator to change the value of an object in an enclosing environment:

```{r}
count1 <- 0
count2 <- 0
f <- function(i) {
  count1 <-  count1 + 1  ## creates a new local count1 at each call
  count2 <<- count2 + 1  ## increments count2 in the global environment
  i + 1
}
sapply(1:10, f)
c(count1, count2)
```
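Note that `<<-` does not necessarily assign in the global environment: it modifies the variable in the first enclosing environment where it finds one. As a sketch, a hypothetical `make_counter()` (not part of base R) returns a function that keeps its own private count:

```{r}
make_counter <- function() {
  count <- 0  ## lives in the environment created by this call
  function() {
    count <<- count + 1  ## modifies the `count` defined just above
    count
  }
}
counter <- make_counter()
counter()
counter()
```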

Finally, how does `...` work? Basically, whatever is passed in `...` is pasted as is where `...` appears in the function body:

```{r}
f1 <- function(...) {
  list(...)
}
f1(a = 2, b = 3)
list(a = 2, b = 3)
```
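A common use of `...` is to forward extra arguments to another function. A minimal sketch, where `my_mean()` is a hypothetical wrapper used only for illustration:

```{r}
my_mean <- function(x, ...) {
  mean(x, ...)  ## every extra argument is forwarded to mean()
}
my_mean(c(1, NA, 3))
my_mean(c(1, NA, 3), na.rm = TRUE)
```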

Learn more about [functions](https://bookdown.org/rdpeng/rprogdatascience/functions.html) and [scoping rules of R](https://bookdown.org/rdpeng/rprogdatascience/scoping-rules-of-r.html) with the [R Programming for Data Science book](https://bookdown.org/rdpeng/rprogdatascience/).


## Attributes and classes

Attributes are metadata associated with an object. You can get/set the list of attributes with `attributes()` or one particular attribute with `attr()`.

```{r}
attributes(iris)
class(iris)
attr(iris, "row.names")
```

You can use `structure()` to create an object and add some arbitrary attributes.

```{r}
structure(1:10, my_fancy_attribute = "blabla")
```

There are also some attributes with specific accessor functions to get and set values. For example, use `names(x)`, `dim(x)` and `class(x)` instead of `attr(x, "names")`, `attr(x, "dim")` and `attr(x, "class")`.
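As a small sketch of both ways of setting attributes (the object `v` and the attribute name `my_note` are made up for this illustration):

```{r}
v <- 1:6
dim(v) <- c(2, 3)  ## preferred over attr(v, "dim") <- c(2, 3)
v
identical(dim(v), attr(v, "dim"))        ## both access the same attribute
attr(v, "my_note") <- "set with attr()"  ## arbitrary attributes need attr()
attributes(v)
```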

***

```{r}
class(mylm <- lm(Sepal.Length ~ ., data = iris))
```

I've just fitted a linear model in order to predict the sepal length variable of the `iris` dataset based on the other variables. Using `lm()` gets me an object of class `lm`. What are the methods I can use for this object?

```{r}
methods(class = class(mylm))
summary(mylm)
plot(mylm)
```

***

The simplest object system in R is called S3; it makes it very easy to create a class and to define methods for objects of this class. If you want to know more about the other object systems, see the [Advanced R book](https://adv-r.hadley.nz/).

```{r}
agent007 <- list(first = "James", last = "Bond")
agent007
```

```{r}
class(agent007) <- "Person"  ## "agent007" is now an object of class "Person"
# Just make a function called <method_name>.<class_name>()
print.Person <- function(x, ...) {  ## same arguments as the print() generic
  print(glue::glue("My name is {x$last}, {x$first} {x$last}."))
  invisible(x)
}

agent007
```

```{r}
# Constructor of class as simple function
Person <- function(first, last) {
  structure(list(first = first, last = last), class = "Person")
}
(me <- Person("Florian", "Privé"))
```

An object can have many classes:

```{r}
Worker <- function(first, last, job) {
  obj <- Person(first, last)
  obj$job <- job
  class(obj) <- c("Worker", class(obj))
  obj
}
print.Worker <- function(x, ...) {
  print.Person(x)  ## reuse the method of the parent class
  print(glue::glue("I am a {x$job}."))
  invisible(x)
}

(worker_007 <- Worker("James", "Bond", "secret agent"))
(worker_me <- Worker("Florian", "Privé", "researcher"))
```
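The order of the classes matters: R uses the method of the first class that has one. To test whether an object has a given class, wherever it is in the class vector, you can use `inherits()`. A small sketch with a hypothetical object `obj`:

```{r}
obj <- structure(list(job = "tester"), class = c("Worker", "Person"))
class(obj)
inherits(obj, "Person")  ## TRUE, even though "Person" is not the first class
inherits(obj, "lm")
```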