# Part IV: Performance

Laurent Gatto
- Benchmarking
- Profiling
- Optimisation
- Memory
- Rcpp
> We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.

Knuth, Donald. Structured Programming with go to Statements, ACM Computing Surveys, Vol. 6, No. 4, Dec. 1974, p. 268.
Robert Gentleman, in *R Programming for Bioinformatics*, 2008, about R's built-in C interfaces:

> Since R is not compiled, in some situations its performance can be substantially improved by writing code in a compiled language. There are also reasons not to write code in other languages, and in particular we caution against premature optimization: prototyping in R is often cost effective. And in our experience very few routines need to be implemented in other languages for efficiency reasons. Another substantial reason not to use an implementation in some other language is increased complexity. The use of another language almost always results in higher maintenance costs and less stability. In addition, any extensions or enhancements of the code will require someone that is proficient in both R and the other language.

(`Rcpp` does make some of the above cautionary statements slightly less critical.)
R is not a fast language, but most of the time it is fast enough for what we want to do, in particular for interactive data analysis. In such cases a slow but expressive and flexible language is far better than a fast but less expressive and less flexible alternative. It is also relatively easy to avoid the bad R programming idioms that make code slow.
Let's compare two implementations of the square root calculation: `sqrt(x)` and `x ^ 0.5`.

```r
x <- runif(100)
system.time(sqrt(x))
system.time(x^0.5)
```
Does this work? Try

```r
x <- runif(1e5)
system.time(sqrt(x))
system.time(x^0.5)
```
We want to repeat the timings multiple times:

```r
summary(replicate(10, system.time(x^0.5)[["elapsed"]]))
```
A better approach for such cases is the `microbenchmark` package, which is ideal for accurately benchmarking small pieces of code, in particular sub-millisecond (down to nanosecond) executions (see the units used below). Each expression is run 100 times (controlled by the `times` argument). In addition, the execution order is randomised and summary timings are reported.
```r
x <- runif(100)
library(microbenchmark)
microbenchmark(sqrt(x),
               x ^ 0.5)
```
Question: where does this difference come from?
Below are some illustrations of R's language design and implementation choices that account for part of its performance penalty (taken from *Advanced R*).
R is an extremely dynamic language in which almost any symbol can be modified at any point; the exceptions are a few reserved keywords (such as `TRUE`, `if`, `for`, ...), package namespaces, and locked environments with locked bindings. In particular, one can modify:

- the body, arguments, and environment of functions
```r
f <- function() 1
f()
body(f)
body(f) <- 2
f()
formals(f) <- pairlist(x = 1)
body(f) <- quote(x + 1)
f()
f(10)
```
- the S4 methods of a generic, with `setMethod(...)`, as sketched below
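A minimal sketch of such a run-time redefinition (the `Seq` class and `describe` generic are made up for illustration):

```r
setClass("Seq", slots = c(seq = "character"))
setGeneric("describe", function(x) standardGeneric("describe"))
setMethod("describe", "Seq", function(x) "a sequence")
s <- new("Seq", seq = "ATGC")
describe(s)   ## "a sequence"
## the method can be replaced at any point
setMethod("describe", "Seq", function(x) "something else entirely")
describe(s)   ## "something else entirely"
```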
- the fields of an S3 object, adding new ones at will
```r
ctl <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14)
trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69)
group <- gl(2, 10, 20, labels = c("Ctl", "Trt"))
weight <- c(ctl, trt)
mod <- lm(weight ~ group)
mod
mod$foo <- 1
mod
summary(mod)
names(mod)
```
- the class of an object
```r
class(mod) <- c("mod", "foo")
```
- objects outside of the local environment, with `<<-`, or in any environment with `assign(..., envir)`, as sketched below
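A minimal sketch of these non-local modifications (the `counter` variable is made up for illustration):

```r
counter <- 0
increment <- function()
    counter <<- counter + 1  ## modifies the binding in the enclosing scope
increment()
counter                      ## 1
assign("counter", 100, envir = globalenv())
counter                      ## 100
```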
While this dynamism provides a lot of flexibility, it means that the interpreter cannot make many assumptions or optimisations. For example, let's assess the cost of method lookup:
```r
## function
f <- function(x) NULL
## S3 method
s3 <- function(x) UseMethod("s3")
s3.integer <- f
## S4 method
A <- setClass("A", slots = c(a = "list"))
setGeneric("s4", function(x) standardGeneric("s4"))
setMethod("s4", "A", f)
a <- A()
## S4 Reference Class
B <- setRefClass("B", methods = list(rc = f))
b <- B$new()

microbenchmark(
    fun = f(),
    S3 = s3(1L),
    S4 = s4(a),
    RC = b$rc()
)
```
Finding the value of a symbol is also difficult, given that everything can be modified (extreme dynamism, see above) and that R uses lexical scoping.
```r
a <- 1
f <- function() {
    g <- function() {
        print(a) ## from the global workspace
        assign("a", 2, envir = parent.frame())
        print(a) ## f's environment
        a <- 3
        print(a) ## g's environment
    }
    g()
}
f()
## [1] 1
## [1] 2
## [1] 3
```
And by everything, we mean really nearly everything: `+`, `^`, `(`, and `{` are all functions that need to be looked up in

```r
f <- function(x, y) {
    (x + y) ^ 2
}
```
Another example: there are many ways to extract a single value from a data frame, with substantial differences in cost:

```r
microbenchmark(
    "[32, 11]" = mtcars[32, 11],
    "$carb[32]" = mtcars$carb[32],
    "[[c(11, 32)]]" = mtcars[[c(11, 32)]],
    "[[11]][32]" = mtcars[[11]][32],
    ".subset" = .subset(mtcars, 11)[32],
    ".subset2" = .subset2(mtcars, 11)[32]
)
```
Alternative R implementations, such as pqR, Renjin, FastR, and Riposte: several of these projects implement deferred evaluation, i.e. evaluations are only executed if and when they are really needed. We will see (and implement) an example in the `Rcpp` section.
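Note that R itself already defers the evaluation of function arguments (lazy evaluation); a minimal illustration:

```r
f <- function(x, y) x  ## y is never used ...
f(1, stop("boom"))     ## ... so the stop() call is never evaluated
## [1] 1
```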
> We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.

Knuth, Donald. Structured Programming with go to Statements, ACM Computing Surveys, Vol. 6, No. 4, Dec. 1974, p. 268.
Optimisations often have their own costs:

- There is a trade-off between fast and safe, and between flexibility/functionality and performance.
- Exploiting assumptions about the data comes at the cost of generality.
Pat Burns reminds us that our first duty is to create clear, correct code. Only consider optimising your code when:

- the code is debugged and stable;
- optimisation is likely to make a significant impact.

Then:

- Find the major bottleneck: code profiling.
- Try to eliminate it.
- Repeat until fast enough: ideally, define *fast enough* in advance.
Check correctness, for example with `all.equal`:

```r
x <- runif(100)
all.equal(sqrt(x), x ^ 0.5)
## [1] TRUE
```

and/or with unit tests that compare the different implementations (and serve as regression tests).
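For example, a minimal unit test using the `testthat` package:

```r
library("testthat")
x <- runif(100)
expect_equal(sqrt(x), x ^ 0.5)  ## fails loudly if the results diverge
```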
library("sequences")
gccount
gccountr <- function(x) table(strsplit(x, "")[[1]])
gccountr2 <- function(x) tabulate(factor(strsplit(x, "")[[1]]))
Checking that our different implementations give the same results:

```r
s <- paste(sample(c("A", "C", "G", "T"),
                  100, replace = TRUE),
           collapse = "")
gccount(s)
gccountr(s)
gccountr2(s)
```
But are they really the same? Are we really comparing the same functionalities?
Is it worth it?
library("microbenchmark")
microbenchmark(gccount(s),
gccountr(s),
gccountr2(s),
times = 1e4,
unit = "eps")
library("ggplot2")
mb <- microbenchmark(gccount(s),
gccountr(s),
gccountr2(s))
print(mb)
microbenchmark:::autoplot.microbenchmark(mb)
Sampling or statistical profiling: at regular intervals, stop the execution and record which function is being executed and the call stack.
- `Rprof` (and `summaryRprof`): records and summarises timings at fixed intervals (the default `interval` is 0.02 seconds).
- `proftools` package: tools for examining `Rprof` profile output; comes with an extensive vignette.
- `profvis`: interactive visualisation of the profiling data collected by `Rprof`, with an emphasis on lines of code. See also the Introduction to profvis.
- (`lineprof` package [*]: each line of code is profiled. This is less precise than `Rprof`, but easier to interpret. Code must be sourced with `source()`.)
- (`profr` package [*]: provides an alternative data structure and visual rendering for the profiling information generated by `Rprof`.)
[*] `lineprof` and `profr` are now deprecated in favour of `profvis`.
```r
m <- matrix(rnorm(1e6), ncol = 10)
Rprof("rprof")
res <- apply(m, 1, mean, trim = .3)
Rprof(NULL)
summaryRprof("rprof")
```
profvis needs to `source()` the code, or to have the code entered directly, to have access to the individual lines.
```r
## contents of lineprof-example.R; pause() is provided by profvis
f <- function() {
    pause(0.1)
    g()
    h()
}
g <- function() {
    pause(0.1)
    h()
}
h <- function() {
    pause(0.1)
}
```

```r
library("profvis")
source("lineprof-example.R")
profvis(f())
```
- No profiling of C/C++ code, primitive functions, or byte-compiled code.
- Anonymous functions are labelled as anonymous; name them explicitly in such cases, as sketched after this list.
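A minimal sketch (`trimmed_mean` is a made-up name; `m` is the matrix from the example above):

```r
## the function literal below shows up as "anonymous" in the profile
res <- apply(m, 1, function(x) mean(x, trim = .3))
## naming it makes the profile readable
trimmed_mean <- function(x) mean(x, trim = .3)
res <- apply(m, 1, trimmed_mean)
```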
Profile the code chunk that calculates the trimmed means using `profvis` and interpret the results.
```r
profvis({
    m <- matrix(rnorm(1e6), ncol = 10)
    res <- apply(m, 1, mean, trim = .3)
    sum(res)
})
```
- `readr::read_csv` or `data.table::fread` instead of `read.csv`
- the `gccountr` vs `gccountr2` example above
- simpler data structures
- set `colClasses` in `read.csv`
- usual suspects: names, growing objects:
```r
make_id2GO <- function(n = 1e3) { ## could be 1e4 - 1e5
    gn <- sprintf(paste0("ENSG%0", 10, "d"), sample(1e6, n))
    goid <- function(n = 10) sprintf(paste0("GO:%0", 10, "d"), sample(1e6, n))
    structure(replicate(n, goid(sample(50, 1))),
              names = gn)
}
id2GO <- make_id2GO()
```
We have a list of 1000 genes, and each of these genes is characterised by a set of 1 to 50 GO terms. To obtain the GO terms, we `unlist` the gene list.
```r
length(id2GO)
str(head(id2GO))
str(unlist(id2GO))
```
This can be executed much faster if we ignore the names in the original list:

```r
library(microbenchmark)
microbenchmark(unlist(id2GO),
               unlist(id2GO, use.names = FALSE),
               times = 10)
```
```r
f1 <- function(n) {
    a <- NULL
    for (i in 1:n) a <- c(a, sqrt(i)) ## grows at each iteration
    a
}
f2 <- function(n) {
    a <- numeric(n)                   ## pre-allocated
    for (i in 1:n) a[i] <- sqrt(i)
    a
}
microbenchmark(f1(1e3), f2(1e3))
microbenchmark(f1(1e4), f2(1e4))
```
When an environment is passed as a function argument, it is not copied: all its values are accessible within the function and can be persistently modified.
```r
e <- new.env()
e$x <- 1
f <- function(myenv) myenv$x <- 2
f(e)
e$x
## [1] 2
```
This is used in the `eSet` and related microarray data structures to store the expression data.
Back to the `sqrt` loops above, compare the `*apply` and vectorised alternatives:

```r
f3 <- function(n)
    sapply(seq_len(n), sqrt)
f4 <- function(n) sqrt(seq_len(n)) ## truly vectorised
```
Code vectorisation is not only about avoiding loops at all costs and replacing them with `*apply`: as the benchmark below shows, that alone does not make any real difference in terms of speed.
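A quick comparison of all four implementations (a sketch, assuming `f1` to `f4` as defined above):

```r
microbenchmark(f1(1e4),
               f2(1e4),
               f3(1e4), ## comparable to the explicit loop in f2
               f4(1e4), ## vectorised: orders of magnitude faster
               times = 10)
```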
There is a difference between vectorisation in high-level code, to improve clarity (`apply`, `Vectorize`, ...), and vectorisation to improve performance, which involves re-writing `for` loops in C/C++ (see below).
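For instance, `Vectorize` improves the interface of a function, not its speed: it simply wraps the call in `mapply`:

```r
vrep <- Vectorize(rep.int) ## rep.int is not vectorised over its arguments
vrep(1:3, 3:1)             ## list of c(1, 1, 1), c(2, 2), 3
```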
For vectorisation in C/C++, see the `Rcpp` section below. For parallelisation, see the R-parallel material.
The `compiler::cmpfun` function compiles the body of a closure and returns a new closure with the same formals and the body replaced by the compiled body expression. It does not always provide a speed improvement, but it is very easy to use.
```r
lapply2 <- function(x, f, ...) {
    out <- vector("list", length(x))
    for (i in seq_along(x)) {
        out[[i]] <- f(x[[i]], ...)
    }
    out
}

lapply2_c <- compiler::cmpfun(lapply2)

x <- list(1:10, letters, c(FALSE, TRUE), NULL)

microbenchmark(
    lapply2(x, is.null),
    lapply2_c(x, is.null),
    lapply(x, is.null))
```
Note that all base R functions are already byte-compiled. This can be observed through the `<bytecode: 0x3858bb8>` line displayed when printing a function.
(See Chapter 18 in Advanced R for more details)
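For example, printing a base function such as `mean` displays its bytecode reference (the address will differ between sessions):

```r
mean
## function (x, ...)
## UseMethod("mean")
## <bytecode: 0x...>              ## session-specific address
## <environment: namespace:base>
```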
Assessing memory needs is useful to save memory in general and to limit memory access (reads/writes), which is a common bottleneck in R.
Requirements:

```r
library("pryr")
library("profvis")
```
```r
x <- 1:1e5
object.size(x)
print(object.size(x), units = "Kb")
object_size(x)
```

But `object.size` does not account for shared elements, nor for the size of environments.
```r
ll <- list(x, x, x)
print(object.size(ll), units = "Kb")
object_size(ll)
```
But this does not hold when there are no shared components:

```r
x <- 1:1e6
y <- list(1:1e6, 1:1e6, 1:1e6)
object_size(x)
object_size(y)
```
Environments:

```r
e <- new.env()
object.size(e)
object_size(e)
e$a <- 1:1e6
object.size(e)
object_size(e)
```
- What is the object size of an empty numeric vector?
- How does the size of a numeric vector grow with its size (say, from 0 to 50 elements)?
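One way to start exploring these questions (a sketch using `pryr::object_size`):

```r
object_size(numeric())  ## empty vector
sizes <- sapply(0:50, function(n) object_size(numeric(n)))
plot(0:50, sizes, xlab = "vector length", ylab = "size (bytes)")
```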
To get the total size of all objects created by R that currently occupy memory:

```r
mem_used()
```
To track memory changes:

```r
mem_change(v <- 1:1e6)
mem_change(rm(v))
```

```r
rm(list = ls())
mem_change(x <- 1:1e6)
mem_change(y <- x)
mem_change(rm(x))
mem_change(rm(y))
```
When objects in memory are no longer accessible from R, there is no need to free up the corresponding memory explicitly. This is done automatically by the garbage collector, as illustrated in the examples above. There is no need to call it explicitly with `gc()`; the only additional effect of doing so is that R explicitly returns memory to the OS.
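Calling it manually simply triggers a collection and reports current memory usage:

```r
gc() ## returns a summary table of used memory (Ncells/Vcells);
     ## not needed for memory management itself
```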
What happens in this case? Is `x` copied (and hence more memory is used up), or is it modified in place?
```r
x <- 1:10
c(address(x), refs(x))
x[5] <- 0L
c(address(x), refs(x))

y <- x
c(address(x), refs(x))
c(address(y), refs(y))

x[5] <- 1L
c(address(x), refs(x))
c(address(y), refs(y))
```
```r
x <- 1:5
y <- x
c(address(x), refs(x))
rm(y)
c(address(x), refs(x)) ## should be 1

x <- 1:5
y <- x
z <- x
c(address(x), refs(x)) ## should be 3
```
`tracemem` tracks the memory location of objects:
```r
x <- 1:10
tracemem(x)
x[5] <- 0L

y <- x
x[5] <- 10L

address(x)
address(y)
```
We have quantified the substantial cost in execution time of growing data dynamically. Trace the impact of dynamically growing a vector on memory. You can use `f1()` and `f2()` above, and trace `a`. See here.
- CRAN High-Performance and Parallel Computing task view.
- Storing data in databases or database-like structures: the `RMySQL`, `RdbiPgSQL`, ..., `RSQLite`, `sqldf`, `data.table` (with `data.table::fread` when `read.table` is slow; see also `scan`), `dplyr`, ... packages.
- The `ff` package by Adler et al. offers file-based access to data sets that are too large to be loaded into memory, along with a number of higher-level functions.
- The `bigmemory` package by Kane and Emerson permits storing large objects such as matrices in memory (as well as via files) and uses external pointer objects to refer to them.
- netCDF data files: the `ncdf` and `RNetCDF` packages.
- HDF5 format: the `rhdf5` package.
- `mmap`: memory-mapped files/devices I/O.
- hadoop and R.
- See http://r-pbd.org/ and the pbdDemo package/vignette.
- Bioconductor in the cloud.
- Bioconductor docker containers.
- ...