# Part IV: Performance

Laurent Gatto
- Benchmarking
- Profiling
- Optimisation
- Memory
- Rcpp
> We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.

Knuth, Donald. Structured Programming with go to Statements, ACM Computing Surveys, Vol. 6, No. 4, Dec. 1974, p. 268.
Robert Gentleman, in *R Programming for Bioinformatics*, 2008, about R's built-in C interfaces:

> Since R is not compiled, in some situations its performance can be substantially improved by writing code in a compiled language. There are also reasons not to write code in other languages, and in particular we caution against premature optimization: prototyping in R is often cost effective. And in our experience very few routines need to be implemented in other languages for efficiency reasons. Another substantial reason not to use an implementation in some other language is increased complexity. The use of another language almost always results in higher maintenance costs and less stability. In addition, any extensions or enhancements of the code will require someone that is proficient in both R and the other language.

(`Rcpp` does make some of the above cautionary statements slightly less critical.)
R is not a fast language, but most of the time it is fast enough for what we want to do, in particular for interactive data analysis. In such cases a slow but expressive and flexible language is far better than a fast but less expressive and less flexible alternative. It is also relatively easy to avoid the bad R programming idioms that make code slow.
Let's compare two implementations of the square root calculation: `sqrt(x)` and `x ^ 0.5`.

```r
x <- runif(100)
system.time(sqrt(x))
system.time(x^0.5)
```
Does this work? Try

```r
x <- runif(1e5)
system.time(sqrt(x))
system.time(x^0.5)
```
We want to repeat the timings multiple times:

```r
summary(replicate(10, system.time(x^0.5)[["elapsed"]]))
```
A better approach for such cases is the `microbenchmark` package, which is ideal for accurately benchmarking small pieces of code, in particular sub-millisecond (down to nanosecond) executions (see the units used below). Each expression is run 100 times (controlled by the `times` argument). In addition, the execution order is randomised and summary timings are reported.
```r
x <- runif(100)
library(microbenchmark)
microbenchmark(sqrt(x),
               x ^ 0.5)
```
Question: where does this difference come from?
Below are some illustrations of R's language design and implementation choices that account for part of its performance penalty (taken from *Advanced R*).
R is an extremely dynamic language in which almost any symbol can be modified at any point; the exceptions are a few reserved keywords (such as `TRUE`, `if`, `for`, ...), package namespaces, and locked environments with locked bindings. In particular, one can modify:

- the body, arguments, and environment of functions
```r
f <- function() 1
f()
body(f)
body(f) <- 2
f()
formals(f) <- pairlist(x = 1)
body(f) <- quote(x + 1)
f()
f(10)
```
- the S4 methods of a generic, with `setMethod(...)`, as sketched below
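A minimal sketch of such a run-time redefinition (the `Seq` class and `describe` generic are made up for illustration):

```r
setClass("Seq", slots = c(seq = "character"))
setGeneric("describe", function(x) standardGeneric("describe"))
setMethod("describe", "Seq", function(x) "a sequence")
s <- new("Seq", seq = "ATGC")
describe(s)   ## "a sequence"
## the method can be replaced at any point
setMethod("describe", "Seq", function(x) "something else entirely")
describe(s)   ## "something else entirely"
```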
- the fields of an S3 object, adding new ones at will
```r
ctl <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14)
trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69)
group <- gl(2, 10, 20, labels = c("Ctl", "Trt"))
weight <- c(ctl, trt)
mod <- lm(weight ~ group)
mod
mod$foo <- 1
mod
summary(mod)
names(mod)
```
- the class of an object
```r
class(mod) <- c("mod", "foo")
```
- objects outside of the local environment, with `<<-`, or in any environment with `assign(..., envir)`, as sketched below
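A minimal sketch of these non-local modifications (the `counter` variable is made up for illustration):

```r
counter <- 0
increment <- function()
    counter <<- counter + 1  ## modifies the binding in the enclosing scope
increment()
counter                      ## 1
assign("counter", 100, envir = globalenv())
counter                      ## 100
```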
While this dynamism provides a lot of flexibility, it means that the interpreter cannot make many assumptions or optimisations. For example, let's assess the cost of method lookup:
```r
## function
f <- function(x) NULL
## S3 method
s3 <- function(x) UseMethod("s3")
s3.integer <- f
## S4 method
A <- setClass("A", slots = c(a = "list"))
setGeneric("s4", function(x) standardGeneric("s4"))
setMethod("s4", "A", f)
a <- A()
## S4 Reference Class
B <- setRefClass("B", methods = list(rc = f))
b <- B$new()

microbenchmark(
    fun = f(),
    S3 = s3(1L),
    S4 = s4(a),
    RC = b$rc()
)
```
Finding the value of a symbol is also difficult, given that everything can be modified (extreme dynamism, see above) and that R uses lexical scoping.
```r
a <- 1
f <- function() {
    g <- function() {
        print(a) ## from the global workspace
        assign("a", 2, envir = parent.frame())
        print(a) ## f's environment
        a <- 3
        print(a) ## g's environment
    }
    g()
}
f()
## [1] 1
## [1] 2
## [1] 3
```
And by everything, we mean really nearly everything: `+`, `^`, `(`, and `{` are all functions that need to be looked up in

```r
f <- function(x, y) {
    (x + y) ^ 2
}
```
Another example: there are many ways to extract a single value from a data frame, with substantial differences in cost:

```r
microbenchmark(
    "[32, 11]" = mtcars[32, 11],
    "$carb[32]" = mtcars$carb[32],
    "[[c(11, 32)]]" = mtcars[[c(11, 32)]],
    "[[11]][32]" = mtcars[[11]][32],
    ".subset" = .subset(mtcars, 11)[32],
    ".subset2" = .subset2(mtcars, 11)[32]
)
```
Alternative R implementations, such as pqR, Renjin, FastR, and Riposte: several of these projects implement deferred evaluation, i.e. evaluations are only executed if and when they are really needed. We will see (and implement) an example in the `Rcpp` section.
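Note that R itself already defers the evaluation of function arguments (lazy evaluation); a minimal illustration:

```r
f <- function(x, y) x  ## y is never used ...
f(1, stop("boom"))     ## ... so the stop() call is never evaluated
## [1] 1
```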
> We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.

Knuth, Donald. Structured Programming with go to Statements, ACM Computing Surveys, Vol. 6, No. 4, Dec. 1974, p. 268.
Optimisations often have their own costs:

- There is a trade-off between fast and safe, and between flexibility/functionality and performance.
- Exploiting assumptions about the data comes at the cost of generality.
Pat Burns reminds us that our first duty is to create clear, correct code. Only consider optimising your code when:

- the code is debugged and stable;
- optimisation is likely to make a significant impact.

Then:

- Find the major bottleneck: code profiling.
- Try to eliminate it.
- Repeat until fast enough: ideally, define *fast enough* in advance.
Check correctness, for example with `all.equal`:

```r
x <- runif(100)
all.equal(sqrt(x), x ^ 0.5)
## [1] TRUE
```

and/or with unit tests that compare the different implementations (and serve as regression tests).
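For example, a minimal unit test using the `testthat` package:

```r
library("testthat")
x <- runif(100)
expect_equal(sqrt(x), x ^ 0.5)  ## fails loudly if the results diverge
```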
library("sequences")
gccount
gccountr <- function(x) table(strsplit(x, "")[[1]])
gccountr2 <- function(x) tabulate(factor(strsplit(x, "")[[1]]))
Checking that our different implementations give the same results:

```r
s <- paste(sample(c("A", "C", "G", "T"),
                  100, replace = TRUE),
           collapse = "")
gccount(s)
gccountr(s)
gccountr2(s)
```
But are they really the same? Are we really comparing the same functionalities?
Is it worth it?
library("microbenchmark")
microbenchmark(gccount(s),
gccountr(s),
gccountr2(s),
times = 1e4,
unit = "eps")
library("ggplot2")
mb <- microbenchmark(gccount(s),
gccountr(s),
gccountr2(s))
print(mb)
microbenchmark:::autoplot.microbenchmark(mb)
Sampling or statistical profiling: at regular intervals, stop the execution and record which function is being executed and the call stack.
- `Rprof` (and `summaryRprof`): records and summarises timings at fixed intervals (the default `interval` is 0.02 seconds).
- `proftools` package: tools for examining `Rprof` profile output; comes with an extensive vignette.
- `profvis`: interactive visualisation of the profiling data collected by `Rprof`, with an emphasis on lines of code. See also the Introduction to profvis.
- (`lineprof` package [*]: each line of code is profiled. This is less precise than `Rprof`, but easier to interpret. Code must be sourced with `source()`.)
- (`profr` package [*]: provides an alternative data structure and visual rendering for the profiling information generated by `Rprof`.)
[*] `lineprof` and `profr` are now deprecated in favour of `profvis`.
```r
m <- matrix(rnorm(1e6), ncol = 10)
Rprof("rprof")
res <- apply(m, 1, mean, trim = .3)
Rprof(NULL)
summaryRprof("rprof")
```
profvis needs to `source()` the code, or to have the code entered directly, to have access to the individual lines.
```r
## contents of lineprof-example.R; pause() is provided by profvis
f <- function() {
    pause(0.1)
    g()
    h()
}
g <- function() {
    pause(0.1)
    h()
}
h <- function() {
    pause(0.1)
}
```

```r
library("profvis")
source("lineprof-example.R")
profvis(f())
```
- No profiling of C/C++ code, primitive functions, or byte-compiled code.
- Anonymous functions are labelled as anonymous; name them explicitly in such cases, as sketched after this list.
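A minimal sketch (`trimmed_mean` is a made-up name; `m` is the matrix from the example above):

```r
## the function literal below shows up as "anonymous" in the profile
res <- apply(m, 1, function(x) mean(x, trim = .3))
## naming it makes the profile readable
trimmed_mean <- function(x) mean(x, trim = .3)
res <- apply(m, 1, trimmed_mean)
```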
Profile the code chunk that calculates the trimmed means using `profvis` and interpret the results.
```r
profvis({
    m <- matrix(rnorm(1e6), ncol = 10)
    res <- apply(m, 1, mean, trim = .3)
    sum(res)
})
```
- `readr::read_csv` or `data.table::fread` instead of `read.csv`
- the `gccountr` vs `gccountr2` example above
- simpler data structures
- set `colClasses` in `read.csv`
- usual suspects: names, growing objects:
```r
make_id2GO <- function(n = 1e3) { ## could be 1e4 - 1e5
    gn <- sprintf(paste0("ENSG%0", 10, "d"), sample(1e6, n))
    goid <- function(n = 10) sprintf(paste0("GO:%0", 10, "d"), sample(1e6, n))
    structure(replicate(n, goid(sample(50, 1))),
              names = gn)
}
id2GO <- make_id2GO()
```
We have a list of 1000 genes, and each of these genes is characterised by a set of 1 to 50 GO terms. To obtain the GO terms, we `unlist` the gene list.
```r
length(id2GO)
str(head(id2GO))
str(unlist(id2GO))
```
This can be executed much faster if we ignore the names in the original list:

```r
library(microbenchmark)
microbenchmark(unlist(id2GO),
               unlist(id2GO, use.names = FALSE),
               times = 10)
```
```r
f1 <- function(n) {
    a <- NULL
    for (i in 1:n) a <- c(a, sqrt(i)) ## grows at each iteration
    a
}
f2 <- function(n) {
    a <- numeric(n)                   ## pre-allocated
    for (i in 1:n) a[i] <- sqrt(i)
    a
}
microbenchmark(f1(1e3), f2(1e3))
microbenchmark(f1(1e4), f2(1e4))
```
When an environment is passed as a function argument, it is not copied: all its values are accessible within the function and can be persistently modified.
```r
e <- new.env()
e$x <- 1
f <- function(myenv) myenv$x <- 2
f(e)
e$x
## [1] 2
```
This is used in the `eSet` and related microarray data structures to store the expression data.
Back to the `sqrt` loops above, compare the `*apply` and vectorised alternatives:

```r
f3 <- function(n)
    sapply(seq_len(n), sqrt)
f4 <- function(n) sqrt(seq_len(n)) ## truly vectorised
```
Code vectorisation is not only about avoiding loops at all costs and replacing them with `*apply`: as the benchmark below shows, that alone does not make any real difference in terms of speed.
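A quick comparison of all four implementations (a sketch, assuming `f1` to `f4` as defined above):

```r
microbenchmark(f1(1e4),
               f2(1e4),
               f3(1e4), ## comparable to the explicit loop in f2
               f4(1e4), ## vectorised: orders of magnitude faster
               times = 10)
```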
There is a difference between vectorisation in high-level code, to improve clarity (`apply`, `Vectorize`, ...), and vectorisation to improve performance, which involves re-writing `for` loops in C/C++ (see below).
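For instance, `Vectorize` improves the interface of a function, not its speed: it simply wraps the call in `mapply`:

```r
vrep <- Vectorize(rep.int) ## rep.int is not vectorised over its arguments
vrep(1:3, 3:1)             ## list of c(1, 1, 1), c(2, 2), 3
```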
For vectorisation in C/C++, see the `Rcpp` section below. For parallelisation, see the R-parallel material.
The `compiler::cmpfun` function compiles the body of a closure and returns a new closure with the same formals and the body replaced by the compiled body expression. It does not always provide a speed improvement, but it is very easy to use.
```r
lapply2 <- function(x, f, ...) {
    out <- vector("list", length(x))
    for (i in seq_along(x)) {
        out[[i]] <- f(x[[i]], ...)
    }
    out
}

lapply2_c <- compiler::cmpfun(lapply2)

x <- list(1:10, letters, c(FALSE, TRUE), NULL)

microbenchmark(
    lapply2(x, is.null),
    lapply2_c(x, is.null),
    lapply(x, is.null))
```
Note that all base R functions are already byte-compiled. This can be observed through the `<bytecode: 0x3858bb8>` line displayed when printing a function.
(See Chapter 18 in Advanced R for more details)
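For example, printing a base function such as `mean` displays its bytecode reference (the address will differ between sessions):

```r
mean
## function (x, ...)
## UseMethod("mean")
## <bytecode: 0x...>              ## session-specific address
## <environment: namespace:base>
```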
Assessing memory needs is useful to save memory in general and to limit memory access (reads/writes), which is a common bottleneck in R.
Requirements:

```r
library("pryr")
library("profvis")
```
```r
x <- 1:1e5
object.size(x)
print(object.size(x), units = "Kb")
object_size(x)
```

But `object.size` does not account for shared elements, nor for the size of environments.
```r
ll <- list(x, x, x)
print(object.size(ll), units = "Kb")
object_size(ll)
```
But this does not hold when there are no shared components:

```r
x <- 1:1e6
y <- list(1:1e6, 1:1e6, 1:1e6)
object_size(x)
object_size(y)
```
Environments:

```r
e <- new.env()
object.size(e)
object_size(e)
e$a <- 1:1e6
object.size(e)
object_size(e)
```
- What is the object size of an empty numeric vector?
- How does the size of a numeric vector grow with its size (say, from 0 to 50 elements)?
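One way to start exploring these questions (a sketch using `pryr::object_size`):

```r
object_size(numeric())  ## empty vector
sizes <- sapply(0:50, function(n) object_size(numeric(n)))
plot(0:50, sizes, xlab = "vector length", ylab = "size (bytes)")
```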
To get the total size of all objects created by R that currently occupy memory:

```r
mem_used()
```
To track memory changes:

```r
mem_change(v <- 1:1e6)
mem_change(rm(v))
```

```r
rm(list = ls())
mem_change(x <- 1:1e6)
mem_change(y <- x)
mem_change(rm(x))
mem_change(rm(y))
```
When objects in memory are no longer accessible from R, there is no need to free up the corresponding memory explicitly. This is done automatically by the garbage collector, as illustrated in the examples above. There is no need to call it explicitly with `gc()`; the only additional effect of doing so is that R explicitly returns memory to the OS.
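Calling it manually simply triggers a collection and reports current memory usage:

```r
gc() ## returns a summary table of used memory (Ncells/Vcells);
     ## not needed for memory management itself
```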
What happens in this case? Is `x` copied (and hence more memory is used up), or is it modified in place?
```r
x <- 1:10
c(address(x), refs(x))
x[5] <- 0L
c(address(x), refs(x))

y <- x
c(address(x), refs(x))
c(address(y), refs(y))

x[5] <- 1L
c(address(x), refs(x))
c(address(y), refs(y))
```
```r
x <- 1:5
y <- x
c(address(x), refs(x))
rm(y)
c(address(x), refs(x)) ## should be 1

x <- 1:5
y <- x
z <- x
c(address(x), refs(x)) ## should be 3
```
`tracemem` tracks the memory location of objects:
```r
x <- 1:10
tracemem(x)
x[5] <- 0L

y <- x
x[5] <- 10L

address(x)
address(y)
```
We have quantified the substantial cost in execution time of growing data dynamically. Trace the impact of dynamically growing a vector on memory. You can use `f1()` and `f2()` above, and trace `a`. See here.
- CRAN High-Performance and Parallel Computing task view.
- Storing data in databases or database-like structures: the `RMySQL`, `RdbiPgSQL`, ..., `RSQLite`, `sqldf`, `data.table` (with `data.table::fread` when `read.table` is slow; see also `scan`), `dplyr`, ... packages.
- The `ff` package by Adler et al. offers file-based access to data sets that are too large to be loaded into memory, along with a number of higher-level functions.
- The `bigmemory` package by Kane and Emerson permits storing large objects such as matrices in memory (as well as via files) and uses external pointer objects to refer to them.
- netCDF data files: the `ncdf` and `RNetCDF` packages.
- HDF5 format: the `rhdf5` package.
- `mmap`: memory-mapped files/devices I/O.
- hadoop and R.
- See http://r-pbd.org/ and the pbdDemo package/vignette.
- Bioconductor in the cloud.
- Bioconductor docker containers.
- ...