Skip to content

Latest commit

 

History

History
504 lines (367 loc) · 8.29 KB

IntroductionControlFlowR.md

File metadata and controls

504 lines (367 loc) · 8.29 KB

Control Flow and loops in R ############################

Control Flow

The standard if else

p.test <- function(p) {
    if (p <= 0.05) 
        print("yeah!!!!") else if (p >= 0.9) 
        print("high!!!!") else print("somewhere in the middle")
}

Now pick a number and put it in p.test

p.test(0.5)
## [1] "somewhere in the middle"

ifelse()

A better and vectorized way of doing this is ifelse(test, yes, no) function. ifelse() is far more useful as it is vectorized.

p.test.2 <- function(p) {
    ifelse(p <= 0.05, print("yippee"), print("bummer, man"))
}

Test this with the following sequence. See what happens if you use if vs. ifelse().

x <- runif(10, 0, 1)
x
##  [1] 0.27332 0.14155 0.89000 0.07041 0.79419 0.25013 0.02324 0.86766
##  [9] 0.41114 0.56165

Now try it with p.test() (uses if).

p.test(x)
## Warning: the condition has length > 1 and only the first element will be used
## Warning: the condition has length > 1 and only the first element will be used
## [1] "somewhere in the middle"

Now try it with p.test.2()

p.test.2(x)
## [1] "yippee"
## [1] "bummer, man"
##  [1] "bummer, man" "bummer, man" "bummer, man" "bummer, man" "bummer, man"
##  [6] "bummer, man" "yippee"      "bummer, man" "bummer, man" "bummer, man"

Other vectorized ways of control flow.

There are many times that you may think you need to use an if with (iterating with a for loop... see below), or ifelse, but there may be far better ways.

For instance, say you are doing some simulations for a power analysis, and you want to know how often your simulation gives you a p-value less than 0.05.

p.1000 <- runif(n = 1000, min = 0, max = 1)

The line above generates 1000 random values between 0-1, which we will pretend are our p-values for differential expression from our simulation.

You may try and count how often it less than 0.05

p.ifelse <- ifelse(p.1000 < 0.05, 1, 0)  # If it is less than 0.05, then you get a 1, otherwise 0. 

Our approximate false positives. Should be close to 0.05

sum(p.ifelse)/length(p.1000)
## [1] 0.059

In R, think index!

However the best and fastest way to accomplish this is to use the index, by setting up the Boolean (TRUE/FALSE) in the index of the vector.

length(p.1000[p.1000 < 0.05])/length(p.1000)
## [1] 0.059

Same number, faster and simpler computation.

Simple loops

while() function..

I tend to avoid these, so you will not see them much here

i <- 1
while (i <= 10) {
    print(i)
    i <- i + 0.5
}
## [1] 1
## [1] 1.5
## [1] 2
## [1] 2.5
## [1] 3
## [1] 3.5
## [1] 4
## [1] 4.5
## [1] 5
## [1] 5.5
## [1] 6
## [1] 6.5
## [1] 7
## [1] 7.5
## [1] 8
## [1] 8.5
## [1] 9
## [1] 9.5
## [1] 10

for loop

If I run a loop I most often use for(){} automatically iterates across a list (in this case the sequence from 1:10).

for (i in 1:10) {
    print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

If you do not want to use integers, how might you do it using the for()?

for (i in seq(from = 1, to = 5, by = 0.5)) {
    print(i)
}
## [1] 1
## [1] 1.5
## [1] 2
## [1] 2.5
## [1] 3
## [1] 3.5
## [1] 4
## [1] 4.5
## [1] 5

behavior of strings.

Using strings is a bit more involved in R, compared to other languages. For instance the following does not do what you want::

for (letter in "word") {
    print(letter)
}
## [1] "word"

(try letters for a hoot.)

Instead in R, we have to split the word "word" into single characters using strsplit(), i.e::

strsplit("word", split = "")
## [[1]]
## [1] "w" "o" "r" "d"

So for the for loop we would do the following:

for (letter in strsplit("word", split = "")) {
    print(letter)
}
## [1] "w" "o" "r" "d"

More on avoiding loops

Many would generate random numbers like so.

for (i in 1:100) {
    print(rnorm(n = 1, mean = 0, sd = 1))
}
## [1] -0.1837
## [1] -0.9313
## [1] 1.648
## [1] -0.6964
## [1] 0.2112
## [1] 0.3441
## [1] 1.036
## [1] 0.7439
## [1] 0.5859
## [1] -0.6087
## [1] -0.4014
## [1] 1.44
## [1] -0.3906
## [1] -1.861
## [1] -0.739
## [1] -1.204
## [1] 0.07794
## [1] -1.65
## [1] 1.261
## [1] 0.6753
## [1] 0.6736
## [1] 0.3238
## [1] -1.316
## [1] 0.2965
## [1] 1.499
## [1] 0.4326
## [1] 0.4488
## [1] 0.8873
## [1] -1.304
## [1] -0.347
## [1] 0.3491
## [1] 0.24
## [1] 0.1425
## [1] -0.2785
## [1] -0.5072
## [1] -1.775
## [1] -0.04051
## [1] 0.9452
## [1] 0.3322
## [1] -0.01994
## [1] -0.2308
## [1] -0.4053
## [1] -0.5685
## [1] -1.631
## [1] -0.1484
## [1] 0.434
## [1] 1.653
## [1] 1.57
## [1] 0.1308
## [1] -1.059
## [1] -0.7157
## [1] -0.8316
## [1] 0.06561
## [1] 0.8243
## [1] 0.1841
## [1] 1.048
## [1] 0.1612
## [1] -0.9553
## [1] -0.7569
## [1] -0.288
## [1] -1.837
## [1] 0.7301
## [1] -2.103
## [1] -1.869
## [1] -1.298
## [1] -1.077
## [1] -0.2139
## [1] -0.9419
## [1] 0.4694
## [1] -1.344
## [1] -0.08514
## [1] -2.055
## [1] -0.803
## [1] -0.7281
## [1] 1.778
## [1] -1.116
## [1] 1.33
## [1] 0.1535
## [1] -2.897
## [1] 0.7305
## [1] 1.228
## [1] 1.697
## [1] -0.8183
## [1] -1.013
## [1] -0.634
## [1] -0.942
## [1] -0.3395
## [1] 0.1396
## [1] 1.022
## [1] 0.9868
## [1] -0.7778
## [1] 1.075
## [1] -0.1029
## [1] 0.2644
## [1] 0.01165
## [1] 0.8025
## [1] -1.24
## [1] -0.8865
## [1] 0.981
## [1] 0.5333

We are cycling through and generating one random number at each iteration. Look at the indices, and you can see we keep generating vectors of length 1.

better/cleaner/faster to generate them all at one time

rnorm(n = 100, mean = 0, sd = 1)
##   [1] -0.08683 -1.55262 -1.16909  0.30451 -1.14555  0.76682  0.12643
##   [8] -0.61174 -0.29103 -0.10707 -0.03397 -0.05926  0.27294  1.32693
##  [15] -0.53284  1.83234  0.43959 -0.88991  0.25383  0.96709 -0.23210
##  [22] -1.00190 -1.32289  1.80030  1.15272 -1.82907  0.75989  1.35966
##  [29]  0.53943  0.01429 -0.58707 -0.11886 -0.70367 -2.38988  0.08033
##  [36] -0.22795 -0.62166 -0.19832 -1.95990 -0.85127  0.94236  0.37771
##  [43]  0.32617 -0.08393 -0.54506 -2.58781 -0.58433  0.20985 -0.41613
##  [50]  0.60527  0.51713  1.57950 -0.61079 -0.28564 -0.16444  0.55007
##  [57]  0.57258  0.58513 -0.86728 -0.81185 -0.29333 -1.23935  0.46169
##  [64] -1.53586 -0.32583  0.17629 -0.85579  1.04989  1.22120  1.53359
##  [71] -2.37276  1.44393  1.47506  0.40110 -0.10157  0.35485 -0.72068
##  [78] -1.27910  0.63152 -0.65216  1.60160  0.27109  0.50904 -1.00531
##  [85]  0.76743 -0.78954 -0.01159  1.06944  1.15661 -0.91031  1.54919
##  [92] -0.84334  2.19994  0.26716  0.02081  0.53577  0.07840 -0.79387
##  [99] -1.18941  1.24745

What if we wanted to put all of these numbers in a vector?

The inadvisable approach

First we initialize a vector to store all of the numbers. Why do we initialize this vector first?

n <- 1e+05
x <- rep(NA, n)

The step above creates a vector of n NA's. They will be replaced sequentially with the random numbers as we generate them (using a function like the above one).

head(x)
## [1] NA NA NA NA NA NA

Now we run the for loop.

for (i in 1:n) {
    x[i] <- rnorm(n = 1, mean = 0, sd = 1)
}

for each i in the index, one number is generated, and placed in x

head(x)
## [1]  0.2848 -0.5432  1.1391 -1.0901  0.8515  0.5490

However this is computationally inefficient in R. Which has vectorized operations.

system.time(

for (i in 1:n){
	x[i] <- rnorm(n=1, mean=0, sd=1)})
##    user  system elapsed 
##   0.562   0.023   0.584

We can also use the replicate function to do the same thing. Easier syntax to write.

system.time(z <- replicate(n, rnorm(n = 1, mean = 0, sd = 1)))
##    user  system elapsed 
##   0.561   0.035   0.841

This is ~20% faster.

The way to do it

However, since R is vectorized, both of the will be far slower than:

system.time(y <- rnorm(n, 0, 1))
##    user  system elapsed 
##   0.010   0.000   0.011

About 65 times faster than the for loop

The general rule in R is that loops are slower than the apply family of functions (for small to medium data sets, not true for very large data) which are slower than vectorized computations.