---
title: "Introduction to Solving Biological Problems Using R - Day 2"
author: Mark Dunning, Suraj Menon and Aiora Zabala. Original material by Robert Stojnić,
Laurent Gatto, Rob Foy, John Davey, Dávid Molnár and Ian Roberts
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output:
html_notebook:
toc: yes
toc_float: yes
---
# 2. Statistics
## Built-in support for statistics
- R is a statistical programming language:
+ Classical statistical tests are built-in
+ Statistical modeling functions are built-in
+ Regression analysis is fully supported
+ Additional mathematical packages are available (`MASS`, Waves, sparse matrices, etc)
## Distribution functions
- Most commonly-used distributions are built-in; the functions have stereotypical names, e.g. for the normal distribution:
+ **`pnorm`** - cumulative distribution function (probability of observing a value less than or equal to x)
+ **`qnorm`** - quantile function, the inverse of `pnorm` (from a probability, gives x)
+ **`dnorm`** - probability density at x
+ **`rnorm`** - random draws from the normal distribution
![distributions](images/distributions.png)
- Available for a variety of distributions: `punif` (uniform), `pbinom` (binomial), `pnbinom` (negative binomial), `ppois` (Poisson), `pgeom` (geometric), `phyper` (hypergeometric), `pt` (Student's t), `pf` (F distribution)
- 10 random values from the Normal distribution with mean 10 and standard deviation 5:
```{r}
rnorm(10, mean=10, sd=5)
```
- The probability density at 10 (for a continuous variable, `dnorm` gives a density, not a probability):
```{r}
dnorm(10, mean=10, sd=5)
```
- ...and at a value far from the mean:
```{r}
dnorm(100, mean=10, sd=5)
```
- The probability of drawing a value smaller than 10:
```{r}
pnorm(10, mean=10, sd=5)
```
- The inverse of `pnorm()`:
```{r}
qnorm(0.5, mean=10, sd=5)
```
- How many standard deviations above the mean correspond to the upper 5% tail (i.e. one-sided significance at the 0.05 level)?
```{r}
qnorm(0.95, mean=0, sd=1)
```
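As a quick sanity check (not part of the course material), `qnorm()` and `pnorm()` are inverses of each other; the value 0.975 is used here because a two-sided test at the 5% level splits the probability between both tails:

```r
# qnorm() and pnorm() are inverse functions of each other
z <- qnorm(0.975, mean = 0, sd = 1)  # two-sided 5% critical value, roughly 1.96
pnorm(z, mean = 0, sd = 1)           # recovers the original probability, 0.975
```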
## Example
Recall our histogram of Wind Speed from yesterday:
- The data look to be roughly normally-distributed
- An assumption we rely on for various statistical tests
```{r}
weather <- read.csv("ozone.csv")
hist(weather$Wind, col="steelblue", xlab="Wind Speed",
main="Distribution of Wind Speed",
breaks = 20, freq=FALSE)
```
## Create a normal distribution curve
- If our data are normally-distributed, we can calculate the probability of drawing particular values.
+ e.g. a Wind Speed of 10
```{r}
windMean <- mean(weather$Wind)
windSD <- sd(weather$Wind)
dnorm(10, mean=windMean, sd=windSD)
```
- We can overlay this on the histogram using `points` as we just saw:
```{r}
hist(weather$Wind, col="steelblue", xlab="Wind Speed",
main="Distribution of Wind Speed",
breaks = 20, freq=FALSE)
points(10, dnorm(10, mean=windMean, sd=windSD),
col="red", pch=16)
```
- We can repeat the calculation for a vector of values
+ remember that functions in R are often ***vectorized***
+ use `lines` in this case rather than `points`
```{r}
xs <- c(0,5,10,15,20)
ys <- dnorm(xs, mean=windMean, sd=windSD)
hist(weather$Wind, col="steelblue", xlab="Wind Speed",
main="Distribution of Wind Speed",
breaks = 20, freq=FALSE)
lines(xs, ys, col="red")
```
- For a smoother curve, use a longer vector:
+ we can generate x values using the `seq()` function
- Inspecting the data in this manner is usually acceptable to assess normality
+ the fit doesn't have to be exact
+ the Shapiro-Wilk test is also available (`?shapiro.test`), though formal normality tests are not generally recommended by statisticians
```{r}
hist(weather$Wind,col="steelblue",xlab="Wind Speed",
main="Distribution of Wind Speed",breaks = 20,freq=FALSE)
xs <- seq(0,20, length.out = 10000)
ys <- dnorm(xs, mean=windMean,sd=windSD)
lines(xs,ys,col="red")
```
## Simple testing
- If we want to compute the probability of observing a particular Wind Speed, from the same distribution, we can use the standard formula to calculate a t statistic:
$$t = \frac{\bar{x} -\mu_0}{s / \sqrt{n}}$$
- Say a Wind Speed of 2, which from the histogram seems unlikely
```{r}
t <- (windMean - 2) / (windSD/sqrt(length(weather$Wind)))
t
```
- ...or use the **`t.test()`** function to compute the statistic and corresponding p-value
```{r}
t.test(weather$Wind, mu=2)
```
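The p-value reported by `t.test()` can also be recovered by hand from the t statistic using `pt()`. A sketch with simulated data (the sample `x` here is illustrative, since it assumes you may not have the `weather` data loaded):

```r
set.seed(42)                              # for reproducibility
x <- rnorm(100, mean = 10, sd = 5)        # illustrative sample
n <- length(x)
t_stat <- (mean(x) - 2) / (sd(x) / sqrt(n))
# two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual <- 2 * pt(-abs(t_stat), df = n - 1)
p_manual
t.test(x, mu = 2)$p.value                 # matches the manual calculation
```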
- A variety of tests are supported, and the R authors have tried to make their interfaces as consistent as possible
```{r}
?var.test
?t.test
?wilcox.test
?prop.test
?cor.test
?chisq.test
?fisher.test
```
- Bottom-line: Pretty much any statistical test you care to name will probably be in R
+ This is not supposed to be a statistics course (sorry!)
+ None of them is particularly harder than the others to use
+ The difficulty is deciding which test to use:
+ whether the assumptions of the test are met, etc.
+ Consult your local statistician if not sure
+ An upcoming course that will help
+ [Introduction to Statistical Analysis](http://bioinformatics-core-shared-training.github.io/IntroductionToStats/)
+ Some good references:
+ [Statistical Analysis Using R (Course from the Babraham Bioinformatics Core)](http://training.csx.cam.ac.uk/bioinformatics/event/1827771)
+ [Quick-R guide to stats](http://www.statmethods.net/stats/index.html)
+ [Simple R eBook](https://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf)
+ [R wiki](https://en.wikibooks.org/wiki/R_Programming/Descriptive_Statistics)
+ [R tutor](http://www.r-tutor.com/elementary-statistics)
+ [Statistical Cheatsheet](https://rawgit.com/bioinformatics-core-shared-training/intermediate-stats/master/cheatsheet.pdf)
- R will do whatever you ask of it; it will never check the assumptions or help you interpret the result
![](images/clippy.jpg)
## Example analysis
- We have already seen that men in our `patients` dataset tend to be heavier than women
- We can **test this formally** in R
+ N.B. the weight of people in a population tends to be normally distributed, so it is probably safe to use a parametric test
![](images/males-versus-females.png)
We need to run this if we don't have the patients data in our R environment
```{r}
patients <- read.delim("patient-info.txt")
```
We can test whether the variances of the two groups are equal
+ this will influence which flavour of t-test we use
```{r}
var.test(patients$Weight~patients$Sex)
```
```{r}
t.test(patients$Weight~patients$Sex, var.equal=FALSE)
```
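The result of `t.test()` is a list, so individual components can be extracted programmatically. A sketch with illustrative weights (made-up numbers, used here so the example runs without the `patients` file):

```r
g1 <- c(75, 82, 78, 90, 85)             # illustrative weights, group 1
g2 <- c(62, 68, 65, 70, 66)             # illustrative weights, group 2
result <- t.test(g1, g2, var.equal = FALSE)
result$p.value                          # the p-value alone
result$conf.int                         # the 95% confidence interval
names(result)                           # all available components
```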
If we were unwilling to make an assumption of normality, a non-parametric test could be used.
```{r}
wilcox.test(patients$Weight~patients$Sex)
```
## Linear Regression
- Linear modeling is supported by the function `lm()`:
+ `example(lm)`
- The output assumes you know a fair bit about the subject
- `lm` is really useful for plotting lines of best fit to XY data, in order to determine intercept, gradient and Pearson's correlation coefficient
+ This is very easy in R
- Three steps to plotting with a best fit line:
+ Plot XY scatter-plot data
+ Fit a linear model
+ Add the best-fit line to the plot with the `abline()` function
- Let's see a toy example:
```{r}
x <- c(1, 2.3, 3.1, 4.8, 5.6, 6.3)
y <- c(2.6, 2.8, 3.1, 4.7, 5.1, 5.3)
plot(x,y, xlim=c(0,10), ylim=c(0,10))
```
- The `~` is used to define a formula; i.e. "y is given by x"
+ Take care about the order of `x` and `y` in the `plot` and `lm` expressions
```{r}
plot(x,y, xlim=c(0,10), ylim=c(0,10))
myModel <- lm(y~x)
abline(myModel, col="red")
```
The generic `summary` function gives an overview of the model
- particular components are accessible if we know their names
```{r}
summary(myModel)
names(myModel) # Names of the objects within myModel
```
- Alternatively, various helper functions are provided:
```{r}
coef(myModel) # Coefficients
resid(myModel) # Residuals
fitted(myModel) # Fitted values
residuals(myModel) + fitted(myModel) # what values does this give?
```
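To see what the sum in the last line above gives, note that by definition each observation is its fitted value plus its residual, so the sum reconstructs the original response. A quick check using the toy data from earlier:

```r
x <- c(1, 2.3, 3.1, 4.8, 5.6, 6.3)
y <- c(2.6, 2.8, 3.1, 4.7, 5.1, 5.3)
myModel <- lm(y ~ x)
# y = fitted value + residual for every observation
all.equal(unname(residuals(myModel) + fitted(myModel)), y)
```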
You can also get some diagnostic information on the model.
- Some explanation can be found [here](http://data.library.virginia.edu/diagnostic-plots/) and [here](http://strata.uga.edu/6370/rtips/regressionPlots.html)
```{r}
par(mfrow=c(2,2))
plot(myModel)
```
- R has a very powerful formula syntax for describing statistical models
- Suppose we had two explanatory variables `x` and `z`, and one response variable `y`
- We can describe a relationship between, say, `y` and `x` using a tilde `~`, placing the response variable on the left of the tilde and the explanatory variables on the right:
+ y~x
- It is very easy to extend this syntax to do multiple regressions, ANOVAs, to include interactions, and to do many other common modelling tasks. For example
```{r}
y~x #If x is continuous, this is linear regression
y~x #If x is categorical, ANOVA
y~x+z #If x and z are continuous, multiple regression
y~x+z #If x and z are categorical, two-way ANOVA
y~x+z+x:z # : is the symbol for the interaction term
y~x*z # * is a shorthand for x+z+x:z
```
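Once a model has been fitted with a formula, the generic `predict()` function can apply it to new data. This sketch reuses the toy data from above (`newData` is an illustrative name):

```r
x <- c(1, 2.3, 3.1, 4.8, 5.6, 6.3)
y <- c(2.6, 2.8, 3.1, 4.7, 5.1, 5.3)
myModel <- lm(y ~ x)
# predict() expects a data frame whose column names match the formula variables
newData <- data.frame(x = c(2, 4, 8))
predict(myModel, newdata = newData)
```

Note that `predict()` will happily extrapolate beyond the range of the original data (x = 8 here), which may or may not be sensible for your problem.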
## Exercise 6
- There are suggestions that Ozone level could be influenced by Temperature:
```{r}
plot(weather$Temp, weather$Ozone,xlab="Temperature",ylab="Ozone level",pch=16)
```
- Perform a linear regression analysis to assess this:
+ Fit the linear model and print a summary of the output
+ Plot the two variables and overlay a best-fit line
+ What is the equation of the best-fit line in the form
$$ y = ax + b$$
- Calculate the correlation between the two variables using the `cor()` function (see `?cor`)
+ can you add text to the plot to show the correlation coefficient?
+ HINT: The `paste()` function can be used to join strings of text and variables together
```{r}
paste("Hello","World")
age <- 35
paste("My age is", age)
```
```{r}
## Your Answer Here ##
```
![](images/exercise6.png)
![](images/exercise6b.png)
## More words of warning
***Correlation != Causation***
![](images/spurious.png)
http://tylervigen.com/spurious-correlations
![](images/nobel-prize.jpeg)
[So if I want to win a Nobel Prize, I should eat even more chocolate?!](http://www.businessinsider.com/chocolate-consumption-vs-nobel-prizes-2014-4?IR=T)
But no-one would ever take such trends seriously... would they?
![Cutting-down on Ice Cream was recommended as a safeguard against polio!](images/icecreamvspolio.jpg)