-
Notifications
You must be signed in to change notification settings - Fork 0
/
11-ModelsStatistics.Rmd
137 lines (85 loc) · 5.39 KB
/
11-ModelsStatistics.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
# DESeq2 Model Design
DESeq2 can handle some complicated model designs.
This is the code we ran in the differential expression chapter but you can slot in other designs. Below, only the designs are shown.
```{R, eval=FALSE}
DESeqDataset = DESeqDataSetFromMatrix(countData=sampleCounts, colData=sampleTable, design = ~ condition)
```
You can check the names of the comparisons created by DESeq2 like this.
```{R, eval=FALSE}
resultsNames(dds)
```
And then you can pull the results for a specific comparison like this (default = the last comparison).
```{R, eval=FALSE}
res = results(dds, name="condition_treatment_vs_control")
```
## One Factor
This design matches what we did earlier. It works with 2 levels such as treatment/contol but it also works if you have more than 2 levels. With 3 levels for instance, it will compare two of the levels to the reference level. Make sure you set a reference level (see the Differential Expression chapter) or it will use the first one alphabetically.
```{R, eval=FALSE}
~ treatment
```
**Examples**
*1 factor, 2 levels*
water stressed mice
control mice (reference level)
*1 factor, 3 levels*
water stressed mice
social isolation stressed mice
control mice (reference level)
## Two Factors
You can model two factors such as genotype and treatment where you care about the main effects of both factors. The order matters. DESeq2 will adjust the model for the first one (in this case, "genotype") and then adjust for the second one (in this case, "treatement"). The last factor should be the one you are most interested in.
Main effects only:
```{R, eval=FALSE}
~ genotype + treatment
```
Main effects + interaction:
```{R, eval=FALSE}
~ genotype + treatment + genotype:treatment
```
Controlling for one of the factors:
```{R, eval=FALSE}
~ batch + treatment
```
**Examples**
In this example you would likely be interested in the effect of the genotype and of the treatment. You might also be interested in the effect of the genotype*treatment interaction.
*2 factors, 2 levels each*
Treatment
water stressed mice
control mice (reference level)
Genotype
Wildtype mouse
Mutant mouse
In this example you would probably only be interested in the effect of the treatment but would want to control for the batch effect.
*2 factors, 2 levels each*
Treatment
water stressed mice
control mice (reference level)
Batch
Lab technician 1
Lab technician 2
## A little bit on statistical tests
The default statistical test is the WALD test. It checks to see whether a factor or set of factors significantly contribute to a model. If they aren't significantly contributing, they can be left out. This test is used by DESeq2 for pair-wise comparisons. It is an approximation of the likelihood ratio test (LRT), which can also be applied in DESeq2.
You can use LRT to test whether factors are affecting your gene expression levels by comparing a full model with a reduced model that has one or more factors removed and calculating the likelihood ratio between the two models. If the likelihood of the full model is not much more than the likelihood of the reduced model, then the factors you are testing are not improving the model and can be left out.
The Wald test is an approximation of LRT that requires one model estimation where the LRT requires two. As such, it is faster and simpler and usually yields a comparable result. But the LRT is more robust and can go beyond pairwise comparisons.
## Time series
While times series can be analyzed by a series of pairwise comparisons (ie. comparing each timepoint back to time 0), the LRT model allows us to perform time series and other non-pairwise comparisons using more than two levels of a factor. For example, we can look at gene expression patterns across multiple timepoints.
The full model:
```{R, eval=FALSE}
~ treatment + time + treatment:time
```
The reduced model:
```{R, eval=FALSE}
~ treatment + time
```
The LRT could be run like this (after running the full model as usual):
```{R, eval=FALSE}
dds_lrt_time = DESeq(dds, test="LRT", reduced = ~ treatment + time)
```
The we can pull out genes that have similar patterns using something like this:
```{R, eval=FALSE}
geneclusters = degPatterns(log2normCounts, metadata = meta, time="time", col="treatment")
```
## More resources
The [DESeq2 Vignette](https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) has a lot of information.
[This video](https://www.youtube.com/watch?v=OPt51qc5YJ8) has a good discussion of different models, including interaction terms.
More information on the Wald Statistic is [here](https://medium.com/@analyttica/understanding-wald-test-2e3fa7723516).
More information on analyzing time series expression data is [here](https://hbctraining.github.io/DGE_workshop_salmon_online/lessons/08b_time_course_analyses.html).