-
Notifications
You must be signed in to change notification settings - Fork 2
/
PCA_on_trials.Rmd
219 lines (163 loc) · 6.31 KB
/
PCA_on_trials.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
---
title: "PCA on the 4SU Trial Data"
author: "Anna-Leigh Brown"
date: "18/11/2021"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
if (!require("pacman")){
install.packages("pacman")
} #if pacman is not installed, install it
pacman::p_load(tidyverse, data.table, ggplot2,janitor,ggpubr,ggrepel,
devtools,readxl) #pacman will now install and load these packages if you don't have them
if (!require("annotables")) {
devtools::install_github("stephenturner/annotables")
}
```
The chunk above is code. But this is just a normal text document.
Let's load in the file:
I'm using a function called fread from the data.table package to read in the data.
I do this cause it reads data faster, it's not relevant for a smaller table like
this one is, but can be handy once we get around to larger datasets.
```{r}
bdnf_counts <- read_excel("C:/Users/mlomb/OneDrive/Desktop/MRes project/bdnf_full_counts (1).xlsx")
bdnf_counts <- bdnf_counts %>% filter(ensgene != "ENSG00000251562")
```
As before we're going to put the "ensgene" into a 'rowname' and then
do the PCA analysis.
Now I'm going to write this using the pipe operator "%>%".
Your "hot key" to type that in is: Ctrl + Shift + M
This command:
```{r}
pca_piped <- bdnf_counts %>%
column_to_rownames('ensgene') %>%
t() %>%
prcomp(.,scale = TRUE)
```
is exactly equivalent to this:
```{r}
bdnf_counts <- column_to_rownames(bdnf_counts, 'ensgene')
pca <- prcomp(t(bdnf_counts), scale=TRUE)
```
I will prove it to you. Double equal sign means 'does this equal this'. For example
```{r}
y <- 42
y == 42
x <- 6
y == x * 7
```
So is the rotation for both PCA objects the same?
```{r}
pca$rotation == pca_piped$rotation
```
Moving back. We're going to use this tool instead because it has more whistles.
PCAtools
https://bioconductor.org/packages/release/bioc/html/PCAtools.html
You would install it from a source called Bioconductor - which is a collection
of R packages for biological analysis. Uncomment if you want to run
This is the vignette for the package
https://bioconductor.org/packages/release/bioc/vignettes/PCAtools/inst/doc/PCAtools.html
```{r}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("PCAtools")
```
So let's do a PCA plot using meta data. You'll find formatted meta data in the drive
```{r}
path_to_meta_you_downloaded <- "C:/Users/mlomb/Downloads/trial_meta - trial_meta.csv.csv" #update this
meta_df <- fread(path_to_meta_you_downloaded)
meta_df <- meta_df %>%
column_to_rownames('sample')
```
```{r}
library(PCAtools)
p <- PCAtools::pca(bdnf_counts, metadata = meta_df, removeVar = 0.1)
```
What does the removeVar argument do? Why did I include it?
As before the screeplot
```{r}
screeplot(p, axisLabSize = 18, titleLabSize = 22)
```
Now we're going to make what is called an "eigencorplot"
This is going to correlate our metadata variables with our principle components.
Star means "significant". What kind of correlation is it? Check out the help documentation
by entering `?PCAtools::eigencorplot` in the console
```{r}
PCAtools::eigencorplot(p,components = getComponents(p, seq_len(4)),
metavars = c('labelling_time',"cond"))
```
Ohh my, PC1 (our most important variable) really is the labelling time.
We can do a biplot with the 2 PC's that are correlating - e.g. labelling and
PC5 which is the effect of BDNF
```{r}
biplot(p,
x = "PC1",
y = "PC2",
colby = 'cond')
```
BDNF's effect seems to be really tightly on the first 2 hours of BDNF + 4SU.
Let's go ahead and find out what the genes are that are correlated with PC1 - eg
what's happening at 6 hours labelling time?
Let's directly take the loadings of each of the genes (a PCA is a linear combination).
I'm going to also show you now how to put the symbol back on the table.
We shall use the wonder of the `left_join`
```{r}
pc1_gene_loadings = p$loadings %>%
dplyr::select(PC1) %>%
rownames_to_column('ensgene') %>%
left_join(annotables::grch38 %>% dplyr::select(ensgene,symbol)) %>%
#https://statisticsglobe.com/r-dplyr-join-inner-left-right-full-semi-anti
arrange(-PC1)
```
Now we've seen a join let's do the same thing with our original count table
```{r}
bdnf_counts = bdnf_counts %>%
rownames_to_column('ensgene') %>%
left_join(annotables::grch38 %>% dplyr::select(ensgene,symbol))
```
I want to look at the expression of the top 15 genes on PC1 so I'm going
to plot their expression.
Practice removing the pipes and see what every line does
```{r}
bdnf_counts %>%
left_join(pc1_gene_loadings %>% head(15)) %>%
filter(!is.na(PC1)) %>%
select(-ensgene) %>%
melt(id.vars = c("symbol","PC1")) %>%
mutate(symbol = fct_reorder(symbol,-PC1)) %>%
separate(variable, into = c("condition","labelling_time")) %>%
ggplot(aes(x = labelling_time, y = value, fill = condition)) +
geom_col(position = "dodge2") +
facet_wrap(~symbol)
```
4su labelling is overwhelming the pc1 so i do analysis without those 2 timepoints, removing BDNF_6h, CONTROL_6h
```{r}
bdnf_counts <- select(bdnf_counts, -c(BDNF_6h, CONTROL_6h))
meta_df <- meta_df %>% filter(labelling_time != 6)
bdnf_counts <- select(bdnf_counts, -(symbol)) %>%
unique() %>%
remove_rownames() %>%
column_to_rownames("ensgene")
```
what genes account for PC3 w/o BDNF6h, Control6h
```{r}
pc3_gene_loadings = p$loadings %>%
dplyr::select(PC3) %>%
rownames_to_column('ensgene') %>%
left_join(annotables::grch38 %>% dplyr::select(ensgene,symbol)) %>%
#https://statisticsglobe.com/r-dplyr-join-inner-left-right-full-semi-anti
arrange(-PC3)
bdnf_counts %>%
rownames_to_column("ensgene") %>%
left_join(pc3_gene_loadings %>% head(15)) %>%
filter(!is.na(PC3)) %>%
select(-ensgene) %>%
melt(id.vars = c("symbol","PC3")) %>%
mutate(symbol = fct_reorder(symbol,-PC3)) %>%
separate(variable, into = c("condition","BDNF")) %>%
ggplot(aes(x = BDNF, y = value, fill = condition)) +
geom_col(position = "dodge2") +
facet_wrap(~symbol)
```