-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path09-CoexpressionNetworks.Rmd
500 lines (323 loc) · 15.6 KB
/
09-CoexpressionNetworks.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
<!-- Maybe add the control in though you still need binary (or numeric???) traits or time series; consider adding in t cells -->
<!-- Something on time series-->
<!-- Add in gene plots such as those shown here: https://bioinformaticsworkbook.org/dataAnalysis/RNA-Seq/RNA-SeqIntro/wgcna.html#gsc.tab=0-->
<!-- maybe add in a bioinformagician explatory video -->
# Coexpression Networks
Coexpression networking analysis looks at genes with similar expression patterns across treatments. These genes are likely all controlled together and comprise a gene expression "module".
We can correlate these modules with phenotypic traits or treatments. We can then find out what processes and pathways interesting modules are enriched for.
We can also identify driver genes that are critical for interesting modules and for the trait of interest.
We will use WGCNA (weighted gene co-expression network analysis) in R. This creates scale-free networks of gene expression patterns (networks with a lot of genes with low connectivity and a few genes ("hubs") with very high connectivity).
When doing this type of analysis, there are a few things to keep in mind.
+ You should have at least 20 samples
+ This is an unsupervised methods: you don't do differential expression first
+ Remove genes with low counts
+ Start with normalized read counts
## Preparing the environment
Work in a screen
```
screen -S coexpr
```
Activate the environment
```
conda activate vis
```
Open up R
```
R
```
Load libraries
```
library(DESeq2)
library(WGCNA)
library(CorLevelPlot)
library(RColorBrewer)
```
Allow parallel processing
```
allowWGCNAThreads(nThreads=8)
```
## Read in data
The data includes raw gene counts from https://www.nature.com/articles/s41586-020-03148-w#MOESM4. The data is bulk RNA-seq from alveolar macrophages. I grabbed patients with COVID-19 (72) and those with other pneumonias (102).
It has been transposed so that the genes are the column and the rows are the samples.
```
expr = read.table("/home/data/de-2403/wgcna/covidxpneumonia.csv")
```
## Check Samples and Genes
The goodSamplesGenes function from WGCNA finds samples and genes with a high number of missing entries and genes with zero variance. It will iterate if necessary.
```
gsg = goodSamplesGenes(expr)
summary(gsg)
gsg$allOK
table(gsg$goodSamples)
table(gsg$goodGenes)
```
All the samples are good. 40,374 genes are good.
Let's get the good genes.
```
exprgood = expr[,gsg$goodGenes == TRUE]
```
<!-- We should check for batch effects and correct them -->
## Sample outliers
We'll look for outliers in a PCA and a hierarchal tree.
```
pdf("outliers.pdf", width=10, height=10)
#pca
pca = prcomp(exprgood)
pca.var = pca$sdev^2
pca.var.percent = round(pca.var/sum(pca.var)*100, digits = 2)
pca.dat = as.data.frame(pca$x)
ggplot(pca.dat, aes(PC1, PC2)) +
geom_point() +
geom_text(label = rownames(pca.dat)) +
labs(x = paste0('PC1: ', pca.var.percent[1], ' %'),
y = paste0('PC2: ', pca.var.percent[2], ' %'))
# tree
mytree = hclust(dist(exprgood), method = "average")
plot(mytree)
dev.off()
```
![](./Figures/outliers.pdf){width=100%}
Let's exclude 304011161 and 304020336. Both are "Other Pneumonia" bringing our count of patients in that category down to 100.
```
remove = c('304011161', '304020336')
exprfinal = exprgood[!(rownames(exprgood) %in% remove),]
```
## Get the phenotype data
Read in the metadata table.
```
meta = read.table("/home/data/de-2403/wgcna/metadata_covidxpneumonia.csv",sep=",", header=TRUE, row.names=1)
```
Remove the two problematic samples.
```
metafinal = meta[!(rownames(meta) %in% remove),]
```
Double-check that the samples are in the same order in the expression and the metadata.
```
all(rownames(metafinal) == rownames(exprfinal))
```
## Filter and normalize genes
We will create a DESeq2 dataset so that we can use some of the DESeq2 functions. Since we are not doing differential expression, we will not specify a model. We have to transpose the expression data for DESeq2 to have samples as columns and genes as rows.
```
dds = DESeqDataSetFromMatrix(countData = t(exprfinal),
colData = metafinal,
design = ~ 1)
```
We will futher filter the genes to keep only genes that have at least 10 reads in 40% of the samples (70 samples). This allows us to look at genes that are not expressed or have really low expression in one condition but not in the other.
```
ddsFilt = dds[rowSums(counts(dds) >= 10) >= 70,]
nrow(ddsFilt)
```
We will do VST (Variance Stabilizing Transformation) normalization using a DESeq2 function.
```
ddsFiltNorm = vst(ddsFilt)
```
Get the normalized counts and transform them back into the WGCNA format of samples as rows and genes as columns.
```
ncnt = t(assay(ddsFiltNorm))
```
## Soft-threshold
Choose some powers to test. We'll use a soft threshold, which emphasizes (weights) the strong correlations.
```
power = c(c(1:10), seq(from = 12, to = 30, by = 2))
```
Run them to see which power best models a scale-free network. We'll be using "signed" networks which take into account whether gene expression is going up or down.
<!-- Negatively correlated genes could be split into a different module and if you want to pull them back together you might consider also doing an unsigned network. -->
```
sft = pickSoftThreshold(ncnt,
powerVector = power,
networkType = "signed",
verbose = 5)
```
Plot them.
```
pdf("softThreshold.pdf")
ggplot(sft$fitIndices, aes(x=Power, y=SFT.R.sq, label = Power)) +
geom_point() +
geom_text(nudge_x = -0.2) +
geom_hline(yintercept = 0.8, color = 'red') +
labs(x = "Soft Threshold (Power)", y = "Scale-Free Topology Model Fit (Signed) (R^2)") +
ggtitle("Scale Independence")
ggplot(sft$fitIndices, aes(x=Power, y=mean.k., label = Power)) +
geom_point() +
geom_text(nudge_x = -0.2) +
labs(x = 'Soft Threshold (Power)', y = 'Mean Connectivity') +
ggtitle("Mean Connectivity")
dev.off()
```
![](./Figures/softThreshold.pdf){width=100%}
Manually choose a threshold. You want to have a high R2 for fitting the scale-free model while not letting the connectivity drop too low.
```
soft = 10
```
## Create Network
<!-- You might need to convert the normalized counts to numeric but ours already are. -->
We need to do a correlation. There are 2 different "cor" functions, one in base R and one in WGCNA. We want to make sure it is using the latter so we'll stash the base R function in the temp_cor variable and make sure that "cor" points to the WGCNA version.
```
temp_cor = cor
cor = WGCNA::cor
```
We will use blockwiseModules to compute several steps at once. It allows for splitting the analysis into blocks consisting of clusters of genes that are closely related. This greatly reduces memory requirements while still giving a good approximation of what you would get if you didn't split it into blocks. It is recommended that you run it in a single block if possible. If you don't have enough memory run it in the fewest blocks possible.
More information on blockwiseModules can be found here:
https://peterlangfelder.com/2018/11/25/blockwise-network-analysis-of-large-data/
We'll try to run it in a single block.
blockwiseModules runs several steps, including:
Creates an adjancency matrix
Converts the matrix into a topological overlap matrix (TOM)
Creates a dendogram
Identifies modules
Merges modules
```
bwnet = blockwiseModules(ncnt,
maxBlockSize = 12000,
TOMType = "signed",
power = soft,
mergeCutHeight = 0.25,
numericLabels = FALSE,
randomSeed = 42,
verbose = 3)
```
Replace the base R cor back to its original name.
```
cor = temp_cor
```
## Get the module eigengenes
Module eigengenes are essentially a hypothetical central, representative expression pattern for each module.
<!-- they are actually defined as the first principal component of the expression matrix of each module -->
module_eigengenes = bwnet$MEs
We can get number of genes for each module. Note that the grey module has all the genes that aren't connected to a module.
```
table(bwnet$colors)
```
Plot the dendogram and the modules, both before and after merging.
```
pdf("dendrogram.pdf")
plotDendroAndColors(bwnet$dendrograms[[1]],
cbind(bwnet$unmergedColors, bwnet$colors),
c("unmerged", "merged"),
dendroLabels = FALSE,
addGuide = TRUE,
hang= 0.03,
guideHang = 0.05)
dev.off()
```
![](./Figures/dendrogram.pdf){width=100%}
## Compare modules to phenotypes
Get sample and gene counts
```
nSamples = nrow(ncnt)
nGenes = ncol(ncnt)
```
Perform a correlation
<!-- is the p for pearson? or pairwise complete obs?) -->
```
module.trait.corr = cor(module_eigengenes, metafinal, use = 'p')
module.trait.corr.pvals = corPvalueStudent(module.trait.corr, nSamples)
```
First, merge the eigengenes with the metadata. We'll need to create row names from column 1 then delete column 1. We'll also delete columns 8 (grey module), 9 (patient id), and 14 (another id that is mostly missing).
```
heatmap.data = merge(module_eigengenes, metafinal, by = 'row.names')
rownames(heatmap.data) = heatmap.data$Row.names
heatmap.data = heatmap.data[,c(-1,-8,-9,-14)]
```
Plot a heatmap showing the correlation between the metadata and the modules. We'll exclude the grey module since it is just a collection of unconnected genes and isn't really a module (column 7). The other modules are columns 1-6 and metadata is 8-13. We will focus on diagnosis (column 11), which is either "COVID-19" or "Other pneumonia".
We have the following phenotypes:
Age
Sex
Diagnosis
Days from first intubation
Superinfection
We'll need to convert the phenotypes to numeric, binary data. Age and days from first intubation are already numeric. The remaining variables are all binary and we can see their categories with the table function. Note that superinfection has a lot of missing data since it is only filled in for COVID-19 cases.
```
table(heatmap.data$sex)
table(heatmap.data$diagnosis)
table(heatmap.data$superinfection)
```
Let's change the binary categories to numbers.
```
heatmap.data$sex = gsub('Male', '0', heatmap.data$sex)
heatmap.data$sex = gsub('Female', '1', heatmap.data$sex)
heatmap.data$sex = as.numeric(heatmap.data$sex)
heatmap.data$diagnosis = gsub('Other Pneumonia', '0', heatmap.data$diagnosis)
heatmap.data$diagnosis = gsub('COVID-19', '1', heatmap.data$diagnosis)
heatmap.data$diagnosis = as.numeric(heatmap.data$diagnosis)
heatmap.data$superinfection = gsub('Primary Only', '0', heatmap.data$superinfection)
heatmap.data$superinfection = gsub('Superinfection', '1', heatmap.data$superinfection)
heatmap.data$superinfection = as.numeric(heatmap.data$superinfection)
```
Age is a little trickier. We'll break that into people ages 65+ and under 65 given that people over 65 and older have an increased risk of severe COVID-19 disease. The age range is from 25.79535 to 120.10335 (!). I guess that age 120 is not impossible so we'll go with it. The next oldest is 87.18891 (which seems more reasonable).
```
heatmap.data$age[heatmap.data$age<65]=0
heatmap.data$age[heatmap.data$age>=65]=1
```
Check how many are 65+ (1) and how many are <65 (0).
```
table(heatmap.data$age)
```
days_from_first_intubation ranges from 0 to Inf (the next highest is 90). Let's make this a binary <=10 and >10.
```
heatmap.data$days_from_first_intubation[heatmap.data$days_from_first_intubatio<=10]=0
heatmap.data$days_from_first_intubation[heatmap.data$days_from_first_intubatio>10]=1
```
Check the numbers.
```
table(heatmap.data$days_from_first_intubation)
```
<!--I tried using binarizeCategoricalColumns but it didn't seem to keep things in order; maybe I was trying it on metafinal rather than heatmap.data -->
```
pdf("heatmap.pdf")
CorLevelPlot(heatmap.data,
x = names(heatmap.data)[7:11],
y = names(heatmap.data)[1:6],
col = brewer.pal(n = 7, name = "RdBu"),
cexLabX = 0.5)
dev.off()
```
![](./Figures/heatmap.pdf){width=100%}
Let's focus on the green module, which is significantly increased in COVID-19 compared to other pneumonias and the turquoise module which is significantly decreased in COVID-19 compared to other pneumonias.
```
green = row.names(as.data.frame(bwnet$color[bwnet$color=="green"]))
write.table(green, file="green.txt", sep="\n", row.names=FALSE, col.names=FALSE, quote=FALSE)
turquoise = row.names(as.data.frame(bwnet$color[bwnet$color=="turquoise"]))
write.table(turquoise, file="turquoise.txt", sep="\n", row.names=FALSE, col.names=FALSE, quote=FALSE)
```
Download the files and put the gene lists into cytoscape and look at the KEGG and GO pathways.
<!-- KEGG is viral recognition for green-->
<!-- lots of immune associate genes -->
## Driver genes
<!-- all the datasets we are comparing have samples as rows and they are in the same order -->
#Module Membership#
This will find genes that are highly connected to interesting modules. These hub genes are likely driving the trait response.
We will calculate the module memership or intramodular connectivity by finding the correlation (and p-values) of the eigengene of each module to each gene's expression profile.
```
membership = cor(module_eigengenes, ncnt, use = 'p')
membership.pvals = corPvalueStudent(membership, nSamples)
```
Let's find some of the drivers for the green module. We'll pull out the row for the green module, change the column name to "pvalue", the sort it by the most significant to least significant and grab the top six genes.
```
green.mod = as.data.frame(membership.pvals[rownames(membership.pvals) %in% "MEgreen",])
colnames(green.mod)="pvalue"
head(green.mod[order(green.mod$pvalue),,drop=FALSE])
```
Let's look at the first two genes. There are lots of ways to do this but let's see if we can quickly find some functional information. We will google it then click on the ensembl link. We'll also look at the genecard for the gene.
ENSG00000157601 is MX1 (MX dynamin like GTPase 1). It is part of the antiviral response and is induced by interferons. It interfers with viral replication.
ENSG00000111335 is OAS2 (2'-5'-oligoadenylate synthetase 2). It is involved in the innate immune response to viruses. It is induced by interferons and activates latent RNAase L to degrade virus and prevent replication.
#Gene Significance#
This will identify genes that are significantly associated with the trait of interest.
Calculate the gene significance for the diagnosis trait and their associated p-values. We'll use the diagnosis column from metafinal rather than heatmap.data because the sample (row) order in metafinal is the same as that in ncnts, which we'll be comparing it to. We'll need to make it numeric.
```
metafinal$diagnosis = gsub('Other Pneumonia', '0', metafinal$diagnosis)
metafinal$diagnosis = gsub('COVID-19', '1', metafinal$diagnosis)
metafinal$diagnosis = as.numeric(metafinal$diagnosis)
gene.signf <- cor(ncnt, metafinal$diagnosis, use = 'p')
gene.signf.pvals <- as.data.frame(corPvalueStudent(gene.signf, nSamples))
```
Take a look at the genes sorted starting with the most significant. These genes have a high impact on differentiating between COVID-19 and other pneumonias (diagnosis trait).
```
head(gene.signf.pvals[order(gene.signf.pvals$V1),,drop = FALSE])
```
Again, let's look at the first 2 genes.
ENSG00000165949 is IFI27 (interferon alpha inducible protein 27). It is induced by interferon and regulates transcription. It is known to be involved in defense respons and apoptosis pathways.
ENSG00000132938 is MTUS2 (microtubule associated scaffold protein 2). This gene triggers microtuble binding and protein homodimerization. Located with the centrosomes and cytoplasmic microtubules.
Bonus exercises:
1) Look at modules important to some of the other phenotypes.
2) Do differential expression on the COVID-19 dataset.