-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathanalysis.Rmd
279 lines (229 loc) · 10.2 KB
/
analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
---
title: "FormatSumstats"
author: "Alan Murphy"
date: "Most recent update:<br> `r Sys.Date()`"
output:
rmarkdown::html_document:
theme: spacelab
highlight: zenburn
code_folding: show
toc: true
toc_float: true
smooth_scroll: true
number_sections: false
self_contained: true
editor_options:
chunk_output_type: inline
---
```{r setup, include=T, message=F}
#root.dir <- here::here()
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
#root.dir = root.dir
fig.height = 6,
fig.width = 7.00787 #178mm
)
knitr::opts_knit$set(#root.dir = root.dir,
dpi = 350)
library(data.table)
library(ggplot2)
library(cowplot)
library(stringr)
```
## Background
SumstatsFormats builds upon the work of the [gwas-download](https://github.com/mikegloudemans/gwas-download) repository which created a script to download publicly available GWAS summary statistics from over 200 different publications. If you want to run the code yourself please first run the download script from this repository and set your working directory to this repository's folder.
The aim of this analysis is to give insight into the most frequent format of summary statistic files from GWAS. If you want to run the code yourself please first run the download script from [gwas-download](https://github.com/mikegloudemans/gwas-download) and set your working directory to this repository's folder.
## Acquire GWAS Summary Statistic formats
*NOTE* this section will only run if you have ran the gwas-download script and set it's folder as the working directory. I have saved the results of this section in this repository so the subsequent analysis can still be run.
Load required packages for analysis
```{r}
```
First we need to collect all the summary statistic file headers from the 200+ GWAS. Note that each GWAS has more than one associated summary statistic file.
```
files <- list.files(path = "download", full.names = TRUE, recursive = TRUE)
print(head(files))
```
To get all the summary statistic file headers we need to inspect
* gunzipped files (.gz)
* zipped files (.zip)
* text files (.txt)
Each of these will be dealt with separately
```
#gunzipped files
files_gz <- files[grep(".gz$", files)]
headers_gz <- vector(mode="character",length=length(files_gz))
names(headers_gz) <- files_gz
# Unzip the file into the dir
for(file_i in files_gz){
#only add header if one is found
if(length(readLines(file_i,n=1))!=0)
headers_zip[[file_i]] <- readLines(file_i,n=1)#n=number of lines to read
}
#remove empties
headers_gz <- headers_gz[headers_gz!=""]
#zipped files
files_zip <- files[grep(".zip$", files_zip)]
zip_headers <- c()
count1 <- 1
# Unzip the file into the dir
for(file_i in files_zip){
breaker<-FALSE
#some don't want to open, if this happens skip them
possibleError <- tryCatch(unzip(file_i,list=TRUE),error=function(e) e)
if(inherits(possibleError, "error"))
breaker=TRUE
if(isFALSE(breaker)){
items_file_i<-unzip(file_i,list=TRUE)
#remove READMEs
zip_files_i <-
items_file_i$Name[!grepl("README", items_file_i$Name, fixed = TRUE)]
counting <- 1
for(zip_i in zip_files_i){
possibleError2 <- tryCatch(readLines(unzip(file_i,files=zip_i),n=1),
error=function(e) e)
if(!inherits(possibleError2, "error")){
a<-readLines(unzip(file_i,files=zip_i),n=1)
if(a=="%PDF-1.4"||a==""){#not correct file
#do nothing
counting <- counting+1
}
else{
zip_headers <- c(zip_headers,a)
names(zip_headers)[count1] <- paste0(file_i,counting)
counting <- counting+1
count1 <- count1+1
}
}
}
}
}
#text files
files_txt <- files[grep(".txt$", files)]
#remove READMEs anmd other odd formats that will wreck downstream analysis
files_txt <- files_txt[!grepl("README", files_txt, fixed = TRUE)]
files_txt <- files_txt[!grepl("Readme", files_txt, fixed = TRUE)]
headers_txt <- vector(mode="character",length=length(files_txt))
names(headers_txt) <- files_txt
# Read the file
for(file_i in files_txt){
#only add header if one is found
if(length(readLines(file_i,n=1))!=0)
headers_txt[[file_i]] <- readLines(file_i,n=1)#n=number of lines to read
}
#remove empties
headers_txt <- headers_txt[headers_txt!=""]
headers_combined <- c(headers_txt,headers_gz,zip_headers)
#change directory to SumstatFormat
save(headers_combined, file="data/SumstatHeaders.rda")
```
## Analysis GWAS Summary Statistic formats
Now that the headers are derived for each file type, we can combine then and check how many unique headers there are:
```{r}
load(file="data/SumstatHeaders.rda")
print(length(headers_combined))
```
```{r}
tbl_headers <- table(headers_combined)
dt_headers <- data.table::data.table("header"=names(tbl_headers),
"counts"=as.numeric(tbl_headers))
print(nrow(dt_headers))
```
There were *327* headers from the combined file types and *127* unique formats for the summary statistic files.
This shows the clear disparity across GWAS studies. We can plot and inspect the top 12 most common headers:
```{r}
top12 <- dt_headers[counts>=sort(dt_headers$counts,decreasing=TRUE)[12],]
print(sum(top12$counts)/sum(dt_headers$counts))
```
The top 12 headers account for ~ 47% of all summary statistic files. We can see these and their distribution by plotting:
```{r}
#wrap headers to make them readable
#issue with first entry and wrap text since it has "\" so using
top12[,plot_header:=str_wrap(iconv(enc2utf8(header),sub="byte"),
width = 45)]
#Have to manually wrap following since there are no spaces: "markername,allele1,allele2,effect,stderr,p_value,direction,chr,bp"
top12[header=="markername,allele1,allele2,effect,stderr,p_value,direction,chr,bp",
plot_header:="markername,allele1,allele2,effect,\nstderr,p_value,direction,chr,bp"]
top12[header=="chromosome rs_id markername allele_1 allele_2\nFreq_Allele1_HapMapCEU effect stderr p_value\nN",
plot_header:="chromosome rs_id markername allele_1 allele_2\nFreq_Allele1_HapMapCEU effect stderr p_value N"]
ggplot(data=top12,
aes(x = reorder(plot_header, counts),y = counts, fill=counts))+
geom_bar(stat="identity")+
geom_text(aes(label = scales::percent((counts/sum(dt_headers$counts)))),
hjust = 1.2,size=3.5,colour="lightgrey")+
labs(y= "Number of GWAS with this header", x = "Summary Statistics Headers") +
theme_cowplot()+
theme(axis.text = element_text(size=8), axis.title = element_text(size=10),
legend.text=element_text(size=8),legend.title=element_text(size=10))+
scale_fill_continuous(high = "#132B43", low = "#56B1F7")+
coord_flip()
```
We can then save this plot:
```{r Save plot}
#save as .pdf and .tiff
ggplot2::ggsave(filename = here::here("plots/SumstatsFormats.pdf"),
height = 6,
width = 7.00787, #178mm
dpi = 350,
units="in")
ggplot2::ggsave(filename = here::here("plots/SumstatsFormats.tiff"),
height = 6,
width = 7.00787, #178mm
dpi = 350,
units="in")
```
## Most common Sumstat formats - sample files
Again if you ran the gwas-download script, the following section of code will pull down an example summary statistics file from one of the studies for each of the 12 most common formats. This can be used to inspect and test integration methods: see [MungeSumstats](https://github.com/neurogenomics/MungeSumstats) for an R based solution to standardisation.
```
top12_sumstats <- top12$header
#get a directory for each so we can get some rows of data
top12_sumstats_example <-
lapply(top12_sumstats, function(x)
names(headers_combined[headers_combined==x][1]))
#.zips have a count beside (all are .zip1 remove the 1)
top12_sumstats_example<-gsub(".zip1",".zip",top12_sumstats_example)
top12_sumstats_example<-gsub(".zip2",".zip",top12_sumstats_example)
names(top12_sumstats_example) <- top12_sumstats
#get 5 rows of data for each
top12_sumstats_example_data <- vector(mode="list",
length=length(top12_sumstats_example))
names(top12_sumstats_example_data) <- as.character(top12_sumstats_example)
count <- 1
for(example_i in top12_sumstats_example){
extension <-
substr(example_i,nchar(example_i)-3,nchar(example_i))
if(extension==".zip"){#if .zip
items_file_i<-unzip(example_i,list=TRUE)
if(length(grep(".gz$",items_file_i$Name[1]))!=0 ||
length(grep(".csv$",items_file_i$Name[2]))!=0){ #can deal with .gz
row_ind <- 1
#take second if csv
if(length(grep(".csv$",items_file_i$Name[2])))
row_ind <- 2
top12_sumstats_example_data[[count]]<-
readLines(unzip(example_i,files=items_file_i[row_ind,1]),n=6)
}
else{#or .txt
items_file_i<-items_file_i[grep(".txt$",items_file_i$Name),]
top12_sumstats_example_data[[count]]<-
readLines(unzip(example_i,files=items_file_i[1,1]),n=6)
}
}
else{#.txt or if .gz
top12_sumstats_example_data[[count]]<-readLines(example_i,n=6)
}
count <- count + 1
}
top12_sumstats_example_data
save(top12_sumstats_example_data,
file="data/top12_sumstats_header_format_and_data.RData")
```
We can load and inspect these files:
```{r}
load(file="data/top12_sumstats_header_format_and_data.RData")
print(top12_sumstats_example_data[1])
```
## Future work
The SumstatFormats analysis highlights the disparity across summary statistic files, creating a barrier major to automated meta-analysis studies. There has been a push recently to standardise summary statistic files. For example, there is a group who have now manually standardised many GWAS: [R interface to the IEU GWAS database API • ieugwasr](https://mrcieu.github.io/ieugwasr/) and [gwasvcf](https://github.com/MRCIEU/gwasvcf) but because a lot of GWAS
remain closed access, these repositories are not all encompassing. Moreover, the use of VCF format for summary statistics seen increased use but older GWAS still need a method of intgration with these for meta-analysis.
To address this issue, the bioconductor R package *MungeSumstats* was created to standardise the file format of summary statistic files, including VCF files. See the [MungeSumstats](https://github.com/neurogenomics/MungeSumstats) github page for more details.