analysis.Rmd

---
title: "FormatSumstats"
author: "Alan Murphy"
date: "Most recent update:<br> `r Sys.Date()`"
output: 
  rmarkdown::html_document: 
    theme: spacelab
    highlight: zenburn 
    code_folding: show 
    toc: true 
    toc_float: true
    smooth_scroll: true
    number_sections: false 
    self_contained: true 
editor_options: 
  chunk_output_type: inline
---

```{r setup, include=T, message=F}
#root.dir <- here::here()
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  #root.dir = root.dir
  fig.height = 6,
  fig.width = 7.00787 #178mm
)  
knitr::opts_knit$set(#root.dir = root.dir, 
                     dpi = 350)  

library(data.table)
library(ggplot2)
library(cowplot)
library(stringr)
```

## Background

SumstatsFormats builds upon the work of the [gwas-download](https://github.com/mikegloudemans/gwas-download) repository which created a script to download publicly available GWAS summary statistics from over 200 different publications. If you want to run the code yourself please first run the download script from this repository and set your working directory to this repository's folder.

The aim of this analysis is to give insight into the most frequent format of summary statistic files from GWAS. If you want to run the code yourself please first run the download script from [gwas-download](https://github.com/mikegloudemans/gwas-download) and set your working directory to this repository's folder.

## Acquire GWAS Summary Statistic formats

*NOTE* this section will only run if you have ran the gwas-download script and set it's folder as the working directory. I have saved the results of this section in this repository so the subsequent analysis can still be run.

Load required packages for analysis
```{r}

```

First we need to collect all the summary statistic file headers from the 200+ GWAS. Note that each GWAS has more than one associated summary statistic file.

```
files <- list.files(path = "download", full.names = TRUE, recursive = TRUE)
print(head(files))
```
To get all the summary statistic file headers we need to inspect

 * gunzipped files (.gz)
 * zipped files (.zip)
 * text files (.txt)

Each of these will be dealt with separately 
```
#gunzipped files
files_gz <- files[grep(".gz$", files)]

headers_gz <- vector(mode="character",length=length(files_gz))
names(headers_gz) <- files_gz

# Unzip the file into the dir
for(file_i in files_gz){
  #only add header if one is found
  if(length(readLines(file_i,n=1))!=0)
    headers_zip[[file_i]] <- readLines(file_i,n=1)#n=number of lines to read
}
#remove empties
headers_gz <- headers_gz[headers_gz!=""]


#zipped files
files_zip <- files[grep(".zip$", files_zip)]
zip_headers <- c()
count1 <- 1
# Unzip the file into the dir
for(file_i in files_zip){
  breaker<-FALSE
  #some don't want to open, if this happens skip them
  possibleError <- tryCatch(unzip(file_i,list=TRUE),error=function(e) e)
  if(inherits(possibleError, "error"))
    breaker=TRUE
  if(isFALSE(breaker)){
    items_file_i<-unzip(file_i,list=TRUE)
    #remove READMEs
    zip_files_i <-
      items_file_i$Name[!grepl("README", items_file_i$Name, fixed = TRUE)]
    counting <- 1
    for(zip_i in zip_files_i){
      possibleError2 <- tryCatch(readLines(unzip(file_i,files=zip_i),n=1),
                                 error=function(e) e)
      if(!inherits(possibleError2, "error")){
        a<-readLines(unzip(file_i,files=zip_i),n=1)
        if(a=="%PDF-1.4"||a==""){#not correct file
          #do nothing
          counting <- counting+1
        }
        else{
          zip_headers <- c(zip_headers,a)
          names(zip_headers)[count1] <- paste0(file_i,counting)
          counting <- counting+1
          count1 <- count1+1
        }
      }
    }
  }
}


#text files
files_txt <- files[grep(".txt$", files)]
#remove READMEs anmd other odd formats that will wreck downstream analysis
files_txt <- files_txt[!grepl("README", files_txt, fixed = TRUE)]
files_txt <- files_txt[!grepl("Readme", files_txt, fixed = TRUE)]

headers_txt <- vector(mode="character",length=length(files_txt))
names(headers_txt) <- files_txt

# Read the file
for(file_i in files_txt){
  #only add header if one is found
  if(length(readLines(file_i,n=1))!=0)
    headers_txt[[file_i]] <- readLines(file_i,n=1)#n=number of lines to read
}
#remove empties
headers_txt <- headers_txt[headers_txt!=""]

headers_combined <- c(headers_txt,headers_gz,zip_headers)
#change directory to SumstatFormat
save(headers_combined, file="data/SumstatHeaders.rda")
```

## Analysis GWAS Summary Statistic formats

Now that the headers are derived for each file type, we can combine then and check how many unique headers there are:

```{r}
load(file="data/SumstatHeaders.rda")
print(length(headers_combined))
```

```{r}
tbl_headers <- table(headers_combined) 

dt_headers <- data.table::data.table("header"=names(tbl_headers),
                                     "counts"=as.numeric(tbl_headers))
print(nrow(dt_headers))
```

There were *327* headers from the combined file types and *127* unique formats for the summary statistic files.

This shows the clear disparity across GWAS studies. We can plot and inspect the top 12 most common headers:

```{r}
top12 <- dt_headers[counts>=sort(dt_headers$counts,decreasing=TRUE)[12],]

print(sum(top12$counts)/sum(dt_headers$counts)) 
```

The top 12 headers account for ~ 47% of all summary statistic files. We can see these and their distribution by plotting:

```{r}
#wrap headers to make them readable
#issue with first entry and wrap text since it has "\" so using
top12[,plot_header:=str_wrap(iconv(enc2utf8(header),sub="byte"), 
                              width = 45)]
#Have to manually wrap following since there are no spaces: "markername,allele1,allele2,effect,stderr,p_value,direction,chr,bp" 
top12[header=="markername,allele1,allele2,effect,stderr,p_value,direction,chr,bp",
      plot_header:="markername,allele1,allele2,effect,\nstderr,p_value,direction,chr,bp"]
top12[header=="chromosome rs_id markername allele_1 allele_2\nFreq_Allele1_HapMapCEU effect stderr p_value\nN",
      plot_header:="chromosome rs_id markername allele_1 allele_2\nFreq_Allele1_HapMapCEU effect stderr p_value N"]
ggplot(data=top12,
       aes(x = reorder(plot_header, counts),y = counts, fill=counts))+
  geom_bar(stat="identity")+
  geom_text(aes(label = scales::percent((counts/sum(dt_headers$counts)))),
              hjust = 1.2,size=3.5,colour="lightgrey")+
  labs(y= "Number of GWAS with this header", x = "Summary Statistics Headers") +
  theme_cowplot()+
  theme(axis.text = element_text(size=8), axis.title = element_text(size=10),
          legend.text=element_text(size=8),legend.title=element_text(size=10))+
  scale_fill_continuous(high = "#132B43", low = "#56B1F7")+
  coord_flip()
```

We can then save this plot:

```{r Save plot}
#save as .pdf and .tiff
ggplot2::ggsave(filename = here::here("plots/SumstatsFormats.pdf"),
                height = 6,
                width = 7.00787, #178mm
                dpi = 350,
                units="in")
ggplot2::ggsave(filename = here::here("plots/SumstatsFormats.tiff"),
                height = 6,
                width = 7.00787, #178mm
                dpi = 350,
                units="in")
```


## Most common Sumstat formats - sample files

Again if you ran the gwas-download script, the following section of code will pull down an example summary statistics file from one of the studies for each of the 12 most common formats. This can be used to inspect and test integration methods: see [MungeSumstats](https://github.com/neurogenomics/MungeSumstats) for an R based solution to standardisation. 

```
top12_sumstats <- top12$header
#get a directory for each so we can get some rows of data
top12_sumstats_example <-
  lapply(top12_sumstats, function(x)
    names(headers_combined[headers_combined==x][1]))

#.zips have a count beside (all are .zip1 remove the 1)
top12_sumstats_example<-gsub(".zip1",".zip",top12_sumstats_example)
top12_sumstats_example<-gsub(".zip2",".zip",top12_sumstats_example)

names(top12_sumstats_example) <- top12_sumstats


#get 5 rows of data for each
top12_sumstats_example_data <- vector(mode="list",
                                      length=length(top12_sumstats_example))
names(top12_sumstats_example_data) <- as.character(top12_sumstats_example)
count <- 1

for(example_i in top12_sumstats_example){
  extension <-
    substr(example_i,nchar(example_i)-3,nchar(example_i))
  if(extension==".zip"){#if .zip
    items_file_i<-unzip(example_i,list=TRUE)
    if(length(grep(".gz$",items_file_i$Name[1]))!=0 ||
       length(grep(".csv$",items_file_i$Name[2]))!=0){ #can deal with .gz
      row_ind <- 1
      #take second if csv
      if(length(grep(".csv$",items_file_i$Name[2])))
        row_ind <- 2
      top12_sumstats_example_data[[count]]<-
        readLines(unzip(example_i,files=items_file_i[row_ind,1]),n=6)
    }
    else{#or .txt
      items_file_i<-items_file_i[grep(".txt$",items_file_i$Name),]
      top12_sumstats_example_data[[count]]<-
        readLines(unzip(example_i,files=items_file_i[1,1]),n=6)
    }
  }
  else{#.txt or if .gz
    top12_sumstats_example_data[[count]]<-readLines(example_i,n=6)
  }
  count <- count + 1
}

top12_sumstats_example_data
save(top12_sumstats_example_data,
     file="data/top12_sumstats_header_format_and_data.RData")
```

We can load and inspect these files:
```{r}
load(file="data/top12_sumstats_header_format_and_data.RData")

print(top12_sumstats_example_data[1])
```

## Future work

The SumstatFormats analysis highlights the disparity across summary statistic files, creating a barrier major to automated meta-analysis studies. There has been a push recently to standardise summary statistic files. For example, there is a group who have now manually standardised many GWAS: [R interface to the IEU GWAS database API • ieugwasr](https://mrcieu.github.io/ieugwasr/) and [gwasvcf](https://github.com/MRCIEU/gwasvcf) but because a lot of GWAS
remain closed access, these repositories are not all encompassing. Moreover, the use of VCF format for summary statistics seen increased use but older GWAS still need a method of intgration with these for meta-analysis. 

To address this issue, the bioconductor R package *MungeSumstats* was created to standardise the file format of summary statistic files, including VCF files. See the [MungeSumstats](https://github.com/neurogenomics/MungeSumstats) github page for more details.