Speed up readVcf
#59
Please see the response in your previous issue, especially …
Hi @mtmorgan, thank you for the response. From my understanding, …

My suggestion is to speed up `readVcf` …

As for the other suggestion with specifying specific fields (…), …

I'll check out …
I do think that your objectives can be met without changes to VariantAnnotation by making use of available parallelism and divide-and-conquer. As I noted, you have a bottleneck with the `cbind` on thin data.tables, and you should try to rectify that in `vcf2df` (a minimal illustration follows below). I will demonstrate scalability in ways that I think reflect Martin's suggestion, soon.
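A minimal illustration of that `cbind` bottleneck (invented data and sizes; not the actual `vcf2df` internals):

```r
## Growing a table by repeatedly binding thin columns copies the
## accumulated result at every step; building once from a list does not.
library(data.table)

cols <- replicate(300, rnorm(1e4), simplify = FALSE)
names(cols) <- paste0("V", seq_along(cols))

## slow: roughly O(n^2) copying via repeated cbind
system.time({
    dt1 <- data.table(V1 = cols[[1]])
    for (nm in names(cols)[-1])
        dt1 <- cbind(dt1, setNames(data.table(cols[[nm]]), nm))
})

## fast: construct the data.table once from the full list of columns
system.time(dt2 <- as.data.table(cols))
```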
arrgh, I moved my comment to the issue I intended to comment on.
@bschilder I hope Martin's examples at #57 (comment) are sufficient to get you to a better place. If more details are needed about the specific use case of vcf->df, please say where the gaps are.
Hi @mtmorgan, thank you so much for these really helpful examples! I've tried implementing a version of this but in my case, I actually found that multi-threading increased compute time somehow (from <2 sec to >1 min). Am I doing something wrong below?

**Single- vs. multi-threaded**

```r
path <- "https://gwas.mrcieu.ac.uk/files/ieu-a-298/ieu-a-298.vcf.gz"
## Using empty param here simply for demo purposes
param <- VariantAnnotation::ScanVcfParam()
#### Single-threaded ####
system.time({
vcf <- VariantAnnotation::readVcf(file = path,
param = param)
})
## Takes 1.6 seconds
#### Multi-threaded (11 cores) ####
BiocParallel::register(
BiocParallel::SnowParam(workers = 11,
progressbar = TRUE)
)
vcf_file <- VariantAnnotation::VcfFile(file = path,
index = paste0(path,".tbi"))
## Tile ranges across the genome
tiles <-
GenomicRanges::seqinfo(vcf_file) |>
GenomeInfoDb::keepSeqlevels(as.character(1:22)) |>
GenomicRanges::tileGenome(cut.last.tile.in.chrom = TRUE,
tilewidth = 1e7L)
## Create mapping function
MAP <- function(range, file) {
param <- VariantAnnotation::ScanVcfParam(which = range)
vcf <- VariantAnnotation::readVcf(file = file,
param = param, genome="HG19/GRCh37")
nrow(vcf)
}
## Parallelised query
system.time({
vcf <- GenomicFiles::reduceByRange(ranges = tiles,
files = path,
MAP = MAP,
REDUCE = `+`)
})
## Takes 1.2 minutes
```

**Pre-selecting fields**

I agree, my strategy of sampling the top N rows to determine which columns are likely empty is not ideal. In an ideal world, VCF headers would only describe fields that are actually present (and populated with more than just NAs). But in the course of munging many different GWAS sumstats VCFs, we've found this is rarely the case (@Al-Murphy). Almost all of the sumstats VCFs from OpenGWAS contain 2-8 columns filled entirely with NAs (genome-wide!). Thus, I think for our use case, sampling the top 1000 rows is pretty predictive of whether a column is entirely empty (though I haven't yet gathered stats on this).
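For reference, a hedged sketch of that sampling strategy using `yieldSize` and `ScanVcfParam` (the all-NA test, chunk size, and genome string are assumptions, not the actual implementation):

```r
## Sketch: scan the first 1000 records, find INFO fields that are all NA
## there, then restrict the full read to the remaining fields.
library(VariantAnnotation)

vf <- VcfFile(path, yieldSize = 1000L)
open(vf)
head_vcf <- readVcf(vf, genome = "GRCh37")   # genome string assumed
close(vf)

all_na <- vapply(
    names(info(head_vcf)),
    function(nm) all(is.na(unlist(info(head_vcf)[[nm]]))),
    logical(1)
)
keep_info <- names(all_na)[!all_na]

## Only parse the informative INFO fields on the full pass
param <- ScanVcfParam(info = keep_info)
vcf <- readVcf(VcfFile(path), genome = "GRCh37", param = param)
```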
There are three things going on. …
Thanks again for all the very detailed and helpful explanations, @mtmorgan. I've tried a number of ways to improve …
It's worth noting that `vcfR::read.vcfR()` reads the same file considerably faster than `VariantAnnotation::readVcf()`:

```r
URL <- "https://gwas.mrcieu.ac.uk/files/ieu-a-298/ieu-a-298.vcf.gz"
utils::download.file(URL, basename(URL))
utils::download.file(paste0(URL,".tbi"), paste0(basename(URL),".tbi"))
path <- basename(URL)
res <- microbenchmark::microbenchmark(
VariantAnnotation = {
vcf1 <- VariantAnnotation::readVcf(path)
},
vcfR = {
vcf2 <- vcfR::read.vcfR(path)
},
times = 1
)
```

This speed difference is even greater with larger VCFs (11 million variants). In this case:

```r
URL <- "https://gwas.mrcieu.ac.uk/files/ubm-a-2929/ubm-a-2929.vcf.gz"
```

Maybe the author of … If for whatever reason …
For the small file with …

For the large file with …
I know that for VariantAnnotation, times can be strongly influenced by R's garbage collector, especially if it runs frequently as R consumes more memory, e.g., when first allocating a large number of character vectors. As far as I can tell the files are parsed differently, e.g., …
I thought I could …
It looks like the file is only partially parsed, so perhaps there is additional time spent in extraction?
Some of the IDs appear to be duplicated...
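One quick way to check for duplicates in each representation (the names `vcf1` and `vcf2` are assumed from the benchmark above):

```r
## rownames of the VCF object / the ID column of the vcfR fix region
sum(duplicated(names(vcf1)))
sum(duplicated(vcfR::getID(vcf2)))
```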
I did a little bit of work on this today. Here is the basic code: …
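In that spirit, a minimal per-chromosome divide-and-conquer sketch (illustrative only; it assumes a GRCh37 build, a tabix-indexed local file with contig lengths in the header, and 4 workers, and is not necessarily the code referred to above):

```r
## Read each autosome in its own worker, then combine the results.
library(VariantAnnotation)
library(BiocParallel)

read_chrom <- function(chrom, path) {
    vf <- VcfFile(path, index = paste0(path, ".tbi"))
    ## whole-chromosome range, built from the header's contig lengths
    rng <- as(GenomeInfoDb::keepSeqlevels(seqinfo(vf), chrom), "GRanges")
    readVcf(vf, genome = "GRCh37", param = ScanVcfParam(which = rng))
}

vcfs <- bplapply(as.character(1:22), read_chrom, path = path,
                 BPPARAM = MulticoreParam(workers = 4))
vcf <- do.call(rbind, vcfs)   # combine per-chromosome VCF objects
```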
In summary, I personally have no basis for going into `readVcf` and trying to speed it up. It may be possible, but given that the original authors have left the project and there is a reasonable path to divide and conquer with the ingestion process, I am not inclined to do much more.
I've noticed this as well; R seems to quickly get clogged up after calling …

Duly noted about …

@vjcitn As far as reducing, …

**Download VCF**

```r
URL <- "https://gwas.mrcieu.ac.uk/files/ieu-a-298/ieu-a-298.vcf.gz"
utils::download.file(URL, basename(URL))
utils::download.file(paste0(URL,".tbi"), paste0(basename(URL),".tbi"))
path <- basename(URL)
```
Hi @bschilder, VCF data can be seen as consisting of a meta region, a fixed region, and a gt region. An example of the gt region is as follows.
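(An invented illustration with two samples; real files are typically much wider:)

```
FORMAT      sample_1       sample_2
GT:DP:GQ    0/1:35:99      1/1:42:99
GT:DP:GQ    0/0:28:96      0/1:31:90
```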
This is a tab-delimited, variant (row) by sample (column) matrix of data. The first column specifies the format for the subsequent sample columns. The value of each cell is further delimited by colons. In … I am less familiar with … (a sketch of pulling fields out of the gt region follows below). Hope that helps!
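For instance, individual fields can be pulled out of the gt region with `vcfR::extract.gt()` (a hedged sketch; `vcf2` is assumed to be the object from the earlier `read.vcfR()` call):

```r
## Per-sample matrices: genotypes as character, depth coerced to numeric
gt <- vcfR::extract.gt(vcf2, element = "GT")
dp <- vcfR::extract.gt(vcf2, element = "DP", as.numeric = TRUE)
```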
The garbage collector problems I mentioned are actually at the other end of things, when R is growing its memory use in response to successive allocation requests; you can see the garbage collector in action by placing …
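Presumably something along the lines of base R's `gcinfo()` (an assumption), e.g.:

```r
## Print a message at each garbage collection while the file is parsed
gcinfo(TRUE)
vcf <- VariantAnnotation::readVcf(path)
gcinfo(FALSE)
```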
I also ran this again and walked away; my computer went to low-power mode and it took 640s, so timings, especially on personal computers, are very susceptible to other processes. R memory management is outside the scope of VariantAnnotation. For what it's worth, I ran the no-op …
so times for the extra work done by …

For @vjcitn, …
To summarize, some degree of performance improvement with large VCF files (e.g., millions of records) can be obtained with …

There will not be effort in the near term to improve the performance of VCF parsing into VCF objects.
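One such divide-and-conquer pattern is chunked iteration with `yieldSize` (a sketch; the chunk size and genome string are assumptions):

```r
## Stream the file in 100,000-record chunks instead of one big read
library(VariantAnnotation)

vf <- VcfFile(path, yieldSize = 100000L)
open(vf)
total <- 0L
while (nrow(chunk <- readVcf(vf, genome = "GRCh37"))) {
    total <- total + nrow(chunk)   # replace with real per-chunk processing
}
close(vf)
total
```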
I've gone ahead and implemented this functionality into … We will be pushing these changes to Bioconductor shortly. @Al-Murphy
Splitting off the issue originally raised here about improving the efficiency of `readVcf`.

@lawremi suggested using chunking as is currently implemented in `writeVcf` here. I imagine this could be further improved through parallelisation of that chunking procedure. This should be quite analogous to implement in `readVcf`.

Following up on the discussion here, would this be something that @vjcitn, @mtmorgan, @vobencha, @lawremi, or one of the other `VariantAnnotation` co-authors would be able to implement?

Many thanks in advance,
Brian