Reading in VCF's #6
Hi @Euphrasiologist, we did have that functionality, but it is gone with the rewrite to use File Backed Matrices. Have a look at the branch |
Now, a few thoughts on how vcf reading should work. |
Brilliant thanks for the info, I'll take a look and give it a whirl. |
Actually, you just inspired me to rescue a bit of the old
But it really only works for smallish vcf files where the whole content can be kept in memory for the conversion. |
Does your self assign mean you're happy to do this? I haven't started anything yet! |
@Euphrasiologist no, very happy for someone else to have a go. I'll give you write access, just branch and pull request. If you want to be fancy, you could also think about writing directly to an FBM instead of going through bed. |
@Euphrasiologist Have a look at the new method I wrote for |
I've accidentally committed straight to fbm branch instead of vcf, sorry! I'll add a test tomorrow: |
@Euphrasiologist No worries. I have moved the code under |
This looks like a very good option to write a function to read in vcf files directly: |
A sensible strategy is: |
I have added a simple |
I'd been working on something like this.
library(vcfR)
library(data.table)
library(bigstatsr)
setDTthreads(threads = 2)
# get the dimensions of the SNP table with data.table::fread(), then
# iterate over the VCF in chunks with vcfR::read.vcfR(), combining the
# number of rows to read with the number of rows to skip
vcf_path <- "~/Documents/software/tidypopgen/inst/extdata/anolis/punctatus_t70_s10_n46_filtered.recode.vcf.gz"
v <- fread(vcf_path)
colnm <- colnames(v)
vcf_dim <- dim(v)
nrow_ <- vcf_dim[1]
# split nrow_ into chunks of nrows records, with the last chunk
# being the remainder (dropped when nrow_ is an exact multiple of nrows)
nrows <- 1000
chunks_vec <- c(rep(nrows, floor(nrow_ / nrows)), nrow_ %% nrows)
chunks_vec <- chunks_vec[chunks_vec > 0]
# iterate over the chunks: read chunk i after skipping the records
# covered by chunks 1..(i - 1); seq_len(0) is empty, so the first skip is 0
# (1:(i - 1) would give c(1, 0) for i = 1 and wrongly skip the first chunk)
for (i in seq_along(chunks_vec)) {
  temp_vcf <- read.vcfR(vcf_path, nrows = chunks_vec[i], skip = sum(chunks_vec[seq_len(i - 1)]))
  gt <- vcfR::extract.gt(temp_vcf)
  #... todo
}
Do you think your implementation of reading the VCF would be faster than going over once with
Related note: for the life of me I can't see how to append to an FBM (https://search.r-project.org/CRAN/refmans/bigstatsr/html/FBM-class.html). There's a method to add columns, but not rows. I was thinking it's probably easier to write the matrix back to disk, merge, then convert to FBM? |
If all you are doing with |
I thought that transposing might be the way, but that's great news. Brilliant. I'll get on that. |
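For anyone following along, here is a minimal sketch of growing an FBM one chunk at a time with the `add_columns()` method mentioned above. It assumes the genotype matrix is stored as individuals x loci, so each chunk of VCF records becomes a block of new columns and no transposing is needed. The toy `chunks` list and the `"integer"` type are made up for illustration; this is not necessarily how the package code does it.

```r
library(bigstatsr)

# Toy stand-in for per-chunk dosage matrices (individuals x loci); illustrative only
n_ind <- 4
chunks <- list(
  matrix(c(0L, 1L, 2L, 0L, 1L, 1L, 0L, 2L), nrow = n_ind), # 2 loci
  matrix(c(2L, 2L, 1L, 0L), nrow = n_ind)                  # 1 locus
)

# Create the FBM with the first chunk's loci, then fill it
geno <- FBM(n_ind, ncol(chunks[[1]]), type = "integer")
geno[, seq_len(ncol(chunks[[1]]))] <- chunks[[1]]

# Each further chunk of loci is appended as new columns
for (chunk in chunks[-1]) {
  first_new <- ncol(geno) + 1
  geno$add_columns(ncol(chunk))
  geno[, first_new:ncol(geno)] <- chunk
}

dim(geno)
```

The backing file defaults to a temporary file here; a real reader would presumably pass its own `backingfile`.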
It might make sense to merge the |
Merged now (I hope), need to work on the function a bit more... |
I added multiple ploidy to the vcf reading code, and I think that reading vcf in chunks is now fully functional and ploidy compatible. Some more testing would be wise, though. |
Okay, I'll have a look around for a largeish multiple ploidy VCF. |
@Euphrasiologist Have a look at the ploidy issue, there is a good vcf associated with a book chapter I linked to. |
Oops forgot about that! |
We now have multiple tests for VCF with diploid organisms, checking against bed files. So, diploid vcfs should be good (thanks to @eviecarter33!). |
The only other thing to consider is whether the parsing should be parallelised (we use an apply, but we could think about using a parallel apply to convert genotypes to dosages). I guess we should benchmark how quickly we parse large VCFs and decide whether it is needed. |
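To make the genotype-to-dosage step concrete, here is a rough base R sketch of the kind of conversion an apply would perform. `gt_to_dosage()` is a hypothetical helper, not the package's function. It assumes GT strings use `/` or `|` as separators, counts every non-reference allele (so `1/2` gives 2), and returns `NA` when any allele is missing.

```r
# Hypothetical helper: count non-reference alleles in each GT string.
# Works for any ploidy because it simply counts the alleles present.
gt_to_dosage <- function(gt) {
  vapply(gt, function(x) {
    if (is.na(x)) return(NA_integer_)
    alleles <- strsplit(x, "[/|]")[[1]]
    if (any(alleles == ".")) return(NA_integer_)
    sum(alleles != "0")
  }, integer(1), USE.NAMES = FALSE)
}

# Toy loci x samples matrix, the orientation returned by vcfR::extract.gt()
gt <- matrix(
  c("0/0", "0/1", "1|1", "./.", "0/1/1/1", "1"),
  nrow = 3,
  dimnames = list(c("snp1", "snp2", "snp3"), c("ind1", "ind2"))
)

# apply() over loci returns one column per locus, i.e. samples x loci
dos <- apply(gt, 1, gt_to_dosage)
rownames(dos) <- colnames(gt)
dos
```

Swapping `apply()` for a parallel equivalent would only change the outer call, which is what makes benchmarking the two fairly cheap.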
In parallelising, do we care about the order of the records in the VCF? |
Yes, I think we do, as a number of analyses depend on the order. |
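On the ordering point, a small sketch of why a parallel apply need not scramble records: `parallel::mclapply()` returns its results in the same order as its input, whichever worker finishes first. The record names and the `toupper()` call standing in for parsing are placeholders for illustration only.

```r
library(parallel)

# Placeholder records standing in for VCF lines, split into ordered chunks of 5
records <- sprintf("snp%03d", 1:20)
chunk_id <- ceiling(seq_along(records) / 5)
chunks <- split(records, chunk_id)

# Forking is unavailable on Windows, so fall back to a single core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# toupper() is a stand-in for the real per-chunk parsing
parsed <- mclapply(chunks, toupper, mc.cores = n_cores)

# Results come back in input order, so concatenating restores record order
result <- unlist(parsed, use.names = FALSE)
stopifnot(identical(result, toupper(records)))
```

Writing each parsed chunk to its own known column range would keep the order intact in the same way.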
I have written a C++ parser that is lean and mean, and is, in principle, quite a bit faster than the current parsing with |
I believe you already have the functionality to do this? Would require a function like this:
Would something like that be sufficient?
Cheers,
M