Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reading in VCF's #6

Closed
Euphrasiologist opened this issue Mar 15, 2024 · 26 comments
Closed

Reading in VCF's #6

Euphrasiologist opened this issue Mar 15, 2024 · 26 comments
Assignees
Labels
enhancement New feature or request

Comments

@Euphrasiologist
Copy link
Collaborator

Euphrasiologist commented Mar 15, 2024

I believe you already have the functionality to do this? Would require a function like this:

read_vcf <- function(file, quiet = TRUE, ...) {
  inner_vcf <- vcfR::read.vcfR(file = file, verbose = !quiet, ...)
  as_gen_tibble(inner_vcf)
}

Would something like that be sufficient?

Cheers,
M

@dramanica
Copy link
Member

Hi @Euphrasiologist, we did have that functionality, but it is gone with the rewrite to use File Backed Matrices. Have a look at the branch fbm. I have switched from using snpBIN objects, as they were simply too slow to scale. I now use bigSNP objects from the package bigsnpr to store the data on disk. That brings us close to PLINK performance, easily handling a thousand of individuals with a million SNPs.
I have a few minor things to clean up, but you can have a look at the vignette in that branch as a walk through of how things work now (sorry, I did ask Jason to let you know about the big changes, but the message must have not reached you in time).

@dramanica
Copy link
Member

Now, a few thoughts on how vcf reading should work.
The simplest option would be to convert the vcf to a bed. I could not find anything in R that does that, which is surprising. The right approach would be to read the vcf in chunks, then extract the biallelic snps, and write the bed with genio. It has an append option which allows to expand your bed. In that way, we could read massive vcf files.
Writing directly to a FBM is a bit more tricky. We would have to first inspect the vcf, figure out the total number of biallelic sites (the columns in the matrix), and then initialise a FBM object (with an appropriate backing file). Then we would do the same as above, read the vcf in chunks, and fill in the columns of the FBM.

@Euphrasiologist
Copy link
Collaborator Author

Brilliant thanks for the info, I'll take a look and give it a whirl.

@dramanica
Copy link
Member

Actually, you just inspired me to rescue a bit of the old as_gen_tibble. So, if you do a pull now on the fbm branch, you can do:

vcf_path <- system.file("/extdata/anolis/punctatus_t70_s10_n46_filtered.recode.vcf.gz",
                         package = "tidypopgen")
 bed_path <- gt_vcf_to_bed(vcf_path, bed_path = tempfile("anolis_"))
 test_gt <- gen_tibble(bed_path)

But it really only works for smallish vcf files where the whole content can be kept in memory for the conversion.

@dramanica dramanica self-assigned this Mar 16, 2024
@Euphrasiologist
Copy link
Collaborator Author

Does your self assign mean you're happy to do this? I haven't started anything yet!

@dramanica
Copy link
Member

@Euphrasiologist no, very happy for someone else to have a go. I'll give you wite access, just branch and pull request. If you want to be fancy, you could think about also writing directly an fbm instead of going through bed.

@dramanica
Copy link
Member

@Euphrasiologist Have a look at the new method I wrote for gen_tibble which allows to create a gen_tibble from a matrix of genotypes (https://github.com/EvolEcolGroup/tidypopgen/blob/fbm/R/gen_tibble.R). I got bored of creating BED files for every test...
In principle, it could be adapted to a vcf, either in one go (as the old as_gen_tibble you suggested), or better, reading the vcf in chunks and filling in the FBM (which would allow for much bigger datasets).

@Euphrasiologist
Copy link
Collaborator Author

I've accidentally committed straight to fbm branch instead of vcf, sorry! I'll add a test tomorrow:
411c5e8

@dramanica
Copy link
Member

@Euphrasiologist No worries. I have moved the code under gen_tibble, as vcf files are then treated in the same way as .bed or .rds.

@dramanica
Copy link
Member

This looks like a very good option to write a function to read in vcf files directly:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10868310/

@dramanica
Copy link
Member

A sensible strategy is:
Open the vcf
Read a chunk of x lines. Extract binary SNPs. Create an FBM with the right number of columns (markers) and individuals for this chunk.
Read the next chunk. Extract binary SNPs. Add columns to the FBM and fill them with the new SNPs.
Keep repeating until we get to end of file.

@dramanica dramanica added the enhancement New feature or request label Mar 22, 2024
@dramanica
Copy link
Member

I have added a simple count_vcf_variants in the branch vcf. That should facilitate parsing a vcf in chunks.

@Euphrasiologist
Copy link
Collaborator Author

I'd been working on something like this.

library(vcfR)
library(data.table)
library(bigstatsr)
setDTthreads(threads = 2)

# check dimensions of snp matrix using vcfR::read.vcfR()
# and iterate over the VCF in chunks. use the combination of the
# number of rows to read and the number of rows to skip

vcf_path <- "~/Documents/software/tidypopgen/inst/extdata/anolis/punctatus_t70_s10_n46_filtered.recode.vcf.gz"

v <- fread(vcf_path)
colnm <- colnames(v)
vcf_dim <- dim(v)
nrow_ <- vcf_dim[1]

# using the nrow above, we can read the VCF in chunks
# of nrows and skip the first i * nrows rows
# split nrow into chunks of nrows with the last chunk
# being the remainder
nrows <- 1000
chunks_vec <- c(rep(nrows, floor(nrow_ / nrows)), nrow_ %% nrows)

# iterate over the chunks vec, read in the VCF and
# calculate the number of SNPs in each chunk

for (i in 1:length(chunks_vec)) {
  temp_vcf <- read.vcfR(vcf_path, nrow = chunks_vec[i], skip = sum(chunks_vec[1:(i - 1)]))
  gt <- vcfR::extract.gt(temp_vcf)
  #... todo
}

Do you think your implementation of reading the VCF would be faster than going over once with data.table::fread?

Related note. For the life of me I can't see how to append to an FBM. (https://search.r-project.org/CRAN/refmans/bigstatsr/html/FBM-class.html) There's a method to add columns, but not rows. I was thinking it's probably easier to write the matrix back to disk, merge, then convert to FBM?

@dramanica
Copy link
Member

If all you are doing with read.table is to count lines, then I would try the function I wrote, so that we don't bring in a dependency just to count lines. I think it should be pretty quick.
As for appending, luckily you are adding columns (loci are columns, individuals are rows). So, a row in the vcf is a column in the file backed matrix. So, just add the columns, and put in the data.
For the record, you can't add rows easily to a FBM, I did for rbind, and you end up having to transpose the files and appending one to the other, pretty messy.

@Euphrasiologist
Copy link
Collaborator Author

I thought that transposing might be the way, but that's great news. Brilliant. I'll get on that.

@dramanica
Copy link
Member

It might make sense to merge the vcf branch into ploidy, so that we can get vcf reading for diploid and polyploids working at the same time. Reading vcf is the last element (#19) (plus a couple more unit tests) for ploidy to be ready to be merged into main so that we officially start supporting polyploidy (and thus think about it as we develop new functions).

@Euphrasiologist
Copy link
Collaborator Author

Merged now (I hope), need to work on the function a bit more...

@dramanica dramanica mentioned this issue May 13, 2024
@dramanica
Copy link
Member

I added multiple ploidy to the vcf reading code, and I think that reading vcf in chunks is now fully functional and ploidy compatible. Some more testing would be wise, though.
Also, we should benchmark the multiple ploidy parser compared to the old specialised diploid version, to decide whether we can just keep the generic version, or whether an optimised diploid version needs to be brought back (it would be trivial to add a "assume_diploid=TRUE" parameter if needed).

@Euphrasiologist
Copy link
Collaborator Author

Okay, I'll have a look around for a largeish multiple ploidy VCF.

@dramanica
Copy link
Member

@Euphrasiologist Have a look at the ploidy issue, there is a good vcf associated with a book chapter I linked to.

@Euphrasiologist
Copy link
Collaborator Author

Oops forgot about that!

@dramanica
Copy link
Member

We now have multiple tests for VCF with diploid organisms, checking against bed files. So, diploid vcfs should be good (thanks to @eviecarter33!).
For multiple ploidy, we have a basic test in tests/testthat/test_ploidy_vcf.R, but we need to add a bit of depth to that test. The thing to do would be to have a look at that minimalistic VCF and check which individuals are diploid and which tetraploid, and make sure that we have parsed that correctly. Also, hand checking a couple of genotypes would be good. Once we have those tests, I think we could consider closing this issue, as we can then read VCFs of any ploidy.

@dramanica
Copy link
Member

The only other thing to consider is whether the parsing should be parallelised (we use an apply, but we could think about using a parallel apply to convert genotypes in dosages). I guess we should benchmark a bit how quickly we parse large vcfs and decide whether is needed.

@Euphrasiologist
Copy link
Collaborator Author

In parallelising, do we care about the order of the records in the VCF?

@dramanica
Copy link
Member

dramanica commented May 23, 2024

Yes, I think we do, as a number of analysis care about order.

@dramanica
Copy link
Member

dramanica commented Jun 21, 2024

I have written a C++ parser that is lean and mean, and is, in principle, quite a bit faster than the current parsing with vcfR. It is in the branch vcf_fast. But it needs some testing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants