Reading in VCF's #6

Euphrasiologist · 2024-03-15T10:49:10Z

I believe you already have the functionality to do this? Would require a function like this:

read_vcf <- function(file, quiet = TRUE, ...) {
  inner_vcf <- vcfR::read.vcfR(file = file, verbose = !quiet, ...)
  as_gen_tibble(inner_vcf)
}

Would something like that be sufficient?

Cheers,
M

dramanica · 2024-03-15T12:02:17Z

Hi @Euphrasiologist, we did have that functionality, but it is gone with the rewrite to use File Backed Matrices. Have a look at the branch fbm. I have switched from using SNPbin objects, as they were simply too slow to scale. I now use bigSNP objects from the package bigsnpr to store the data on disk. That brings us close to PLINK performance, easily handling a thousand individuals with a million SNPs.
I have a few minor things to clean up, but you can have a look at the vignette in that branch as a walkthrough of how things work now (sorry, I did ask Jason to let you know about the big changes, but the message must not have reached you in time).

dramanica · 2024-03-15T12:06:37Z

Now, a few thoughts on how vcf reading should work.
The simplest option would be to convert the vcf to a bed. I could not find anything in R that does that, which is surprising. The right approach would be to read the vcf in chunks, then extract the biallelic SNPs, and write the bed with genio. It has an append option which lets you extend an existing bed file. In that way, we could read massive vcf files.
Writing directly to an FBM is a bit more tricky. We would have to first inspect the vcf, figure out the total number of biallelic sites (the columns in the matrix), and then initialise an FBM object (with an appropriate backing file). Then we would do the same as above: read the vcf in chunks, and fill in the columns of the FBM.
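
As a rough illustration of the "extract the biallelic SNPs" step, here is a minimal base R sketch that filters raw VCF data lines. The helper name is_biallelic_snp and the toy records are made up for illustration; this is not code from tidypopgen, and it assumes tab-separated records with REF in column 4 and ALT in column 5.

```r
# Hypothetical helper (not part of tidypopgen): flag VCF records that
# are biallelic SNPs, i.e. REF and ALT are each a single A/C/G/T base.
is_biallelic_snp <- function(vcf_lines) {
  fields <- strsplit(vcf_lines, "\t", fixed = TRUE)
  ref <- vapply(fields, function(x) x[4], character(1))
  alt <- vapply(fields, function(x) x[5], character(1))
  grepl("^[ACGT]$", ref) & grepl("^[ACGT]$", alt)
}

# Toy records: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ind1 ind2
toy_lines <- c(
  "1\t100\trs1\tA\tG\t.\tPASS\t.\tGT\t0/1\t1/1",   # biallelic SNP
  "1\t200\trs2\tA\tG,T\t.\tPASS\t.\tGT\t0/2\t1/1", # multiallelic site
  "1\t300\trs3\tAT\tA\t.\tPASS\t.\tGT\t0/1\t0/0",  # deletion
  "1\t400\trs4\tC\tT\t.\tPASS\t.\tGT\t0/0\t0/1"    # biallelic SNP
)
keep <- is_biallelic_snp(toy_lines)
```

Summing keep over all chunks in a first pass would give the number of columns needed before the matrix is initialised.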

Euphrasiologist · 2024-03-15T12:28:57Z

Brilliant thanks for the info, I'll take a look and give it a whirl.

dramanica · 2024-03-15T13:40:19Z

Actually, you just inspired me to rescue a bit of the old as_gen_tibble. So, if you do a pull now on the fbm branch, you can do:

vcf_path <- system.file("extdata/anolis/punctatus_t70_s10_n46_filtered.recode.vcf.gz",
                        package = "tidypopgen")
bed_path <- gt_vcf_to_bed(vcf_path, bed_path = tempfile("anolis_"))
test_gt <- gen_tibble(bed_path)

But it really only works for smallish vcf files where the whole content can be kept in memory for the conversion.

Euphrasiologist · 2024-03-16T14:30:24Z

Does your self-assignment mean you're happy to do this? I haven't started anything yet!

dramanica · 2024-03-16T15:08:57Z

@Euphrasiologist no, very happy for someone else to have a go. I'll give you write access, just branch and pull request. If you want to be fancy, you could also think about writing directly to an FBM instead of going through bed.

dramanica · 2024-03-17T14:36:52Z

@Euphrasiologist Have a look at the new method I wrote for gen_tibble, which allows creating a gen_tibble from a matrix of genotypes (https://github.com/EvolEcolGroup/tidypopgen/blob/fbm/R/gen_tibble.R). I got bored of creating BED files for every test...
In principle, it could be adapted to a vcf, either in one go (like the old as_gen_tibble you suggested), or better, by reading the vcf in chunks and filling in the FBM (which would allow for much bigger datasets).

Euphrasiologist · 2024-03-18T21:20:55Z

I've accidentally committed straight to the fbm branch instead of vcf, sorry! I'll add a test tomorrow:
411c5e8

dramanica · 2024-03-18T22:18:13Z

@Euphrasiologist No worries. I have moved the code under gen_tibble, as vcf files are then treated in the same way as .bed or .rds.

dramanica · 2024-03-22T13:33:39Z

This looks like a very good option to write a function to read in vcf files directly:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10868310/

dramanica · 2024-03-22T16:12:02Z

A sensible strategy is:
1. Open the vcf.
2. Read a chunk of x lines. Extract the biallelic SNPs. Create an FBM with the right number of columns (markers) and rows (individuals) for this chunk.
3. Read the next chunk. Extract the biallelic SNPs. Add columns to the FBM and fill them with the new SNPs.
4. Keep repeating until we get to the end of the file.
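
The strategy above can be sketched in base R. This is only an illustration under stated assumptions: an ordinary in-memory matrix stands in for the FBM, the helper parse_chunk and the toy VCF are hypothetical, and the FORMAT column is assumed to contain GT only.

```r
# Write a tiny toy VCF so the sketch is self-contained.
vcf_file <- tempfile(fileext = ".vcf")
writeLines(c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind1\tind2",
  "1\t100\trs1\tA\tG\t.\tPASS\t.\tGT\t0/1\t1/1",
  "1\t200\trs2\tA\tG,T\t.\tPASS\t.\tGT\t0/2\t1/1",
  "1\t300\trs3\tC\tT\t.\tPASS\t.\tGT\t0/0\t0/1",
  "1\t400\trs4\tG\tA\t.\tPASS\t.\tGT\t1/1\t0/0",
  "1\t500\trs5\tT\tC\t.\tPASS\t.\tGT\t0/1\t0/1"
), vcf_file)

# Hypothetical helper: turn a chunk of data lines into an
# individuals x loci matrix of ALT allele counts, keeping only
# biallelic SNPs (assumes the sample fields hold GT only).
parse_chunk <- function(lines) {
  fields <- do.call(rbind, strsplit(lines, "\t", fixed = TRUE))
  keep <- grepl("^[ACGT]$", fields[, 4]) & grepl("^[ACGT]$", fields[, 5])
  gt <- fields[keep, -(1:9), drop = FALSE]       # loci x individuals
  dosage <- matrix(nchar(gsub("[^1]", "", gt)),
                   nrow = nrow(gt), ncol = ncol(gt))
  dosage[grepl(".", gt, fixed = TRUE)] <- NA     # missing calls
  t(dosage)                                      # individuals x loci
}

con <- file(vcf_file, open = "r")
# Skip the "##" meta lines; the "#CHROM" line carries the sample names.
repeat {
  header <- readLines(con, n = 1)
  if (!startsWith(header, "##")) break
}
samples <- strsplit(header, "\t", fixed = TRUE)[[1]][-(1:9)]

chunk_size <- 2
genotypes <- matrix(integer(0), nrow = length(samples), ncol = 0)
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break                  # end of file
  genotypes <- cbind(genotypes, parse_chunk(lines))  # new loci = new columns
}
close(con)
rownames(genotypes) <- samples
```

With a real FBM, the cbind() call would be replaced by adding columns to the backing file and filling them in.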

dramanica · 2024-03-26T03:34:26Z

I have added a simple count_vcf_variants in the branch vcf. That should facilitate parsing a vcf in chunks.
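
For readers without the branch to hand, counting variants can be sketched in a few lines of base R. The function below is a hypothetical stand-in named count_data_lines, not the count_vcf_variants implementation; it simply streams the file and counts non-empty lines that do not start with "#".

```r
# Hypothetical stand-in (not the tidypopgen implementation): count the
# variant records in a VCF by streaming it in chunks.
count_data_lines <- function(path, chunk_size = 10000) {
  con <- gzfile(path, open = "r")  # gzfile() also reads uncompressed files
  on.exit(close(con))
  n <- 0L
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break
    # header and meta lines start with "#"; ignore blank lines too
    n <- n + sum(nzchar(lines) & !startsWith(lines, "#"))
  }
  n
}

# Toy gzipped VCF with three variant records.
toy <- tempfile(fileext = ".vcf.gz")
out <- gzfile(toy, open = "w")
writeLines(c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
  "1\t100\t.\tA\tG\t.\t.\t.",
  "1\t200\t.\tC\tT\t.\t.\t.",
  "1\t300\t.\tG\tA\t.\t.\t."
), out)
close(out)

n_variants <- count_data_lines(toy, chunk_size = 2)
```

Only one chunk of lines is held in memory at a time, so the count does not require loading the whole file.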

Euphrasiologist · 2024-03-26T10:16:32Z

I'd been working on something like this.

library(vcfR)
library(data.table)
library(bigstatsr)
setDTthreads(threads = 2)

# check dimensions of the snp matrix using data.table::fread()
# and iterate over the VCF in chunks. use the combination of the
# number of rows to read and the number of rows to skip

vcf_path <- "~/Documents/software/tidypopgen/inst/extdata/anolis/punctatus_t70_s10_n46_filtered.recode.vcf.gz"

v <- fread(vcf_path)
colnm <- colnames(v)
vcf_dim <- dim(v)
nrow_ <- vcf_dim[1]

# using the nrow above, we can read the VCF in chunks
# of nrows and skip the rows already read
# split nrow into chunks of nrows with the last chunk
# being the remainder (dropped if nrow_ is a multiple of nrows)
nrows <- 1000
chunks_vec <- c(rep(nrows, nrow_ %/% nrows), nrow_ %% nrows)
chunks_vec <- chunks_vec[chunks_vec > 0]
# rows to skip before each chunk: 0 for the first chunk,
# then the running total of the previous chunk sizes
skip_vec <- cumsum(c(0, chunks_vec))[seq_along(chunks_vec)]

# iterate over the chunks vec, read in the VCF and
# calculate the number of SNPs in each chunk

for (i in seq_along(chunks_vec)) {
  temp_vcf <- read.vcfR(vcf_path, nrows = chunks_vec[i], skip = skip_vec[i])
  gt <- vcfR::extract.gt(temp_vcf)
  #... todo
}

Do you think your implementation of reading the VCF would be faster than going over once with data.table::fread?

Related note. For the life of me I can't see how to append to an FBM. (https://search.r-project.org/CRAN/refmans/bigstatsr/html/FBM-class.html) There's a method to add columns, but not rows. I was thinking it's probably easier to write the matrix back to disk, merge, then convert to FBM?

dramanica · 2024-03-26T11:37:48Z

If all you are doing with data.table::fread is counting lines, then I would try the function I wrote, so that we don't bring in a dependency just to count lines. I think it should be pretty quick.
As for appending, luckily you are adding columns (loci are columns, individuals are rows): a row in the vcf is a column in the file-backed matrix. So, just add the columns, and put in the data.
For the record, you can't easily add rows to an FBM. I did it for rbind, and you end up having to transpose the files and append one to the other, which is pretty messy.
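
A minimal sketch of adding columns, assuming bigstatsr's FBM() constructor and its add_columns() method behave as described in the package documentation. The chunk matrices are toy data, and the block is guarded so that it is skipped when bigstatsr is not installed.

```r
if (requireNamespace("bigstatsr", quietly = TRUE)) {
  n_ind <- 3
  # Two toy chunks, already laid out as individuals x loci.
  chunk1 <- matrix(c(0L, 1L, 2L, 1L, 1L, 0L), nrow = n_ind)  # 2 loci
  chunk2 <- matrix(c(2L, 2L, 0L), nrow = n_ind)              # 1 locus

  # Create the file-backed matrix from the first chunk.
  geno <- bigstatsr::FBM(n_ind, ncol(chunk1), type = "integer", init = 0L)
  geno[, seq_len(ncol(chunk1))] <- chunk1

  # Append columns for the next chunk, then fill them in.
  old_ncol <- ncol(geno)
  geno$add_columns(ncol(chunk2))
  geno[, old_ncol + seq_len(ncol(chunk2))] <- chunk2

  full <- geno[]  # pull everything into memory, for checking only
}
```

Because each VCF record becomes one column, every chunk only ever appends to the end of the backing file.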

Euphrasiologist · 2024-03-26T14:10:21Z

I thought that transposing might be the way, but that's great news. Brilliant. I'll get on that.

dramanica · 2024-04-22T08:23:06Z

It might make sense to merge the vcf branch into ploidy, so that we can get vcf reading for diploids and polyploids working at the same time. Reading vcf is the last element (#19) (plus a couple more unit tests) for ploidy to be ready to be merged into main so that we officially start supporting polyploidy (and thus think about it as we develop new functions).

Euphrasiologist · 2024-05-03T09:28:01Z

Merged now (I hope), need to work on the function a bit more...

dramanica · 2024-05-13T20:24:59Z

I added multiple ploidy to the vcf reading code, and I think that reading vcf in chunks is now fully functional and ploidy compatible. Some more testing would be wise, though.
Also, we should benchmark the multiple ploidy parser against the old specialised diploid version, to decide whether we can just keep the generic version, or whether an optimised diploid version needs to be brought back (it would be trivial to add an "assume_diploid=TRUE" parameter if needed).

Euphrasiologist · 2024-05-15T09:38:29Z

Okay, I'll have a look around for a largeish multiple ploidy VCF.

dramanica · 2024-05-15T09:52:58Z

@Euphrasiologist Have a look at the ploidy issue, there is a good vcf associated with a book chapter I linked to.

Euphrasiologist · 2024-05-15T10:02:56Z

Oops forgot about that!

dramanica · 2024-05-23T06:55:31Z

We now have multiple tests for VCF with diploid organisms, checking against bed files. So, diploid vcfs should be good (thanks to @eviecarter33!).
For multiple ploidy, we have a basic test in tests/testthat/test_ploidy_vcf.R, but we need to add a bit of depth to that test. The thing to do would be to have a look at that minimalistic VCF and check which individuals are diploid and which tetraploid, and make sure that we have parsed that correctly. Also, hand checking a couple of genotypes would be good. Once we have those tests, I think we could consider closing this issue, as we can then read VCFs of any ploidy.

dramanica · 2024-05-23T06:55:37Z

The only other thing to consider is whether the parsing should be parallelised (we use an apply, but we could think about using a parallel apply to convert genotypes into dosages). I guess we should benchmark how quickly we parse large vcfs and decide whether it is needed.
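
To make the genotype-to-dosage conversion concrete, here is a small base R sketch that works for any ploidy. The helper gt_to_dosage is hypothetical and is not the parser used in tidypopgen; it counts ALT alleles in a GT string and returns NA if any allele is missing.

```r
# Hypothetical helper: count ALT alleles in one genotype string such as
# "0/1", "1|1" or "0/1/1/1"; a missing call gives NA.
gt_to_dosage <- function(gt) {
  if (is.na(gt)) return(NA_integer_)
  alleles <- strsplit(gt, "[/|]")[[1]]
  if (any(alleles == ".")) return(NA_integer_)
  sum(alleles == "1")
}

# Toy loci x individuals matrix, the layout vcfR::extract.gt() returns.
gt <- matrix(
  c("0/1",     "1|1",     "./.",
    "0/0/1/1", "1/1/1/1", "0/1"),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("snp1", "snp2"), c("ind1", "ind2", "ind3"))
)

# apply() over both margins keeps the matrix shape; transpose so that
# individuals are rows and loci are columns.
dosage <- t(apply(gt, c(1, 2), gt_to_dosage))
```

A diploid-only version could replace the string splitting with a fixed lookup of the possible genotype strings, which is the kind of specialisation a benchmark would need to justify.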

Euphrasiologist · 2024-05-23T10:04:53Z

In parallelising, do we care about the order of the records in the VCF?

dramanica · 2024-05-23T11:16:39Z

Yes, I think we do, as a number of analyses care about order.
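
Parallelising need not change the record order, because parallel::mclapply() returns its results in the order of its input. The sketch below uses toy diploid data, and the block layout is an assumption for illustration rather than how the package splits its work.

```r
library(parallel)

# Toy loci x individuals genotypes; locus names record the file order.
gt <- matrix(c("0/1", "1/1",
               "0/0", "0/1",
               "1/1", "0/0",
               "0/1", "0/0"),
             nrow = 4, byrow = TRUE,
             dimnames = list(paste0("snp", 1:4), c("ind1", "ind2")))

# Split the loci into consecutive blocks, one per task.
blocks <- split(seq_len(nrow(gt)), rep(1:2, each = 2))

# Forking is unavailable on Windows, so fall back to one core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Each task converts its block of loci to ALT allele counts.
parsed <- mclapply(blocks, function(rows) {
  block <- gt[rows, , drop = FALSE]
  matrix(nchar(gsub("[^1]", "", block)), nrow = nrow(block),
         dimnames = dimnames(block))
}, mc.cores = n_cores)

# Results come back in the order of `blocks`, whichever task finished
# first, so binding them restores the original record order.
dosage <- do.call(rbind, unname(parsed))
```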

dramanica · 2024-06-21T15:16:01Z

I have written a C++ parser that is lean and mean, and is, in principle, quite a bit faster than the current parsing with vcfR. It is in the branch vcf_fast. But it needs some testing.

dramanica self-assigned this Mar 16, 2024

dramanica assigned Euphrasiologist Mar 16, 2024

dramanica added the enhancement New feature or request label Mar 22, 2024

dramanica mentioned this issue May 13, 2024

Ploidy #36

Merged

dramanica closed this as completed in 7c5ef31 Sep 13, 2024
