Improve compression on PL fields #53

Closed
jeromekelleher opened this issue Feb 29, 2024 · 10 comments
@jeromekelleher
Contributor

On recent 1000 genomes data, we have the following:

39G   20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.recalibrated_variants.vcf.gz

The zarr using defaults is:

43G     tmp/1kg_chr20_all.zarr

which is 4G more.

This is dominated by the FORMAT cols:

1.1G    tmp/1kg_chr20_all.zarr/call_AB                                                        
7.1G    tmp/1kg_chr20_all.zarr/call_AD                                                        
4.5G    tmp/1kg_chr20_all.zarr/call_DP                                                        
174M    tmp/1kg_chr20_all.zarr/call_genotype                                                  
31M     tmp/1kg_chr20_all.zarr/call_genotype_mask                                             
4.7M    tmp/1kg_chr20_all.zarr/call_genotype_phased    
6.2G    tmp/1kg_chr20_all.zarr/call_GQ                                                        
4.7M    tmp/1kg_chr20_all.zarr/call_MIN_DP                                                    
4.7M    tmp/1kg_chr20_all.zarr/call_MQ0                                                       
552M    tmp/1kg_chr20_all.zarr/call_PGT
966M    tmp/1kg_chr20_all.zarr/call_PID
22G     tmp/1kg_chr20_all.zarr/call_PL
4.7M    tmp/1kg_chr20_all.zarr/call_RGQ
4.7M    tmp/1kg_chr20_all.zarr/call_SB

In particular, call_PL is 22G. Hopefully there's some reasonably straightforward combination of filters and compressor options that'll bring this down.

@jeromekelleher
Contributor Author

Part of the problem here is that call_PL contains some very large values (absurdly so, given they are log 10 values), and the column gets encoded as 32 bit ints by default. Chunks are roughly 24 or 25 MB.

Reducing to 8-bit ints is probably quite defensible. However, just changing the dtype to i1 here resulted in larger chunks (31-32MB)! Setting shuffle to 0 seems to bring things down to something more sensible (17-18MB). I've not looked at it systematically, though.
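For intuition on why shuffling can hurt here, a rough simulation — this is not vcf2zarr's actual codec path; Blosc's BITSHUFFLE filter is approximated with numpy and the compressor with zlib, and the PL-like data is made up:

```python
import zlib
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical PL-like chunk: small non-negative values stored as i1.
pl = rng.integers(0, 100, size=200_000).astype(np.int8)

def bitshuffle(a):
    # Approximate Blosc's BITSHUFFLE: group bit 0 of every element
    # together, then bit 1, and so on, before compressing.
    bits = np.unpackbits(a.view(np.uint8)[:, None], axis=1)
    return np.packbits(bits.T).tobytes()

plain = len(zlib.compress(pl.tobytes(), 6))
shuffled = len(zlib.compress(bitshuffle(pl), 6))
print(plain, shuffled)
```

Whether the shuffled layout compresses better depends entirely on the bit-level structure of the data, which is presumably why disabling shuffle helped for i1 chunks here.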

Really the only actual solution here is to encode the PL values using the "observed alleles" idea. Hopefully we can make this good enough for now, and save that one for implementation later. Although it's probably fairly straightforward, and may be quicker to just do and leave as an option rather than trying to explain why our files aren't that much smaller than VCFs.

@jeromekelleher
Contributor Author

Doing i1 with no shuffle brings the total for PL down to 16G, so a 6G saving. But that's only a 2G saving over the original VCF which isn't that exciting.

@jeromekelleher
Contributor Author

See sgkit-dev/vcf-zarr-publication#5 for discussion of the local-alleles approach described in the SVCR paper.

@shz9

shz9 commented Mar 14, 2024

On the sample VCF that I'm experimenting with, I'm noticing that a lot of call_PL values are negative, presumably because they're outside the range of int16, the current default data type. I have two suggestions here:

  • If possible, make the data type an unsigned integer (uint16), since this field shouldn't hold negative values anyway (unless missing?). That doubles the usable range, to 2**16 - 1.
  • If we want to keep int16 nonetheless, I'd suggest clipping the PL values to the maximum the integer data type allows before conversion, so out-of-range values saturate rather than wrap around.
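A minimal sketch of the clipping suggestion, assuming plain numpy (the exact conversion path in vcf2zarr may differ):

```python
import numpy as np

# Hypothetical PL values as read from the VCF (int32), including
# cyvcf2's 2**31 - 1 sentinel for missing entries.
pl32 = np.array([0, 9, 104, 2147483647], dtype=np.int32)

# Clip to the target dtype's range before narrowing, instead of
# letting the cast wrap around to negative values.
info = np.iinfo(np.int16)
pl16 = np.clip(pl32, info.min, info.max).astype(np.int16)
# 2147483647 saturates to 32767 instead of wrapping to -1.
```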

@shz9

shz9 commented Mar 14, 2024

In the VCF that I'm working with, the largest recurring value that I'm observing is 2147483647, which is the maximum value for int32 (2**31 - 1). It's probably capped there because cyvcf2 (and htslib?) use int32 to represent PL values. So this is likely an arbitrary sentinel that could be represented more compactly in another way.

The rest of the numbers are more reasonable, usually less than 100 and rarely exceeding 2**12.

Assuming we can find a practical way to deal with the large arbitrary number, another possible solution that could benefit other data fields in the Zarr VCF format is integer packing with differential coding. numcodecs has a bit packing class for bool, but we can create a similar one for integers by linking to some of Lemire's C libraries to do the packing and unpacking super fast.

Depending on the sizes of these integers (or their differences), we could potentially losslessly store them in 6 bits or less. This may also depend on the data layout and how we compute the differences (within individual or across individuals for the same genotype?)
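A toy illustration of the differential-coding idea in plain numpy (numcodecs does ship a Delta filter; the fast integer bit-packing codec described above would be new):

```python
import numpy as np

# Hypothetical PL triples (hom-ref, het, hom-alt) across samples.
pl = np.array([0, 3, 35, 0, 6, 70, 0, 9, 104], dtype=np.int32)

# Differential coding: store the first value plus successive deltas.
deltas = np.diff(pl, prepend=pl[:1])

# Fixed bit width needed to pack the deltas losslessly
# (one extra bit for the sign).
width = int(np.ceil(np.log2(np.abs(deltas).max() + 1))) + 1

# Decoding is just a cumulative sum.
recovered = np.cumsum(deltas)
```

For this toy chunk the deltas pack into 8 bits each, versus 32 for the raw values; real data would need the sentinel/missing values handled first, as noted above.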

Happy to discuss this more.

@jeromekelleher
Contributor Author

Both of these issues (negative PLs and PL values of ~2**31) sound pretty pathological @shz9, would you mind giving a bit more detail? (PLs are phred-scaled log10 likelihoods, so anything greater than about 100 seems insanely large to me.) Where did you get the VCF from? What does the output of vcf2zarr inspect <exploded-if> look like?

@jeromekelleher
Contributor Author

Assuming we can find a practical way to deal with the large arbitrary number, another possible solution that could benefit other data fields in the Zarr VCF format is integer packing with differential coding. numcodecs has a bit packing class for bool, but we can create a similar one for integers by linking to some of Lemire's C libraries to do the packing and unpacking super fast.

For now I think we need to stick with what's available in numcodecs - adding new codecs etc would be something done at the Zarr level, rather than something we want to get involved in for now. If this catches on and there's a demonstrated need for better integer packing then that could certainly be brought forward as a proposal to the wider Zarr community.

I think we can probably get a long way with what we have, though (or at least, do an awful lot better than VCF!)

@shz9

shz9 commented Mar 15, 2024

After digging a bit more into the VCF file I have, it seems that this is an issue with missing data. I'm working with a tiny subset of 1000G data that I found on our servers (not sure how it was processed), consisting of 91 samples and ~6k variants on chr22. The first line looks like this:

./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	0/0:.:3,0:3:9:0,9,104	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	0/0:.:1,0:1:3:0,3,35	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	0/0:.:1,0:1:3:0,3,35	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	0/0:.:3,0:3:9:0,9,104	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	0/0:.:1,0:1:3:0,3,35	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	0/0:.:1,0:1:3:0,3,35	./.:.:0,0:0:.:.	0/0:.:2,0:2:6:0,6,70	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	0/0:.:2,0:2:6:0,6,59	0/0:.:2,0:2:6:0,6,70	./.:.:0,0:0:.:.	0/0:.:1,0:1:3:0,3,35	./.:.:0,0:0:.:.	0/0:.:2,0:2:6:0,6,70	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	0/0:.:1,0:1:3:0,3,35	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.	./.:.:0,0:0:.:.

Mostly missing data, which, for some reason, cyvcf2 just sets to the largest possible integer value:

In [8]: for i, rec in enumerate(VCF("mini.vcf.gz")):
   ...:     print(rec.gt_phred_ll_homref)
   ...:     print(rec.gt_phred_ll_het)
   ...:     print(rec.gt_phred_ll_homalt)
   ...:     break
   ...: 
[2147483647 2147483647 2147483647 2147483647          0 2147483647
 2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
 2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
 2147483647 2147483647 2147483647 2147483647          0 2147483647
 2147483647 2147483647 2147483647          0 2147483647 2147483647
 2147483647 2147483647          0 2147483647 2147483647 2147483647
 2147483647 2147483647 2147483647 2147483647 2147483647          0
 2147483647 2147483647 2147483647          0 2147483647          0
 2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
 2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
 2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
 2147483647 2147483647 2147483647          0          0 2147483647
          0 2147483647          0 2147483647 2147483647 2147483647
          0 2147483647 2147483647 2147483647 2147483647 2147483647
 2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
 2147483647]
...

When this gets translated to Zarr, the large values become negative:

In [1]: import zarr

In [2]: z = zarr.open("tmp/sample.zarr/")

In [4]: z.call_PL[:]
Out[4]: 
array([[[  -1,   -2,   -2, ...,   -2,   -2,   -2],
        [  -1,   -2,   -2, ...,   -2,   -2,   -2],
        [  -1,   -2,   -2, ...,   -2,   -2,   -2],
        ...,
        [  -1,   -2,   -2, ...,   -2,   -2,   -2],
        [  -1,   -2,   -2, ...,   -2,   -2,   -2],
        [  -1,   -2,   -2, ...,   -2,   -2,   -2]],
...

Again, this is a consequence of using the int16 data type. If I cast 2147483647 to int16 I get -1.
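The wraparound is easy to reproduce in numpy:

```python
import numpy as np

# cyvcf2's int32 sentinel for missing PL entries.
big = np.array([2147483647], dtype=np.int32)

# Narrowing to int16 keeps only the low 16 bits (0xFFFF),
# which reads back as -1 in two's complement.
narrowed = big.astype(np.int16)
```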

inspect didn't show me anything unusual about the call_PL array:

name                      type       chunks  size       compressed      max_n  min_val    max_val
------------------------  -------  --------  ---------  ------------  -------  ---------  ---------
CHROM                     String          4  57.34 KB   915 bytes           1  n/a        n/a
POS                       Integer         4  57.34 KB   9.53 KB             1  1.1e+07    1.1e+07
QUAL                      Float           4  57.34 KB   30.36 KB            1  30         3.7e+06
ID                        String          4  8.19 KB    234 bytes           0  n/a        n/a
FILTERS                   String          4  65.54 KB   902 bytes           1  n/a        n/a
REF                       String          4  57.34 KB   13.48 KB            1  n/a        n/a
ALT                       String          4  62.76 KB   18.92 KB            6  n/a        n/a
FORMAT/AB                 Float           4  233.82 KB  32.1 KB             1  n/a        n/a
FORMAT/AD                 Integer         4  464.12 KB  526.93 KB           7  0          1.2e+02
FORMAT/DP                 Integer         4  251.9 KB   392.14 KB           1  0          1.4e+02
FORMAT/GQ                 Integer         4  251.9 KB   418.42 KB           1  0          99
FORMAT/GT                 Integer         4  345.09 KB  53.48 KB            3  -1         6
FORMAT/MIN_DP             Integer         4  8.19 KB    234 bytes           0  n/a        n/a
FORMAT/MQ0                Integer         4  8.19 KB    234 bytes           0  n/a        n/a
FORMAT/PGT                String          4  186.27 KB  10.27 KB            1  n/a        n/a
FORMAT/PID                String          4  186.27 KB  13.54 KB            1  n/a        n/a
FORMAT/PL                 Integer         4  728.02 KB  1.22 MB            28  0          8e+03
...

@jeromekelleher
Contributor Author

Ah - so the negative numbers here are how vcf-zarr encodes missing and fill values.

I don't think there's any need to look at the VCF itself here, we're pretty certain that the data itself is being faithfully transcoded into zarr. The question here is whether we can use some simple combination of pre-existing Zarr filters and/or compression methods from Blosc to make compression a bit better.

Like I said above though, there's only one real fix for this specific problem with PL values and that's to use the approach described in the SVCR paper.

So, I think the time here would be better spent on figuring out how we can get better default settings for non PL fields, as discussed over in #74.

@jeromekelleher
Contributor Author

Closing in favour of #185 - there's no point in trying to make PL fields slightly better here, the local alleles approach is the right thing to do.

@jeromekelleher closed this as not planned on May 7, 2024