Improve compression on PL fields #53
Part of the problem here is that call_PL contains some very large values (absurdly so, given they are log10 values), and the column gets encoded as 32-bit ints by default. Chunks are roughly 24 or 25MB. Reducing to 8-bit ints is probably quite defensible. However, just changing the dtype to i1 here resulted in larger chunks (31-32MB)! Setting shuffle to 0 seems to bring things down to something more sensible (17-18MB). I've not looked at it systematically, though. Really the only actual solution here is to encode the PL values using the "observed alleles" idea. Hopefully we can make this good enough for now, and save that one for implementation later. Although it's probably fairly straightforward, and it may be quicker to just do it and leave it as an option, rather than trying to explain why our files aren't that much smaller than VCFs.
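For anyone wanting to reproduce the dtype/shuffle experiment, here is a minimal sketch of the comparison using the zarr-python 2.x API and numcodecs. The synthetic data, chunk shape, and zstd/clevel settings are placeholders for illustration, not the converter's actual defaults.

```python
import numpy as np
import zarr
from numcodecs import Blosc

# Synthetic PL-like data: mostly small phred-scaled values (illustrative only).
rng = np.random.default_rng(42)
pl = rng.integers(0, 100, size=(10_000, 100, 3))

# 32-bit ints with byte shuffle (roughly the default situation described above).
z_default = zarr.array(
    pl.astype("i4"),
    chunks=(2_000, 100, 3),
    compressor=Blosc(cname="zstd", clevel=7, shuffle=Blosc.SHUFFLE),
)

# 8-bit ints with shuffle disabled (the variant that compressed better above).
z_small = zarr.array(
    pl.astype("i1"),
    chunks=(2_000, 100, 3),
    compressor=Blosc(cname="zstd", clevel=7, shuffle=Blosc.NOSHUFFLE),
)

print("i4 + shuffle:   ", z_default.nbytes_stored)
print("i1 + no shuffle:", z_small.nbytes_stored)
```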
Doing i1 with no shuffle brings the total for PL down to 16G, so a 6G saving. But that's only a 2G saving over the original VCF, which isn't that exciting.
See sgkit-dev/vcf-zarr-publication#5 for discussion of the local alleles approach described in the SVCR paper.
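For context, a rough sketch of the local/observed-alleles idea from the SVCR paper: for each call, keep only the PL entries whose allele pairs are drawn from the alleles actually observed in that genotype (plus the reference). This is purely illustrative and not the implementation being proposed; the function name and return layout are made up here.

```python
import numpy as np

def local_pl(pl_row, genotype):
    # Alleles observed in this call, plus the reference allele.
    local_alleles = sorted(set([0, *genotype]))
    idx = []
    for i, a in enumerate(local_alleles):
        for b in local_alleles[: i + 1]:
            # VCF ordering of genotype likelihoods for the pair b/a (with b <= a).
            idx.append(a * (a + 1) // 2 + b)
    return np.asarray(local_alleles), np.asarray(pl_row)[idx]

# A site with 3 alleles has 6 PL entries, ordered 0/0, 0/1, 1/1, 0/2, 1/2, 2/2.
pl = np.array([10, 20, 30, 0, 40, 50])
print(local_pl(pl, (0, 2)))  # keeps only 0/0, 0/2, 2/2 -> ([0, 2], [10, 0, 50])
```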
On the sample VCF that I'm experimenting with, I'm noticing that a lot of the PL values are negative.

In the VCF that I'm working with, the largest recurring value that I'm observing is 2147483647 (the maximum 32-bit signed integer). The rest of the numbers are more reasonable, usually less than 100. Assuming we can find a practical way to deal with that large arbitrary number, another possible solution that could benefit other data fields in the Zarr VCF format is integer packing with differential coding. Depending on the sizes of these integers (or their differences), we could potentially losslessly store them in 6 bits or less. This may also depend on the data layout and how we compute the differences (within an individual, or across individuals for the same genotype?). Happy to discuss this more.
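To make the differential-coding suggestion concrete, here is a hedged sketch using the existing Delta filter from numcodecs in front of Blosc (zarr-python 2.x style); the data, chunking, and compressor settings are placeholders. Note that Delta operates on the flattened chunk, so differences are taken along the last axis within a chunk, which is exactly the within-vs-across-individuals layout question raised above. True sub-byte packing (6 bits or fewer) isn't available in numcodecs out of the box and would need a custom codec.

```python
import numpy as np
import zarr
from numcodecs import Blosc, Delta

# Placeholder PL-like matrix (variants x samples); values are mostly small.
rng = np.random.default_rng(0)
pl = rng.integers(0, 100, size=(5_000, 1_000)).astype("i2")

# The idea by hand: keep the first value per row, then successive differences
# across samples, which tend to be small and therefore compress well.
deltas = pl.copy()
deltas[:, 1:] = np.diff(pl, axis=1)

# The same idea as a Zarr filter chain: Delta runs before the compressor.
z = zarr.array(
    pl,
    chunks=(1_000, 1_000),
    filters=[Delta(dtype="i2")],
    compressor=Blosc(cname="zstd", clevel=7, shuffle=Blosc.BITSHUFFLE),
)
print(z.nbytes_stored)
```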
Both of these issues (negative PLs and a PL value of ~2**32) sound pretty pathological @shz9, would you mind giving a bit more detail? (PLs are phred-scaled log10 values, so anything greater than about 100 seems insanely large to me.) Where did you get the VCF from? What does the output of
For now I think we need to stick with what's available in numcodecs - adding new codecs, etc., would be something done at the Zarr level, rather than something we want to get involved in for now. If this catches on and there's a demonstrated need for better integer packing, then that could certainly be brought forward as a proposal to the wider Zarr community. I think we can probably get a long way with what we have, though (or at least, do an awful lot better than VCF!)
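In that spirit, a small sketch of the kind of systematic sweep over existing Blosc options hinted at above (compressor name × shuffle mode), again with made-up data and chunking rather than the converter's real settings:

```python
import itertools
import numpy as np
import zarr
from numcodecs import Blosc

rng = np.random.default_rng(1)
pl = rng.integers(0, 100, size=(10_000, 200, 3)).astype("i1")

# Try each compressor/shuffle combination and report the stored size.
for cname, shuffle in itertools.product(
    ["zstd", "lz4", "zlib"],
    [Blosc.NOSHUFFLE, Blosc.SHUFFLE, Blosc.BITSHUFFLE],
):
    z = zarr.array(
        pl,
        chunks=(2_500, 200, 3),
        compressor=Blosc(cname=cname, clevel=7, shuffle=shuffle),
    )
    print(f"{cname:5s} shuffle={shuffle} stored={z.nbytes_stored}")
```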
After digging a bit more into the VCF file I have, it seems that this is an issue with missing data. I'm working with a tiny subset of 1000G data that I found on our servers (not sure how it was processed), consisting of 91 samples and ~6k variants on chr22. The first line looks like this:
Mostly missing data, which, for some reason, cyvcf2 reads back as 2147483647 (the maximum 32-bit signed integer):

```python
In [8]: for i, rec in enumerate(VCF("mini.vcf.gz")):
   ...:     print(rec.gt_phred_ll_homref)
   ...:     print(rec.gt_phred_ll_het)
   ...:     print(rec.gt_phred_ll_homalt)
   ...:     break
   ...:
[2147483647 2147483647 2147483647 2147483647 0 2147483647
2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
2147483647 2147483647 2147483647 2147483647 0 2147483647
2147483647 2147483647 2147483647 0 2147483647 2147483647
2147483647 2147483647 0 2147483647 2147483647 2147483647
2147483647 2147483647 2147483647 2147483647 2147483647 0
2147483647 2147483647 2147483647 0 2147483647 0
2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
2147483647 2147483647 2147483647 0 0 2147483647
0 2147483647 0 2147483647 2147483647 2147483647
0 2147483647 2147483647 2147483647 2147483647 2147483647
2147483647 2147483647 2147483647 2147483647 2147483647 2147483647
2147483647]
...
```

When this gets translated to Zarr, the missing entries show up as negative values in call_PL:

```python
In [1]: import zarr

In [2]: z = zarr.open("tmp/sample.zarr/")

In [4]: z.call_PL[:]
Out[4]:
array([[[ -1,  -2,  -2, ...,  -2,  -2,  -2],
        [ -1,  -2,  -2, ...,  -2,  -2,  -2],
        [ -1,  -2,  -2, ...,  -2,  -2,  -2],
        ...,
        [ -1,  -2,  -2, ...,  -2,  -2,  -2],
        [ -1,  -2,  -2, ...,  -2,  -2,  -2],
        [ -1,  -2,  -2, ...,  -2,  -2,  -2]],
       ...
```

Again, this seems to be a consequence of the missing data.
Ah - so the negative numbers here are how vcf-zarr encodes missing and fill values. I don't think there's any need to look at the VCF itself here; we're pretty certain that the data itself is being faithfully transcoded into Zarr. The question here is whether we can use some simple combination of pre-existing Zarr filters and/or compression methods from Blosc to make compression a bit better.

Like I said above, though, there's only one real fix for this specific problem with PL values, and that's to use the approach described in the SVCR paper. So I think the time here would be better spent on figuring out how we can get better default settings for non-PL fields, as discussed over in #74.
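To make that sentinel convention concrete, here is a small sketch of inspecting those values, assuming (as the session above suggests) that -1 marks a missing value and -2 marks fill/padding; the store path is just the one from the example above.

```python
import numpy as np
import zarr

z = zarr.open("tmp/sample.zarr/")     # path from the session above
pl = z["call_PL"][:]

missing = pl == -1                    # entries that were '.' in the VCF
fill = pl == -2                       # padding beyond the values present for a call
observed = np.ma.masked_array(pl, mask=pl < 0)

print("missing:", int(missing.sum()), "fill:", int(fill.sum()))
print("largest real PL:", int(observed.max()))
```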
Closing in favour of #185 - there's no point in trying to make PL fields slightly better here, the local alleles approach is the right thing to do. |
On recent 1000 genomes data, we have the following:

The zarr using defaults is:

which is 4G more.

This is dominated by the FORMAT cols. In particular, call_PL is 22G. Hopefully there's some reasonably straightforward combination of filters and compressor options that'll bring this down.
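For reference, one way to see which fields dominate the store size is to walk the arrays in the Zarr group and report their stored bytes; the store path here is hypothetical, and this assumes the zarr-python 2.x API.

```python
import zarr

root = zarr.open("1kg_chr20.zarr", mode="r")   # hypothetical store path

# Stored (compressed) size of every array in the group, largest first.
sizes = {name: arr.nbytes_stored for name, arr in root.arrays()}
for name, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} {size / 2**30:8.2f} GiB")
```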