Local alleles - Add LPL field #273
Conversation
I think the test failures are related to #258 - the specific numbers we're checking against here depend on the version of numpy. Looks like we need to explicitly pin a numpy version somewhere.
Good progress, thanks @Will-Tyler. I've left a few minor comments.
It would be good to have a source of external truth here for computing these values. It's a pity that the bcftools thing didn't work out.
tests/test_vcf_examples.py (Outdated)
@@ -475,13 +515,18 @@ def test_ok_without_local_alleles(self, ds):

class Test1000G2020Example:
    data_path = "tests/data/vcf/1kg_2020_chrM.vcf.gz"
    expected_call_lpl_path = "tests/data/numpy/expected_call_LPL.npy"
I'm not so sure about how well this works as a testing strategy. I think testing against pre-computed values like this works fine for small examples where you can see them in the code. But here, as a code reader, I don't know where this array comes from, and it doesn't give me any reassurance that the values have been computed correctly.
I think a better approach would be to keep a copy of the 1kg_2020_chrM.vcf file that has these LPL values computed by some other method. Is there any way we can do this, other than using Hail? (Which would be a faff, I guess)
I put the expected values in a file because the ruff formatter was reformatting the expected values into a giant mess. I will put the expected values back into the Python code and try disabling the ruff linter / formatter for the expected values. If we don't like that, maybe we can consider removing the test case. I'm not aware of any other ways to compute the local-allele fields. I checked bcftools again; they updated the tag2tag plugin yesterday, but it still does not support XX to LXX conversions.
I did manually inspect the expected values to make sure that all looked correct in comparison with the test dataset. Even if the test takes some manual effort to verify the first time, I think the test is still valuable for detecting changes.
[154, 22, 0, -2, -2, -2, -2, -2, -2, -2],
],
[
[1002, 55, 1002, 0, 55, 1002, -2, -2, -2, -2],
The dimensions don't look right to me - why do we have so much padding? Shouldn't the inner dimension be 6 here, not 10? (That's the main point of the exercise here, to reduce the size of the last dimension in PL fields)
The inner dimension is 10 because at POS 3351, the first sample has a diploid genotype and three alternate allele indices. (I have pasted the test data in this pull request's description for easy reference.) Therefore, the LPL field should have ten entries, one for each possible genotype from the alleles, including the reference allele. In this case, all of the values are the missing value because the entry does not have a PL field.
I think that if we reduced the size by truncating the data, it would be more complicated for data consumers. We also need to support the case where only some, but not all of the PL values are missing, right? If so, I can add a test case for this.
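To make the dimension count above concrete: the number of unordered diploid genotypes over n alleles is n(n+1)/2, which is where the inner dimension of 10 comes from. A quick sketch (the helper name here is mine, not from bio2zarr):

```python
from math import comb

def diploid_genotype_count(n_alleles):
    # Unordered diploid genotypes over n_alleles alleles: n * (n + 1) / 2
    return comb(n_alleles + 1, 2)

# REF plus three ALT alleles at POS 3351 gives 4 alleles, hence 10 possible
# genotypes, which is why the inner LPL dimension is 10 for that row.
print(diploid_genotype_count(4))  # -> 10
```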
bio2zarr/vcf2zarr/icf.py (Outdated)
@@ -1183,6 +1247,25 @@ def process_partition(self, partition_index):
            else:
                format_fields.append(field)

# We need to determine LAA before LPL |
It's taken me a good while to figure this out - maybe we could express it more simply?
if "LAA" in format_fields and "LPL" in format_fields:
    laa_index = format_fields.index("LAA")
    lpl_index = format_fields.index("LPL")
    # LAA needs to come before LPL
    if lpl_index < laa_index:
        format_fields[laa_index] = "LPL"
        format_fields[lpl_index] = "LAA"
That's the same thing, right? Let's not worry about optimising to reduce the number of scans through the format_fields array.
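For what it's worth, here is a runnable version of that swap wrapped in a hypothetical helper (the real format_fields in bio2zarr may hold field objects rather than plain strings, so this is only a sketch of the idea):

```python
def ensure_laa_before_lpl(format_fields):
    # LAA must be processed before LPL, since LPL is derived from it.
    if "LAA" in format_fields and "LPL" in format_fields:
        laa_index = format_fields.index("LAA")
        lpl_index = format_fields.index("LPL")
        if lpl_index < laa_index:
            format_fields[laa_index] = "LPL"
            format_fields[lpl_index] = "LAA"
    return format_fields

print(ensure_laa_before_lpl(["GT", "LPL", "DP", "LAA"]))
# -> ['GT', 'LAA', 'DP', 'LPL']
```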
I'm not sure that the test failures depend on the NumPy version. When I change my environment to use Python 3.10, the tests pass. When I change the environment to use Python 3.11, I get the failures seen in the CI. In both cases, I used NumPy version 1.26.0. EDIT: I believe the breaking change is in numcodecs 0.13.0, which includes a change to the zstd compression algorithm (release notes). When I use version 0.13.0, I get the test failure. When I use version 0.12.1, the same tests pass. Version 0.13.0 was released on July 12th (source), the day that I opened this PR.
This looks great, thanks @Will-Tyler! I think I need to play with this a bit now and see how it goes on some real datasets.
I think we just need to drop the pin on numcodecs (issue fixed in #275) and we're good to merge.
This reverts commit ef8af9a.
Should be rebased now with the numcodecs version unpinned. I'm excited to see how well this works in practice and happy to implement needed improvements that you identify. Thanks for reviewing!
Overview
In this pull request, vcf2zarr computes a genotype-level, local-alleles field, LPL, during the explode step unless specifically told not to. The LPL field is related to the LAA field. The LAA field is a list of one-based indices into the variant-level ALT field that indicate which alleles are relevant (local) for the current sample. The LPL field lists the Phred-scaled genotype likelihoods for the genotypes formed from the reference allele and the alleles given by the LAA field. The sources of truth for the LPL field are the LAA field and the PL field. The PL field lists the Phred-scaled genotype likelihoods for all possible genotypes in the variant.

This pull request makes progress on #185. What remains is to prevent the creation and storage of the PL field when the local-allele fields are available.
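The derivation described above can be sketched as follows. This is an illustrative outline using the VCF-spec genotype-ordering formula for diploid calls, not the actual bio2zarr implementation (which operates on NumPy arrays with missing-value sentinels):

```python
def genotype_index(j, k):
    # VCF-spec ordering of an unordered diploid genotype (j, k) with j <= k
    return k * (k + 1) // 2 + j

def compute_lpl(pl, laa):
    # Local alleles are the reference (0) plus the one-based ALT indices in LAA
    local = [0] + sorted(laa)
    lpl = []
    for b in range(len(local)):
        for a in range(b + 1):
            # Look up the PL entry for the genotype over the two local alleles
            lpl.append(pl[genotype_index(local[a], local[b])])
    return lpl

# Two ALT alleles, so PL has 6 entries; keeping only ALT 2 locally
# selects the likelihoods for genotypes 0/0, 0/2, and 2/2.
print(compute_lpl([10, 0, 50, 30, 40, 60], [2]))  # -> [10, 30, 60]
```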
Testing
I updated and added unit tests based on the unit tests for the LAA field.
Here is the data from local_alleles.vcf.gz:

These should cover the following cases: