Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove check on PL values when computing local alleles #285

Closed

Conversation

jeromekelleher
Copy link
Contributor

WIP - hopefully closes #277

@jeromekelleher
Copy link
Contributor Author

Just disabling the check on PL values doesn't seem to help, so I took out the check on AD also. We should just be getting back the GTs now. This does definitely help:

jk@holly$ python3 -m bio2zarr vcf2zarr inspect tmp/chr22.vcz/ 
name                          dtype    stored      size          ratio    nchunks  chunk_size    avg_chunk_stored    shape              chunk_shape        compressor                                                      filters
----------------------------  -------  ----------  ----------  -------  ---------  ------------  ------------------  -----------------  -----------------  --------------------------------------------------------------  ------------
/call_PL                      int32    497.99 MiB  25.21 GiB     52            30  860.44 MiB    16.6 MiB            (96514, 2504, 28)  (10000, 1000, 28)  Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/call_LPL                     int32    315.14 MiB  5.4 GiB       18            30  184.38 MiB    10.5 MiB            (96514, 2504, 6)   (10000, 1000, 6)   Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/call_AD                      int16    196.58 MiB  3.15 GiB      16            30  107.56 MiB    6.55 MiB            (96514, 2504, 7)   (10000, 1000, 7)   Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/call_DP                      int16    112.11 MiB  460.95 MiB     4.1          30  15.37 MiB     3.74 MiB            (96514, 2504)      (10000, 1000)      Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/call_GQ                      int8     58.65 MiB   230.48 MiB     3.9          30  7.68 MiB      1.95 MiB            (96514, 2504)      (10000, 1000)      Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/call_AB                      float32  30.97 MiB   921.9 MiB     30            30  30.73 MiB     1.03 MiB            (96514, 2504)      (10000, 1000)      Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/call_PID                     object   8.64 MiB    1.8 GiB      210            30  61.46 MiB     295 KiB             (96514, 2504)      (10000, 1000)      Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   [VLenUTF8()]
/call_LAA                     int8     8 MiB       460.95 MiB    58            30  15.37 MiB     273.23 KiB          (96514, 2504, 2)   (10000, 1000, 2)   Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/call_genotype                int8     7.8 MiB     460.95 MiB    59            30  15.37 MiB     266.26 KiB          (96514, 2504, 2)   (10000, 1000, 2)   Blosc(cname='zstd', clevel=7, shuffle=BITSHUFFLE, blocksize=0)  None
/call_PGT                     object   6.74 MiB    1.8 GiB      270            30  61.46 MiB     230.1 KiB           (96514, 2504)      (10000, 1000)      Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   [VLenUTF8()]

I don't think we've quite got it right though, because if we're just doing local alleles based on the genotypes, the local alleles should all be biallelic and therefore the LPL field should have dimension 3. So there's definitely problems, but we're on the right track.

@Will-Tyler
Copy link
Contributor

therefore the LPL field should have dimension 3

In the case where the LAA has two alternate alleles, vcztools outputs 6 LPL values because the reference allele is always "included". This is necessary because the LAA field does not indicate whether the reference allele is included or not.

See LAA definition in VCF 4.5.

@jeromekelleher
Copy link
Contributor Author

Aha yes, that's true, good point. I don't know why the VCF standard decided to list just the non-ref alleles, that seems like an obvious waste here. Oh well, that's the standard, there's no point in thinking too hard about it.

Happy to finish the fix you started.

If you could pick this up and fix the tests that would be super helpful, thanks @Will-Tyler !

@jeromekelleher
Copy link
Contributor Author

Superseded by #286

@jeromekelleher jeromekelleher deleted the local-alleles-updates branch December 3, 2024 14:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

LPL no smaller than PL
2 participants