Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix compound key in hail #1127

Merged
merged 3 commits into from
Jan 31, 2025
Merged

Fix compound key in hail #1127

merged 3 commits into from
Jan 31, 2025

Conversation

MattWellie
Copy link
Contributor

@MattWellie MattWellie commented Jan 31, 2025

Correction to #1125

VEP outputs don't have REF/ALF alleles as individual attributes, they only have them within the larger input string, so we need to do some string manipulation to get them

See https://github.com/populationgenomics/production-pipelines/pull/602/files#r1936638900

I previously addressed these issues in a concept PR/branch for making seqr_loader consistently normalise/split multiallelic variants 😅

@codecov-commenter
Copy link

codecov-commenter commented Jan 31, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 26.67%. Comparing base (9d69a59) to head (551f159).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1127   +/-   ##
=======================================
  Coverage   26.67%   26.67%           
=======================================
  Files           9        9           
  Lines        1732     1732           
=======================================
  Hits          462      462           
  Misses       1270     1270           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@EddieLF EddieLF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems good

@MattWellie MattWellie merged commit 94808cc into main Jan 31, 2025
4 checks passed
@MattWellie MattWellie deleted the fix-compound-key-in-Hail branch January 31, 2025 04:00
# we're using split multiallelics, so we need to compound key on chr/pos/ref/alt
ht = ht.annotate(
locus=hl.locus(ht.vep.seq_region_name, hl.parse_int(ht.vep.input.split('\t')[1])),
alleles=ht.vep.allele_string.split('/'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it easy to make this split on '/' OR '|'

Given our push to LR we are going to start seeing phased VCFs more and more so it would be good to build that into our muscle memory

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A split on one of two different separators isn't explicitly available as a hail method, but we can do a replace for | -> /, then split on the latter 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a bit deal for this PR in particular - just one of those footguns that makes me nervous about basic string parsing a VCF.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll bookmark this and have a look for some phased calls in the final MT. I think this might be fine as this is the Alleles string, but the phasing is in the GT field. I would expect that even for 0|1 the alleles being passed to VEP are still in A/T format, with the option of being A/C/T for multiallelics

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants