-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix compound key in hail #1127
Fix compound key in hail #1127
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1127 +/- ##
=======================================
Coverage 26.67% 26.67%
=======================================
Files 9 9
Lines 1732 1732
=======================================
Hits 462 462
Misses 1270 1270 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems good
# we're using split multiallelics, so we need to compound key on chr/pos/ref/alt | ||
ht = ht.annotate( | ||
locus=hl.locus(ht.vep.seq_region_name, hl.parse_int(ht.vep.input.split('\t')[1])), | ||
alleles=ht.vep.allele_string.split('/'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it easy to make this split on '/' OR '|'
Given our push to LR we are going to start seeing phased VCFs more and more so it would be good to build that into our muscle memory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A split on one of two different separators isn't explicitly available as a hail method, but we can do a replace for |
-> /
, then split on the latter 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a bit deal for this PR in particular - just one of those footguns that makes me nervous about basic string parsing a VCF.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll bookmark this and have a look for some phased calls in the final MT. I think this might be fine as this is the Alleles string, but the phasing is in the GT field. I would expect that even for 0|1
the alleles being passed to VEP are still in A/T
format, with the option of being A/C/T
for multiallelics
Correction to #1125
VEP outputs don't have REF/ALF alleles as individual attributes, they only have them within the larger input string, so we need to do some string manipulation to get them
See https://github.com/populationgenomics/production-pipelines/pull/602/files#r1936638900
I previously addressed these issues in a concept PR/branch for making seqr_loader consistently normalise/split multiallelic variants 😅