Zero length read alignments make ReadWalkers crash. #373
Comments
…ingle base aligned with the reference. Fixes issue #373
Adding a new filter seems like the right thing. Why is this read aligned like this? It seems like a bug... |
No clue. If it is not BWA (or whatever aligner was used), then some post-alignment editing went wrong. I have seen occurrences of this in at least 3 WEx BAMs that I presume were produced using good practices. |
Tests fail in ApplyBQSRIntegrationTest after the change, so there might be some of these reads in the test input. |
what does GATK3 do? |
In GATK3, this check is performed by the |
Perhaps we should just port |
Yes, port. Worry about speed later. |
@vruano can you clarify? The specific CIGAR "50I2S" that you mentioned is not flagged as bad by BadCigarFilter in GATK3. Can you provide test cases for this issue? |
I see. Then I guess another filter is getting rid of it in GATK3... @droazen pointed to that filter class, and I guess we just assumed that BadCigarFilter was getting rid of it without actually checking whether that is true. |
It appeared to me upon a quick inspection that the old |
Then it must be another filter, or it might be that GATK3 does not get rid of these reads but tolerates them; it does not blow up the way Hellbender's ReadWalker does at the moment. |
@vruano so a bam file with 1 read and cigar 50I2S would work in GATK3 (for print reads) and blow up on hellbender? |
If we want the hellbender engine to tolerate these reads, we can probably do that very easily -- just need to modify
We should confirm that GATK3 actually tolerates these reads, though -- if it instead filters them out, I'd advocate for filtering them in hellbender as well. |
The way I got the exception was by running a ReadWalker on one of the CEU WEx... /humgen/gsa-hpprojects/dev/valentin/WExCNv/bams/CEUTrio.HiSeq.WEx.b37_decoy.NA12891.bam using the Broad target file. I presume that these BAMs have been extensively analyzed using GATK 3.x and older, hence my conclusion/conjecture. I have since hit the same kind of problem on another file, /humgen/gsa-hpprojects/dev/valentin/WExCNv/bams/CEUTrio.HiSeq.WEx.b37_decoy.HG03006.bam. I'm not sure which of the two includes that particular CIGAR, but I presume it is the same error mode. |
What is the appropriate check for these reads? |
I guess that an N (or P?) operation in the CIGAR may result in a non-zero reference length without any guarantee that a single base of the read is aligned anywhere. |
Perhaps an efficient form of read.getAlignmentBlocks().stream().mapToInt(AlignmentBlock::getLength).sum() > 0 |
That is quite a check to be performing for every single read just to handle this edge case... Perhaps a filter would be better? |
That would be the filter :) |
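For illustration, the check discussed above could be sketched without alignment blocks at all, by summing the lengths of the CIGAR operations that align read bases to the reference (M, = and X in the SAM specification). This is only a minimal sketch under stated assumptions: `AlignedBaseCheck`, `alignedBaseCount` and `hasAlignedBases` are hypothetical names, not htsjdk or Hellbender API, and the CIGAR is taken as a plain string that is assumed to be syntactically valid.

```java
// Hypothetical helper, not part of htsjdk or Hellbender.
// Assumes the CIGAR string is syntactically valid (e.g. "50I2S", "10M5N10M").
final class AlignedBaseCheck {

    /** Returns the number of read bases in M, = or X operations. */
    static int alignedBaseCount(String cigar) {
        int total = 0;
        int length = 0;
        for (int i = 0; i < cigar.length(); i++) {
            char c = cigar.charAt(i);
            if (c >= '0' && c <= '9') {
                // Accumulate the decimal length that precedes the operator.
                length = length * 10 + (c - '0');
            } else {
                // Only M, = and X place read bases against reference bases.
                if (c == 'M' || c == '=' || c == 'X') {
                    total += length;
                }
                length = 0;
            }
        }
        return total;
    }

    /** Filter predicate: keep the read only if at least one base is aligned. */
    static boolean hasAlignedBases(String cigar) {
        return alignedBaseCount(cigar) > 0;
    }
}
```

With this sketch, `hasAlignedBases("50I2S")` is false and `hasAlignedBases("10M5N10M")` is true. A CIGAR such as `5S10N5S` is also rejected even though it has a non-zero reference length, which is the N case raised above.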
(also, |
What do they have instead? |
Can you share a pointer to that API? |
It's just not in the API. I suppose if we needed it we could write a utility method that constructs the alignment blocks from the Cigar. |
AFAIK it is not, as the original CIGAR reported above, "50I2S", remains a valid one and the exception would still come up. There are at least three ways to handle this:
|
There is a 4th option, which I mentioned above -- modify ReadWalker to set
This approach has the advantage of allowing |
Could you comment on what the problem would be with allowing SimpleInterval to have 0 length? |
0-length intervals open up a huge can of worms (not much code is designed to be able to handle them, and none of the |
I guess that those problems are all inherited from GenomeLoc(Parser), and since we are breaking away from GenomeLoc, perhaps this is a good time to lift this restriction as well; I don't think being conservative at this point is necessarily a good thing considering the longer road ahead. |
The right way to approach this ticket is not to make a huge change that requires a massive amount of work across both our codebase and our dependencies, but instead to make a targeted fix that addresses the original bug. If we want to support zero-length intervals, that should be a separate ticket, as it's a non-trivial task to do correctly. I'll add: unless the semantics of using zero-length intervals are precisely defined in every case, we should probably not allow them. E.g., what should happen when you query using a zero-length interval? Should you always get back nothing, or should you get back abutting records? How does overlap work? Allowing intervals to be zero-length complicates everything immensely -- what is the tangible benefit of allowing them? |
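To make the overlap question concrete, here is a small sketch of what the usual closed-interval overlap test does if a zero-length interval is encoded as `end = start - 1`. Both the encoding and the `ZeroLengthOverlapDemo` class are assumptions made for illustration only; nothing here is existing SimpleInterval behaviour.

```java
// Illustration only: the usual overlap test for closed, 1-based intervals,
// applied to a hypothetical zero-length interval encoded as end = start - 1.
final class ZeroLengthOverlapDemo {

    /** True if closed intervals [aStart, aEnd] and [bStart, bEnd] share a position. */
    static boolean overlaps(int aStart, int aEnd, int bStart, int bEnd) {
        return aStart <= bEnd && bStart <= aEnd;
    }
}
```

Under that encoding, the zero-length interval [5, 4] "overlaps" [1, 10], but not the abutting intervals [1, 4] and [5, 10], and not even itself. Whether any of those answers is the desired one is exactly the kind of semantics that would have to be defined first.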
In that case, whatever the people wanna do. 4 then? |
closed by #1474 |
While doing some empirical data testing, I found some instances of reads that have not a single base aligned with the reference.
E.g. CIGAR: 50I2S, i.e. an insertion followed by a soft clip.
That causes ReadWalker to crash with an IllegalArgumentException (IAE) when trying to create a SimpleInterval for the read.
I guess the solution is to add an additional Wellformed filter.
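A rough sketch of the failure mode, under stated assumptions: the read's span is taken to be `start .. start + referenceLength - 1`, and the `Interval` class below is a hypothetical stand-in that rejects `end < start`, which is how the SimpleInterval failure described above is assumed to arise. `ZeroLengthReadDemo`, `referenceLength` and `spanOf` are illustrative names, not Hellbender or htsjdk API.

```java
// Illustration only; none of these names exist in Hellbender or htsjdk.
final class ZeroLengthReadDemo {

    /** Reference bases consumed by a valid CIGAR string: operators M, D, N, = and X. */
    static int referenceLength(String cigar) {
        int total = 0;
        int length = 0;
        for (int i = 0; i < cigar.length(); i++) {
            char c = cigar.charAt(i);
            if (c >= '0' && c <= '9') {
                length = length * 10 + (c - '0');
            } else {
                if (c == 'M' || c == 'D' || c == 'N' || c == '=' || c == 'X') {
                    total += length;
                }
                length = 0;
            }
        }
        return total;
    }

    /** Hypothetical closed, 1-based interval that refuses end < start. */
    static final class Interval {
        final int start;
        final int end;

        Interval(int start, int end) {
            if (end < start) {
                throw new IllegalArgumentException(
                        "end " + end + " precedes start " + start);
            }
            this.start = start;
            this.end = end;
        }
    }

    /** Span of a read on the reference: start .. start + referenceLength - 1. */
    static Interval spanOf(int alignmentStart, String cigar) {
        return new Interval(alignmentStart,
                alignmentStart + referenceLength(cigar) - 1);
    }
}
```

For a read at position 100, `spanOf(100, "52M")` gives 100..151, whereas `spanOf(100, "50I2S")` asks for 100..99 and throws, because neither I nor S consumes any reference bases.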