Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zero length read alignments make ReadWalkers crash. #373

Closed
vruano opened this issue Apr 14, 2015 · 33 comments
Closed

Zero length read alignments make ReadWalkers crash. #373

vruano opened this issue Apr 14, 2015 · 33 comments
Assignees
Labels
Milestone

Comments

@vruano
Copy link
Contributor

vruano commented Apr 14, 2015

Doing some empirical data testing I found some instance of reads that have no single based aligned with the reference.

E.g. CIGAR: 50I2S so insertion followed by soft-clip.

That causes ReadWalker to crash when trying to create SimpleInterval on the read with a IAE.

I guess the solution is to add additional Wellformed filter.

@vruano vruano added the bug label Apr 14, 2015
vruano pushed a commit that referenced this issue Apr 14, 2015
…ingle base aligned with the reference.

Fixes issue #373
@vruano vruano self-assigned this Apr 14, 2015
vruano pushed a commit that referenced this issue Apr 14, 2015
…ingle base aligned with the reference.

Fixes issue #373
@lbergelson
Copy link
Member

Adding a new filter seems like the right thing. Why is this read aligned like this? It seems like a bug...

@vruano
Copy link
Contributor Author

vruano commented Apr 14, 2015

No clue, if not BWA (or whatever aligner) some post-alignment editing that went wrong.

I have seem occurrences of this in at least 3 WEx bams that I presume have been produced using good practices.

@vruano
Copy link
Contributor Author

vruano commented Apr 14, 2015

Test fail in ApplyBQSRIntegrationTest after the change so It might be some of these in the test input.

@akiezun
Copy link
Contributor

akiezun commented Apr 14, 2015

what does GATK3 do?

@droazen
Copy link
Contributor

droazen commented Apr 14, 2015

In GATK3, this check is performed by the BadCigarFilter

@droazen
Copy link
Contributor

droazen commented Apr 14, 2015

Perhaps we should just port BadCigarFilter in its entirety, or would that be overkill? It is a pretty expensive read filter...

@akiezun
Copy link
Contributor

akiezun commented Apr 14, 2015

yes, port. worry about speed later

On Tue, Apr 14, 2015 at 3:13 PM, droazen [email protected] wrote:

Perhaps we should just port BadCigarFilter in its entirety, or would that
be overkill? It is a pretty expensive read filter...


Reply to this email directly or view it on GitHub
#373 (comment)
.

@vruano vruano removed their assignment Apr 14, 2015
@akiezun akiezun added this to the BlobFish milestone Apr 15, 2015
@akiezun akiezun self-assigned this Apr 15, 2015
@akiezun
Copy link
Contributor

akiezun commented Apr 24, 2015

@vruano can you clarify? The specific CIGAR "50I2S" that you mentioned is not flagged as bad by BadCigarFilter in GATK3. Can you provide test cases for this issue?

@vruano
Copy link
Contributor Author

vruano commented Apr 24, 2015

I see, then I guess is another filter that is getting rid of it in GATK3... @droazen pointed out to that filter class and I guess we just assumed that BadCigarFilter was getting rid of it without actually checking whether that is true.

@droazen
Copy link
Contributor

droazen commented Apr 24, 2015

It appeared to me upon a quick inspection that the old BadCigarFilter was doing this check (via its "hasMeaningfulElements" condition), but I guess I was mistaken. It seems to be making sure that there is at least one operator that consumes read bases, but doesn't do the equivalent check for operators that consume reference bases.

@vruano
Copy link
Contributor Author

vruano commented Apr 24, 2015

Then should be another filter or it might be the case that GATK3 does not get rid of it but it tolerates them ; it does not blow like Hellbender/ReadWalker is doing at the moment.

@akiezun
Copy link
Contributor

akiezun commented Apr 24, 2015

@vruano so a bam file with 1 read and cigar 50I2S would work in GATK3 (for print reads) and blow up on hellbender?

@droazen
Copy link
Contributor

droazen commented Apr 24, 2015

If we want the hellbender engine to tolerate these reads, we can probably do that very easily -- just need to modify ReadWalker to set readInterval to null for these reads, like it currently does with unmapped reads:

            .forEach(read -> {
                final SimpleInterval readInterval = read.isUnmapped() ? null :
                                                                        new SimpleInterval(read.getContig(), read.getStart(), read.getEnd());
                apply(read,
                      new ReferenceContext(reference, readInterval), // Will create an empty ReferenceContext if reference or readInterval == null
                      new FeatureContext(features, readInterval));   // Will create an empty FeatureContext if features or readInterval == null
            });

We should confirm that GATK3 actually tolerates these reads, though -- if it instead filters them out, I'd advocate for filtering them in hellbender as well.

@vruano
Copy link
Contributor Author

vruano commented Apr 24, 2015

The way I got an exception is by running a ReadWalker on one of the CEU WEx...

/humgen/gsa-hpprojects/dev/valentin/WExCNv/bams/CEUTrio.HiSeq.WEx.b37_decoy.NA12891.bam

using the broad target file
/humgen/gsa-hpprojects/GATK/bundle/current/b37/Broad.human.exome.b37.interval_list

I presume that that bams have been extensively analyzed using GATK3.X and older and so my conclusion/conjecture.

Since I'd got the same kind of problem on another file

/humgen/gsa-hpprojects/dev/valentin/WExCNv/bams/CEUTrio.HiSeq.WEx.b37_decoy.HG03006.bam

now I'm not sure which one is the one that includes that particular CIGAR but I presume is the same error mode.

@droazen
Copy link
Contributor

droazen commented Apr 24, 2015

What is the appropriate check for these reads? if ( read.getCigar().getReferenceLength() == 0 ) ?

@vruano
Copy link
Contributor Author

vruano commented Apr 24, 2015

I guess that an N (or P?) operation in the CIGAR may result in a non-zero reference length without guarantee that there is a single base of the read aligned anywere.

@vruano
Copy link
Contributor Author

vruano commented Apr 24, 2015

Perhaps an efficient form of

read.getAlignmentBlocks().stream().mapToInt(AlignmentBlock::getLength).sum() > 0

@droazen
Copy link
Contributor

droazen commented Apr 24, 2015

That is quite a check to be performing for every single read just to handle this edge case....perhaps a filter would be better?

@vruano
Copy link
Contributor Author

vruano commented Apr 24, 2015

That would be the filter :)
But yeah is inefficient.

@droazen
Copy link
Contributor

droazen commented Apr 24, 2015

(also, getAlignmentBlocks() has no equivalent in the GA4GH Read API, so this approach wouldn't be viable)

@vruano
Copy link
Contributor Author

vruano commented Apr 24, 2015

What do they have instead?

@vruano
Copy link
Contributor Author

vruano commented Apr 24, 2015

Can you share a pointer to that API?

@droazen
Copy link
Contributor

droazen commented Apr 24, 2015

It's just not in the API. I suppose if we needed it we could write a utility method that constructs the alignment blocks from the Cigar.

@droazen
Copy link
Contributor

droazen commented Apr 24, 2015

@akiezun
Copy link
Contributor

akiezun commented Apr 27, 2015

pull req #380 is now merged. assigning to @vruano to decide if this is enough for this issue and what to do if it's not

@akiezun akiezun removed this from the BlobFish milestone Apr 27, 2015
@akiezun akiezun assigned vruano and unassigned akiezun Apr 27, 2015
@akiezun akiezun removed the in_review label Apr 27, 2015
@vruano
Copy link
Contributor Author

vruano commented Apr 29, 2015

Afaik, is not as the original CIGAR reported above "50I2S" remains a valid one and the the exception would come up.

There are at least three ways to handle this:

  1. Make 0-length SimpleIntervals valid.
  2. All ReadWalkers filter reads with no base aligned to the reference.
  3. Consider no base aligned reads bad formed so they are generally excluded.

@droazen
Copy link
Contributor

droazen commented Apr 29, 2015

There is a 4th option, which I mentioned above -- modify ReadWalker to set readInterval to null for these reads, like it currently does with unmapped reads:

        .forEach(read -> {
            final SimpleInterval readInterval = read.isUnmapped() ? null :
                                                                    new SimpleInterval(read.getContig(), read.getStart(), read.getEnd());
            apply(read,
                  new ReferenceContext(reference, readInterval), // Will create an empty ReferenceContext if reference or readInterval == null
                  new FeatureContext(features, readInterval));   // Will create an empty FeatureContext if features or readInterval == null
        });

This approach has the advantage of allowing ReadWalkers to process these reads if they want to -- they'll just get empty (but non-null) ReferenceContexts and FeatureContexts, just like they do for unmapped reads.

@vruano
Copy link
Contributor Author

vruano commented Apr 29, 2015

Could you comment on what would be the problem in allowing for SimpleInterval to have 0-length.

@droazen
Copy link
Contributor

droazen commented Apr 29, 2015

0-length intervals open up a huge can of worms (not much code is designed to be able to handle them, and none of the htsjdk query interfaces we rely on work properly with them). Modifying ReadWalker to set readInterval to null for these reads is a more conservative fix for the problem at hand that allows these reads to be processed without trying to query overlapping reference/feature data for them.

@vruano
Copy link
Contributor Author

vruano commented Apr 29, 2015

I guess that those problems are all GenomeLoc(Parser) inheritance and since we are breaking away from GenomeLoc perhaps is a good time to lift this restriction as well; I don't think being conservative at this point is necessarily a good thing considering the longer road ahead.

@droazen
Copy link
Contributor

droazen commented Apr 29, 2015

The right way to approach this ticket is not to make a huge change that requires a massive amount of work across both our codebase and our dependencies, but instead to make a targeted fix that addresses the original bug. If we want to support zero-length intervals, that should be a separate ticket, as it's a non-trivial task to do correctly.

I'll add: unless the semantics of using zero-length intervals are precisely defined in every case, we should probably not allow them. Eg., what should happen when you query using a zero-length interval? Should you always get back nothing, or should you get back abutting records? How does overlap work? Allowing intervals to be zero-length complicates everything immensely -- what is the tangible benefit of allowing them?

@vruano
Copy link
Contributor Author

vruano commented Apr 29, 2015

In that case, whatever the people wanna do. 4 then?

@akiezun
Copy link
Contributor

akiezun commented Feb 6, 2016

closed by #1474

@akiezun akiezun closed this as completed Feb 6, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants