Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: the logic for the PileupBuilder includeMapPositionsOutsideFrInsert filter is wrong #980

Closed
jrm5100 opened this issue Apr 19, 2024 · 6 comments · Fixed by #981
Closed
Assignees

Comments

@jrm5100
Copy link
Contributor

jrm5100 commented Apr 19, 2024

positionIsOutsideFrInsert is actually positionIsInsideFrInsert

private def positionIsOutsideFrInsert(rec: SamRecord, refIndex: Int, pos: Int): Boolean = {
rec.isFrPair && {
val (start, end) = Bams.insertCoordinates(rec)
rec.refIndex == refIndex && pos >= start && pos <= end
}
}

The filter is used in a way that is reversed so that the current tests still work as designed.

if (compare) compare = if (!includeMapPositionsOutsideFrInsert && rec.isFrPair) {
PileupBuilder.positionIsOutsideFrInsert(rec, refIndex = refIndex, pos = pos)
} else { compare }

Additionally there is a redundant rec.isFrPair call (this is done inside the method).

I came across this because I was tracking down some aberrant alleles in a pileup. Should reads mapped to separate chromosomes be rejected by this filter, or should that be a separate filter (likely in quickAcceptRecord)? Currently these are accepted even if includeMapPositionsOutsideFrInsert is false, which i think may be misleading.

@clintval
Copy link
Member

clintval commented Apr 24, 2024

I confirm:

  1. The privatepositionIsOutsideFrInsert was actually positionIsInsideFrInsert
  2. Despite (1), the use of positionIsOutsideFrInsert was still correct in function
  3. The additional rec.isFrPair was unnecessary EDIT: see below, this was not true.

Check out #981. I can't assign you as a reviewer, but would like your 👀 on it.

@clintval
Copy link
Member

clintval commented Apr 24, 2024

The history of the filter started 5 years ago in the context of somatic variant filtration:

/**
* Filters the pileup to remove entries that are from PE reads who's insert size indicates that the
* aligned base we're looking at is _outside_ the insert!
*/
private def filterPileup[A <: PileupEntry](pileup: Pileup[A]): Pileup[A] = {
val filtered = pileup.pile.filter { e =>
if (e.rec.isFrPair) {
val (start, end) = Bams.insertCoordinates(e.rec)
if (pileup.pos < start || pileup.pos > end) {
logger.warning(
s"""
|Ignoring read ${e.rec.name} mapped over variant site ${pileup.refName}:${pileup.pos} that has
|incompatible insert size yielding insert coordinates of ${pileup.refName}:$start-$end which do not overlap the variant.
""".stripMargin.trim.linesIterator.mkString(" "))
false
}
else {
true
}
}
else {
true
}
}
pileup.copy(pile=filtered)
}

It was then refactored into the PileupBuilder and swapped from a negative filter (exclude) to a positive filter (include):

The filter still behaves correctly:

Action Behavior
True Do not apply the filter.
False Remove FR pairs where the position in that pair is outside the insert coordinates. Otherwise keep the record (single-end, RF pair, tandem pair).

Because of this semantic change, the includeMapPositionsOutsideFrInsert filter behaves correctly but its meaning might be unclear. Why would someone want to include map positions outside the FR pair? I now think the filter should behave exactly as it is now but be called:

  • includeMapPositionsInsideFrInsert

And its default set to true.

@clintval
Copy link
Member

What do you think of renaming includeMapPositionsOutsideFrInsert to includeMapPositionsInsideFrInsert and swapping the default boolean from false to true? It would be a breaking Scala API change, but gratefully the name change would require a developer who is using it to reconsider the filter when updating their use of the API.

@clintval clintval reopened this Apr 24, 2024
@clintval
Copy link
Member

clintval commented Apr 24, 2024

@jrm5100 I don't think we want to exclude records in a pair where a mate maps to a different reference sequence. Although these pairs would not be proper, they should still count towards coverage when requesting a pile up of reads overlapping either of the reads in the pair.

But if you do not want to include pairs with reads mapped to different reference sequences, you can add a custom filter like:

PileupBuilder(source).withReadFilter(rec => rec.refName == rec.mateRefName)

@jrm5100
Copy link
Contributor Author

jrm5100 commented Apr 24, 2024

I think the main sticking point here is the intention of the filter.

Anything not isFRPair includes:

  • unpaired
  • one mapped
  • separate contigs
  • RF/Tandem

Reads that are isFrPair may look like this:

        ─────────────────────────►     
                                       
   ◄────────────────────────           
                                                             
        │                  │           
        │   Insert Size    │           
        │                  │                            

and being OutsideFrInsert would mean in the 3' tails that likely aren't real.

So is the filter meant to:

  • Only keep data from FR pairs and only when the position is inside the insert (i.e. only data that has very little chance of being an artifact)
  • Keep those not isFrPair but when they are isFrPair, make sure they aren't outside the insert region.

The distinction is where isFrPair is used.

It seems like the second is the intended usage. In that case I think the current filter name and functionality is fine, but the internal function just needs to be renamed and used slightly differently:

if (compare) compare = if (!includeMapPositionsOutsideFrInsert && rec.isFrPair) {
          PileupBuilder.positionIsInsideFrInsert(rec, refIndex = refIndex, pos = pos)
        } else { compare }

In this case the && rec.isFrPair is actually necessary, since positionIsInsideFrInsert is False if the record isn't in a FR pair, and I think that was my main source of confusion for the intention of the filter.


@jrm5100 I don't think we want to exclude records in a pair where a mate maps to a different reference sequence. Although these pairs would not be proper, they should still count towards coverage when requesting a pile up of reads overlapping either of the reads in the pair.

But if you do not want to include pairs with reads mapped to different reference sequences, you can add a custom filter like:

PileupBuilder(source).withReadFilter(rec => rec.refName == rec.mateRefName)

Yeah, after reading through the code more I realized that and settled on

PileupBuilder(source).withReadFilter(rec => rec.isFrPair)

Which is slightly more strict, but works for my purposes.

@clintval
Copy link
Member

Yes, this was the original intent of the filter:

Keep those not isFrPair but when they are isFrPair, make sure they aren't outside the insert region

So I need to revert the change I just made and update the documentation to be more clear. The original rec.isFrPair wasn't redundant! I've just gotten myself confused along the way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants