Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CallDuplexConsensusReads hard-clips ONE of the two duplex reads #831

Open
genegolts opened this issue Apr 25, 2022 · 14 comments
Open

CallDuplexConsensusReads hard-clips ONE of the two duplex reads #831

genegolts opened this issue Apr 25, 2022 · 14 comments
Labels

Comments

@genegolts
Copy link

Hello fgbio folks,
I am encountering an issue where one of the consensus sequences in a duplex pair gets clipped to the insert size while the other read doesn't get clipped. Here is an example. First, these are the UMI reads going in:

@hd VN:1.6 SO:unsorted GO:query SS:unsorted:template-coordinate
@sq SN:22 LN:51304566
@sq SN:hs37d5 LN:35477943
@rg ID:A SM:GT-30-2903_EOT-plasma LB:GT-30-2903_EOT-plasma PL:illumina
A00138:929:HF2Y7DSX3:2:1353:11930:36558 99 22 24313293 60 68S73M1S = 24313293 73 TTTGGTTTTTGTGAGCATGCAGGAGCACTAGTTTTTTGAATAATTTTTTCTTATGCACAGCTCCTGTGCAGAGGTCTGGTCCCATTTTGGGGCTGAGGACAGGAGCCTCAGGTCAGCAGTCTGCAGGAAGGCCCCAGCTGGA FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF SA:Z:22,24476687,+,65M77S,60,0; MC:Z:69S73M MD:Z:74 RG:Z:A MI:Z:1153019/A NM:i:0 MQ:i:60 AS:i:74 XS:i:19 RX:Z:CTA-GGT
A00138:929:HF2Y7DSX3:2:1353:11930:36558 147 22 24313293 60 69S73M = 24313293 -73 ATTTGGTTTTTGTGAGCATGCAGGAGCACTAGTTTTTTGAATAATTTTTTCTTATGCACAGCTCCTGTGCAGAGGTCTGGTCCCATTTTGGGGCTGAGGACAGGAGCCTCAGGTCAGCAGTCTGCAGGAAGGCCCCAGCTGG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF SA:Z:22,24476686,-,66M76S,60,0; MC:Z:68S73M1S MD:Z:73 RG:Z:A MI:Z:1153019/A NM:i:0 MQ:i:60 AS:i:73 XS:i:19 RX:Z:CTA-GGT
A00138:929:HF2Y7DSX3:2:1377:26422:28087 99 22 24313293 60 68S73M1S = 24313293 73 TTTGGTTTTTGTGAGCATGCAGGAGCACTAGTTTTTTGAATAATTTTTTCTTATGCACAGCTCCTGTGCAGAGGTCTGGTCCCATTTTGGGGCTGAGGACAGGAGCCTCAGGTCAGCAGTCTGCAGGAAGGCCCCAGCTGGA FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF SA:Z:22,24476687,+,65M77S,60,0; MC:Z:69S73M MD:Z:74 RG:Z:A MI:Z:1153019/A NM:i:0 MQ:i:60 AS:i:74 XS:i:19 RX:Z:CTA-GGT
A00138:929:HF2Y7DSX3:2:1377:26422:28087 147 22 24313293 60 69S73M = 24313293 -73 ATTTGGTTTTTGTGAGCATGCAGGAGCACTAGTTTTTTGAATAATTTTTTCTTATGCACAGCTCCTGTGCAGAGGTCTGGTCCCATTTTGGGGCTGAGGACAGGAGCCTCAGGTCAGCAGTCTGCAGGAAGGCCCCAGCTGG FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF SA:Z:22,24476686,-,66M76S,60,0; MC:Z:68S73M1S MD:Z:73 RG:Z:A MI:Z:1153019/A NM:i:0 MQ:i:60 AS:i:73 XS:i:19 RX:Z:CTA-GGT

Notice that the pairs overlap each other almost fully, with a 1bp overhang. The insert size here was set based on the alignment, so are the soft-clip tags. This is not the true insert size, and I suspect this has something to do with the odd output

Here is the output of the program ( java -jar -Xmx32G -XX:ActiveProcessorCount=1 /home/ggoltsman/utils/fgbio/target/scala-2.13/fgbio-2.0.2-57a72b4-SNAPSHOT.jar CallDuplexConsensusReads --min-reads=4 0 0 --consensus-call-overlapping-bases false
--input=UMI_1153019.s.bam ). The second consensus read is clipped to 73 bp while the first one is unclipped:

GT-30-2903_EOT-plasma:1153019 77 * 0 0 * * 0 0 TTTGGTTTTTGTGAGCATGCAGGAGCACTAGTTTTTTGAATAATTTTTTCTTATGCACAGCTCCTGTGCAGAGGTCTGGTCCCATTTTGGGGCTGAGGACAGGAGCCTCAGGTCAGCAGTCTGCAGGAAGGCCCCAGCTGG qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqfqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqfqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq aD:i:5 bD:i:1 cD:i:6 aE:f:0 bE:f:0 cE:f:0 RG:Z:A MI:Z:1153019 aM:i:5 bM:i:1 cM:i:6 RX:Z:CTA-GGT ac:Z:TTTGGTTTTTGTGAGCATGCAGGAGCACTAGTTTTTTGAATAATTTTTTCTTATGCACAGCTCCTGTGCAGAGGTCTGGTCCCATTTTGGGGCTGAGGACAGGAGCCTCAGGTCAGCAGTCTGCAGGAAGGCCCCAGCTGG bc:Z:TTTGGTTTTTGTGAGCATGCAGGAGCACTAGTTTTTTGAATAATTTTTTCTTATGCACAGCTCCTGTGCAGAGGTCTGGTCCCATTTTGGGGCTGAGGACAGGAGCCTCAGGTCAGCAGTCTGCAGGAAGGCCCCAGCTGG ad:B:s,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5 bd:B:s,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 ae:B:s,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,be:B:s,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 aq:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN bq:Z:DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD9DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD9DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
GT-30-2903_EOT-plasma:1153019 141 * 0 0 * * 0 0 CCAGCTGGGGCCTTCCTGCAGACTGCTGACCTGAGGCTCCTGTCCTCAGCCCCAAAATGGGACCAGACCTCTG qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq aD:i:5 bD:i:1 cD:i:6 aE:f:0 bE:f:0 cE:f:0 RG:Z:A MI:Z:1153019 aM:i:5 bM:i:1 cM:i:6 RX:Z:CTA-GGT ac:Z:CCAGCTGGGGCCTTCCTGCAGACTGCTGACCTGAGGCTCCTGTCCTCAGCCCCAAAATGGGACCAGACCTCTG bc:Z:CCAGCTGGGGCCTTCCTGCAGACTGCTGACCTGAGGCTCCTGTCCTCAGCCCCAAAATGGGACCAGACCTCTG ad:B:s,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5 bd:B:s,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,ae:B:s,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 be:B:s,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 aq:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN bq:Z:DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD

I've verified that if I change the insert size in the input bam to reflect the full read length then there is no clipping in the output. This is the behavior I want, and I'm perfectly happy to change my upstream steps to have the insert size reflect the reality, but I would really appreciate if someone explained what's behind the uneven clipping so that I understand the program better.

Thank you in advance!

@nh13 nh13 added the question label Apr 25, 2022
@nh13
Copy link
Member

nh13 commented Apr 25, 2022

@genegolts what version of fgbio are you using? There was some behavior difference between 2.0 and 1.x.

@genegolts
Copy link
Author

It's v. 2.0.2

@nh13
Copy link
Member

nh13 commented Apr 25, 2022

Will need to fine to to look at this, that's going to be tough for this week. My apologies for the delay.

@nh13
Copy link
Member

nh13 commented Apr 25, 2022

Can you run fgbio SetMateInformation for now?

@genegolts
Copy link
Author

Just ran it. It outputs reads identical to my inputs. Thanks for looking into this!

@genegolts
Copy link
Author

genegolts commented Apr 26, 2022

Just to add visuals to the story, here's a slide describing the problem.
image

@jrm5100
Copy link
Contributor

jrm5100 commented May 16, 2022

We've run into this bug as well. We noticed a drop in coverage after FilterDuplexConsensusReads when upgrading to fgbio 2.0.2 despite GroupReadsByUmi being more permissive (keeping pairs with one unmapped read) which resulted in more consensus reads going into that step.

Running each step on identical input, FilterDuplexConsensusReads is unchanged, but CallDuplexConsensusReads trims some pairs differently. I can't provide sequences here, but this information may help narrow it down. These inputs are from the output of GroupReadsByUmi using fgbio v1.6

A pair that gets trimmed correctly

NB501910_HNMCVBGX7:2:11110:16491:14314  99      chr12   25245156        60      135M7S  =       25245156        135     
NB501910_HNMCVBGX7:2:11110:16491:14314  147     chr12   25245156        60      7S135M  =       25245156        -135 

A pair that results in one untrimmed read

Note that the problematic read has 79M with an 81bp template size.

NB501910_HNMCVBGX7:1:11301:15626:18726  83      chr12   112489160       60      61S81M  =       112489160       -81     
NB501910_HNMCVBGX7:1:11301:15626:18726  163     chr12   112489160       60      79M63S  =       112489160       81      

Using fgbio v1.6 results in consensus pairs with 135bp reads and 81bp respectively.

Using fgbio v2 results in identical results except that the second read in the second pair is not clipped and is 142 bp long.

Our best guess is that this is related to this change: #761

@mjhipp
Copy link
Contributor

mjhipp commented May 17, 2022

@genegolts would you mind testing your case with the changes from #842?

@genegolts
Copy link
Author

Pulled commit 3f1e664 and re-built the target. Still getting the same unevenly clipped consensus pair. Would it help if I uploaded the mini-bam file I'm testing this on?

@mjhipp
Copy link
Contributor

mjhipp commented May 18, 2022

@genegolts I think I can explain your case. During consensus making, the consensus maker will clip bases that extend past the mate. It does this by comparing the read's left-most position to the mate's left-most position (this change is introduced in fgbio v2). So when looking at the negative strand, it is comparing to the positive strand's primary alignment start position (which is half way through the read). So it then clips off all the bases in the negative strand that extend past that. This would not be fixed by #842 and will need a different fix, although still related to #761

@genegolts
Copy link
Author

@mjhipp Thank you for the explanation, it makes perfect sense. In my use case scenario, this behavior results in the sequence at chimeric or fusion breakpoints getting hard-clipped. I can see how in most situations you would indeed want to exclude unaligned sequence from the consensus, but in my case I actually don't want that to happen. I am wondering if I could make a feature request to add a parameter that would disable this type of hard-clipping.
In any case, thank you very much for answering my questions!

@nh13
Copy link
Member

nh13 commented Jun 27, 2022

@genegolts feature requests welcome, though to be transparent we’re all volunteer and so there’d be no time line on when we’d look at it (unless it one of clients wanted us to look at it).

@genegolts
Copy link
Author

genegolts commented Oct 11, 2022 via email

@genegolts
Copy link
Author

genegolts commented Oct 12, 2022

Hello again. One common use case for consensus reads is to detect "split alignments" as a sign of structural variation in a genome, and in this scenario you would absolutely want unclipped consensus sequences going into the aligner. This presents a dilemma since in the beginning of the process, in order to generate umi groups, clipped aligned positions of the raw reads are used. I would still argue though that once the UMI group has been identified, the consensus generation step should be able to (optionally) use the entire sequence of the constituent reads, without aggressive hard-clipping. (More sensitive criteria could be added, for example, recognizing when the negative strand protrudes past the aligned start of the positive strand which is itself soft-clipped, i.e., evidence that the aligned start might not be the actual start of the molecule.)

In other words, the resulting duplex consensus sequence in this scenario would not necessarily represent the start/end of the molecule but would be a true consensus of the reads obtained from that molecule. I think this feature would be welcomed by many users and would add value to this already excellent tool! I'd love to hear your thoughts on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants