CallDuplexConsensusReads hard-clips ONE of the two duplex reads #831
@genegolts what version of fgbio are you using? There was some behavior difference between 2.0 and 1.x. |
It's v. 2.0.2 |
Will need to find time to look at this; that's going to be tough for this week. My apologies for the delay. |
Can you run |
Just ran it. It outputs reads identical to my inputs. Thanks for looking into this! |
We've run into this bug as well. We noticed a drop in coverage after Running each step on identical input, A pair that gets trimmed correctly
A pair that results in one untrimmed read Note that the problematic read has
Using fgbio v1.6 results in consensus pairs with 135bp reads and 81bp respectively. Using fgbio v2 results in identical results except that the second read in the second pair is not clipped and is 142 bp long. Our best guess is that this is related to this change: #761 |
@genegolts would you mind testing your case with the changes from #842? |
Pulled commit 3f1e664 and re-built the target. Still getting the same unevenly clipped consensus pair. Would it help if I uploaded the mini-bam file I'm testing this on? |
@genegolts I think I can explain your case. During consensus making, the consensus maker will clip bases that extend past the mate. It does this by comparing the read's left-most position to the mate's left-most position (this change is introduced in fgbio v2). So when looking at the negative strand, it is comparing to the positive strand's primary alignment start position (which is half way through the read). So it then clips off all the bases in the negative strand that extend past that. This would not be fixed by #842 and will need a different fix, although still related to #761 |
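For anyone following along, here is a minimal sketch of the comparison described above, written in Python rather than fgbio's Scala. It is a toy model under stated assumptions, not fgbio's actual code: the `ToyRead` type, the `bases_past_mate` helper, and all coordinates are invented for illustration. It only shows why comparing against the mate's aligned start (rather than its unclipped start) removes bases from the negative-strand read when the positive-strand read is soft-clipped at its start.

```python
from dataclasses import dataclass


@dataclass
class ToyRead:
    """Toy alignment record; every field is hypothetical, not an fgbio type."""
    start: int               # 1-based aligned start (left-most aligned base)
    end: int                 # 1-based aligned end, inclusive
    leading_soft_clip: int   # soft-clipped bases before `start`
    negative_strand: bool

    @property
    def unclipped_start(self) -> int:
        # Where the read would start if its soft-clipped bases were aligned.
        return self.start - self.leading_soft_clip


def bases_past_mate(read: ToyRead, mate_start: int) -> int:
    """Aligned bases of a negative-strand read that lie left of `mate_start`."""
    if not read.negative_strand:
        return 0
    overhang = mate_start - read.start
    # Never clip fewer than zero bases or more than the aligned length.
    return max(0, min(overhang, read.end - read.start + 1))


# Positive-strand read whose first 70 bases are soft-clipped, so its
# aligned start sits part-way through the read (invented coordinates).
forward = ToyRead(start=1070, end=1141, leading_soft_clip=70, negative_strand=False)
# Negative-strand mate aligned across nearly the whole molecule.
reverse = ToyRead(start=1001, end=1142, leading_soft_clip=0, negative_strand=True)

clipped_vs_aligned = bases_past_mate(reverse, forward.start)
clipped_vs_unclipped = bases_past_mate(reverse, forward.unclipped_start)
print(clipped_vs_aligned, clipped_vs_unclipped)
```

In this toy case the comparison against the aligned start removes bases from the negative-strand read, while the comparison against the unclipped start removes none. Whether fgbio should use one or the other is exactly the design question in this thread.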
@mjhipp Thank you for the explanation, it makes perfect sense. In my use case scenario, this behavior results in the sequence at chimeric or fusion breakpoints getting hard-clipped. I can see how in most situations you would indeed want to exclude unaligned sequence from the consensus, but in my case I actually don't want that to happen. I am wondering if I could make a feature request to add a parameter that would disable this type of hard-clipping. |
@genegolts feature requests are welcome, though to be transparent we’re all volunteers, so there’d be no timeline on when we’d look at it (unless one of our clients wanted us to look at it). |
Hello again. One common use case for consensus reads is to detect "split alignments" as a sign of structural variation in a genome, and in this scenario you would absolutely want unclipped consensus sequences going into the aligner. This presents a dilemma since in the beginning of the process, in order to generate umi groups, clipped aligned positions of the raw reads are used. I would still argue though that once the UMI group has been identified, the consensus generation step should be able to (optionally) use the entire sequence of the constituent reads, without aggressive hard-clipping. (More sensitive criteria could be added, for example, recognizing when the negative strand protrudes past the aligned start of the positive strand which is itself soft-clipped, i.e., evidence that the aligned start might not be the actual start of the molecule.) In other words, the resulting duplex consensus sequence in this scenario would not necessarily represent the start/end of the molecule but would be a true consensus of the reads obtained from that molecule. I think this feature would be welcomed by many users and would add value to this already excellent tool! I'd love to hear your thoughts on this. |
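To make the "more sensitive criteria" idea above concrete, here is a rough sketch of what such a check could look like. This is purely hypothetical: `keep_overhang` and its parameters are names invented here, nothing like it exists in fgbio as far as this thread shows, and a real implementation would need to handle both ends of the molecule and both strand orientations.

```python
def keep_overhang(neg_start: int, pos_start: int, pos_leading_soft_clip: int) -> bool:
    """Return True when clipping the negative-strand overhang looks unsafe.

    Hypothetical rule: the negative-strand read protrudes left of the
    positive-strand read's aligned start, and that positive-strand read is
    itself soft-clipped at its start, so its aligned start may not be the
    true start of the molecule.
    """
    protrudes = neg_start < pos_start
    return protrudes and pos_leading_soft_clip > 0


# Invented coordinates: same overhang, with and without a soft-clipped mate.
with_soft_clip = keep_overhang(neg_start=1001, pos_start=1070, pos_leading_soft_clip=70)
without_soft_clip = keep_overhang(neg_start=1001, pos_start=1070, pos_leading_soft_clip=0)
print(with_soft_clip, without_soft_clip)
```

The point of the sketch is only that the decision could be made per pair from information already in the alignments, so the default clipping behavior would stay unchanged for ordinary fully aligned mates.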
Hello fgbio folks,
I am encountering an issue where one of the consensus sequences in a duplex pair gets clipped to the insert size while the other read doesn't get clipped. Here is an example. First, these are the UMI reads going in:
Notice that the pairs overlap each other almost fully, with a 1bp overhang. The insert size here was set based on the alignment, as were the soft-clip tags. This is not the true insert size, and I suspect this has something to do with the odd output.
Here is the output of the program (java -jar -Xmx32G -XX:ActiveProcessorCount=1 /home/ggoltsman/utils/fgbio/target/scala-2.13/fgbio-2.0.2-57a72b4-SNAPSHOT.jar CallDuplexConsensusReads --min-reads=4 0 0 --consensus-call-overlapping-bases false --input=UMI_1153019.s.bam). The second consensus read is clipped to 73 bp while the first one is unclipped:
I've verified that if I change the insert size in the input bam to reflect the full read length then there is no clipping in the output. This is the behavior I want, and I'm perfectly happy to change my upstream steps to have the insert size reflect the reality, but I would really appreciate if someone explained what's behind the uneven clipping so that I understand the program better.
Thank you in advance!