-
Notifications
You must be signed in to change notification settings - Fork 240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Significant difference between IDV and AD when calling certain RNA-seq indels with mpileup #2277
Comments
What version of bcftools is this? The latest one has a revised indel caller with |
@jkbonfield Thank you for the suggestion. I'm using the latest bcftools 1.20. I tried rerunning the command with --indels-cns but unfortunately it doesn't resolve this issue. Somehow the REF AD number is also off this time:
|
Is there test data available? Also sorry I don't use IGV and don't understand the display. What are the shaded boxes showing? Why don't all alignments have shading? Is it soft-clipping for example? (In which case the reads are tiny.) Seeing the SAM file would be helpful. |
Indel_AD_GRCm38.sam.txt |
So the https://github.com/samtools/bcftools/blob/develop/bam2bcf_edlib.c#L1567-L1573 Commenting out those lines gives me this:
Putting it through bcftools call gives me QUAL of 106 and GT 0/1, matching the most likely PL information from above. My question to myself now is why I ever added those lines to filter out ref skips? |
In answer to my own question - it wasn't me who added those. They were duplicated (by request) from bam2bcf_indel.c, which itself was moved from Samtools. They appeared here in 2011 with no real explanation other than to fix a bug. What bug I have no idea. The same code exists in bam2bcf_indel.c, which is what you get if you don't use It's clearly code that is long overdue for reevaluation. It's quite possible whatever it was that broke bcftools calling when reads had ref-skips in them is no longer affected. It certainly seemed to give us the correct answer anyway. |
Mpileup removes alignments using the cigar ref skip operator ("N"). This was originally added in 2011 in samtools/samtools#d1643d6 with the commit message of "fixed a bug in indel calling related to unmapped and refskip reads". Unfortunately I don't know what that bug was, but removing the code shows it still works (at least for some data!). We need better understanding of what's going on and why it was added, so perhaps we should add a command line option to control this instead? Fixes samtools#2277
Mpileup removes alignments using the cigar ref skip operator ("N"). This was originally added in 2011 in samtools/samtools#d1643d6 with the commit message of "fixed a bug in indel calling related to unmapped and refskip reads". Unfortunately I don't know what that bug was, but removing the code shows it still works (at least for some data!). We need better understanding of what's going on and why it was added, but this PR makes it optional, keeping the default as before. Note there appears to be no filtering of BAM_CREF_SKIP in indels-2.0 so the option would be a nop there. This is an alternative PR to samtools#2281. I've leave it to the project maintainer as to what is preferable: removing the (no longer needed?) filtering, or keeping the default behaviour identical and adding a new option instead (which is safer, but possibly leads to accidental bad calls due to not noticing a new option has appeared). Fixes samtools#2277
Thank you so much for figuring this out! I agree that reads with Ns in CIGARs shouldn't be simply filtered out like that since they are abundant in RNA-seq libraries. So for now the solution is to comment out bam2bcf_edlib.c:1567-1573 and bam2bcf_indel.c:853-859 and re-compile bcftools? |
That works, or check out one of the two PRs above (the second makes it optional so is probably the one Petr will go with) and you don't need to edit the code, but the changes are minimal obviously. You only need to change one bit too. bam2bcf_edlib.c is for the |
I'm using mpileup to compute the mutation frequencies of a list of known variant sites in RNA-seq. For a small number of indels the AD numbers in the output are much lower than what I'm seeing in IGV. For instance, when I run the following command:
bcftools mpileup -r 11:69587265 -f <GRCm38 fasta> --annotate FORMAT/AD,FORMAT/DP,INFO/AD -F 0.001 --max-depth 10000 --max-idepth 10000 -Q20 -x -A --no-BAQ --tandem-qual 10000 <RNA-seq bam> | grep -v "^#"
Here is the output I'm getting:
11 69587265 . CA CAA 0 . INDEL;IDV=77;IMF=0.616;DP=124;AD=47,2;I16=24,23,1,1,1880,75200,80,3200,872,17236,40,800,1057,25169,46,1066;QS=0.908467,0.0915332;VDB=0.56;SGB=-0.453602;RPBZ=-0.571191;MQBZ=2.56432;MQSBZ=0;BQBZ=0;SCBZ=-1.87886;MQ0F=0 PL:DP:AD 0,106,55:49:47,2
And here is how this site looks like in IGV:
I am aware that some low quality reads are filtered during variant calling and the reported AD number will be lower than the raw IDV. However upon manual examination most of the reads seem to be high-quality and correctly aligned. Such a huge drop from 77 to 2 seems counter-intuitive to me and I wonder if this is the expected behavior of mpileup.
(One thing I notice is that most of the read pairs containing the insertion are spliced except 2, which happens to be the same as the AD reported)
The text was updated successfully, but these errors were encountered: