-
Notifications
You must be signed in to change notification settings - Fork 241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mpileup revamp #1474
Mpileup revamp #1474
Conversation
Demonstration of the results based on known truth sets (SynDip and GIAB) are in the previous PR here: I need to redo these again as a few tweaks have happened since, if only to check I haven't created any regressions. When done I'll post here, but if someone has a specific truth set they want testing then do please let me know. Note it must be a truth set, and not just "does it increase TS/TV". My experiments there showed it often makes that metric poorer, while getting more results correct, as the previously missing variants were enriched for low-complexity data. |
As expected I spotted some more minor code tidyups immediately on posting this, so I'll correct and squash again. My apologies. |
Also a minor refactoring of bcf_call_glfgen for speed. BAQ is the dominant CPU time for mpileup. However it's not needed for the bulk of reads. We delay BAQ until we observe that a read has an overlap with others containing an indel. No indels means no BAQ. Furthermore, if we do have an indel, but this particular read spans the indel with more than 50bp either side then also don't realign it. This is based on the observation that most misalignments occur near the ends of reads where the optimal pair-wise alignment (read vs ref) differs from the optimal multiple alignment (read vs all others). Evaluations so far show this is around a 1.5 to 2-fold speed increase (depending on tuning parameters), while simultaneously slightly increasing SNP calling accuracy. The latter appears to be due to BAQ in non-indel regions sometimes reducing all reads to QUAL 0 in low complexity areas, blocking SNP calling from working. SNPs flanking indels also get blocked for similar reasons. This has two variants (see REALN_DIST #if) as an exploration of methods. This is enabled with a new -D option. By default it is as before, and -B still disables completely. If we like how it works though it's possible we could make this the default and add a +B for all-BAQ, -B for no-BAQ and have the default inbetween as auto-select BAQ. TODO: Consider adding a depth filter as low depth portions aren't optimal. Honour MPLP_REDO_BAQ in mpileup
The previous calc_mwu_bias returns values in the range 0 to 1. In experimentation I can find little correlation between these values and the likelihood of a variant being true or false. This is odd, given GATK's MQRankSum does a good job here. It appears it's the squashing of Mann Whitney U score to this range which may be penalising it, but perhaps there is something different in implementations too? (I started again from scratch based on Wikipedia.) This new function takes the same approach as GATK, implementing the Z-score version where a result of 0 means both distributions are comparable and above / below zero is used to indicate how extreme the U score is in either direction. Under testing appears to offer better discrimination between true and false SNPs, particularly for MQB. For clarity, we have new MQBZ, RPBZ, BQBZ and MQSBZ fields as the range of values differs so old filtering rules would be inappropriate. On a test set, this filter rule seemed to work effectively: -e "QUAL < $qual || DP>$DP || MQB < -4.5 || RPB > 4 || RPB < -4" On Chr22 of HG002.GRCh38.60x.bam this gave for 3 different values of $qual: SNP Q>0 / Q>=0 / Filtered SNP TP 39777 / 39777 / 39769 SNP FP 542 / 542 / 483 SNP FN 1316 / 1316 / 1324 InDel TP 4998 / 4998 / 4981 InDel FP 983 / 983 / 52 InDel FN 1945 / 1945 / 1962 SNP Q>0 / Q>=100 / Filtered SNP TP 39777 / 39567 / 39559 SNP FP 542 / 94 / 71 SNP FN 1316 / 1526 / 1534 InDel TP 4998 / 3934 / 4297 InDel FP 983 / 756 / 22 InDel FN 1945 / 3009 / 2646 SNP Q>0 / Q>=200 / Filtered SNP TP 39777 / 38074 / 38068 SNP FP 542 / 36 / 27 SNP FN 1316 / 3019 / 3025 InDel TP 4998 / 3200 / 2643 InDel FP 983 / 570 / 10 InDel FN 1945 / 3743 / 4300 Filtered is as per Q>=x but with the additional MQB / RPB filters. Versus naive QUAL filtering, at Q>=100 it removes 24.5% of the FP while increasing FN by 0.5%. Indel filtering is essential to remove the excessive FPs and used: -e "IDV < 3 || IMF < 0.08 || DP>$DP || QUAL < $qual/5 || \ ((5+$qual/6)/(1.01-IMF) < $qual && IDV<(3+$DP*$qual/1000))" [Something like this needs folding into the indel QUAL score as by itself it is a very poor indicator of quality. It's still not up to par even with it, but that's for a future project.] NB: Above scores are also with a modified BAQ calculation, to come in another PR. The old MWU test also now sports a uni-directional mode: The old p(U) calc gave eg <-4 and >4 both as near 0, but for some calculations we expect a bias in one direction and not the other. Eg MQB (mapping quality bias) is skewed as REF matching reads have higher MQUAL and ALT matching reads. The U and Z scores are now selectable via #define using the same function, but U is default for compatibility. Also add a soft-clip MWU test. This looks for REF / ALT bias in the numbers of soft-clipped bases near variants. NB: Unsure if it's helpful currently.
This is a multiplicative factor (albeit 1/x) on the BAQ score in the indel caller. Given the constant, arbitrary, cap of 111 this is more or less equivalent to permitting higher scores. The effect is to be more permissive in calling, giving greater recall rate at a cost of slightly reduced precision. A value of 1 would match the previous computation, but the new default is 2 (=> score * 1/2). On a 30x GIAB HG002 chr1 sample filtering a Q20 this removes 1/3rd of the false negatives, but at a cost of about 1/3rd more FP. At 60x it's about -40% FN + 20% FP. (Given the quantity of true calls hugely outweights false, this is a reasonable trade.) Also adjusted the default -h parameter from 100 to 500. This gives a huge reduction in false negative rates (as much as 3 fold), but also usually has a slightly beneficial decrease in FP. This parameter is used to indicate the likely sequencing error rates in homopolymers, which (for Illumina instruments at least) have greatly reduced since the original bcftools indel caller was written. High -h parameters though does marginally harm SNP calling precision, but not massively. Minimum base quality has changed from 13 to 1. Higher quality clip doesn't seem to convert to more accurate calls. Adjusted the minimum number of reads needed to call an indel from 1 to 2, and fraction from 0.2% to 5%.
These values weren't captured in the indel code so we can't post-filter on them. It's hard to know what is ALT and what is REF, but for now we take the naive approach of a read that doesn't have an indel is REF and ALT otherwise, irrespective of quality of match. Of these, MQBZ shows some discrimination (< -5 for 30x and < -9 for 150x), but RPBZ and SCBZ have overlapping distributions (although RPBZ may just about have a slim margin worthy of filtering at the right end). That said for both the dist of the true variants extends much further negative than the dist for the false ones, so while we cannot remove FP without removing TP, we could at least use this for computing a refined QUAL value.
These are now computed using their own arrays, as there was some leakage with the SNP stats. Also noticed while assessing the new metrics that there is a very high rate of incorrect indel genotypes, particularly HOM to HET. After experimentation, I realised that seqQ governs which candidate types end up in the genotype, and not indelQ. seqQ is now capped at indelQ and no higher. This has had a huge reduction in GT miscalls. [With amended compare_vcf.sh script to now validate GT also matches, in their own output line] 15x GIAB HG002 before: SNP Q>0 / Q>=10 / Filtered SNP TP 257554 / 255433 / 255224 SNP FP 1846 / 1049 / 998 SNP GT 883 / 853 / 882 SNP FN 6756 / 8877 / 9086 InDel TP 42021 / 41651 / 40762 InDel FP 709 / 594 / 347 InDel GT 7326 / 7306 / 7242 InDel FN 3583 / 3953 / 4842 15x GIAB HG002 after: SNP Q>0 / Q>=10 / Filtered SNP TP 257554 / 255433 / 255224 SNP FP 1846 / 1049 / 998 SNP GT 883 / 853 / 882 SNP FN 6756 / 8877 / 9086 InDel TP 43008 / 42651 / 41723 InDel FP 820 / 706 / 454 InDel GT 1098 / 1087 / 1036 InDel FN 2596 / 2953 / 3881 Note reduction to indel FN too (and modest increase in FP). It's even more pronounced on deep data. 150x GIAB HG002 before: SNP Q>0 / Q>=10 / Filtered SNP TP 262711 / 262628 / 262523 SNP FP 1947 / 1621 / 1323 SNP GT 253 / 244 / 239 SNP FN 1599 / 1682 / 1787 InDel TP 44451 / 44435 / 44384 InDel FP 4617 / 4455 / 1279 InDel GT 6333 / 6327 / 6321 InDel FN 1153 / 1169 / 1220 150x GIAB HG002 after: SNP Q>0 / Q>=10 / Filtered SNP TP 262711 / 262628 / 262523 SNP FP 1947 / 1621 / 1323 SNP GT 253 / 244 / 239 SNP FN 1599 / 1682 / 1787 InDel TP 45061 / 45053 / 44992 InDel FP 4798 / 4628 / 1408 InDel GT 448 / 446 / 441 InDel FN 543 / 551 / 612 Here we're seeing a 14 fold reduction in the rate of incorrect indel genotype assignment, and a 2 fold reduction in false negative rates. Also adjusted the default indel-bias down again from 2 to 1. This tweak was needed before, but the change in scoring function makes it better again with the default params.
The Short Tandem Repeat from Crumble has been added to the indel caller. This is used to calculate the STRBZ INFO metric (not so useful I think, so maybe to remove) and to skew the indel score slightly for reads overlapping an indel site that start or end in an STR. At low filtering levels I get these results or HG002 chr1: 15x previous: Q>0 / Q>=10 / Filtered InDel TP 42256 / 41744 / 41082 InDel FP 601 / 540 / 375 InDel GT 1265 / 1234 / 1191 InDel FN 3348 / 3860 / 4522 15x this: Q>0 / Q>=10 / Filtered InDel TP 42354 / 41806 / 41186 InDel FP 554 / 501 / 343 InDel GT 1180 / 1151 / 1121 InDel FN 3250 / 3798 / 4418 That's a decrease to all of FP, GT and FN. At 150x previous: Q>0 / Q>=10 / Filtered InDel TP 44854 / 44838 / 44775 InDel FP 3541 / 3387 / 869 InDel GT 517 / 516 / 507 InDel FN 750 / 766 / 829 At 150x this: Q>0 / Q>=10 / Filtered InDel TP 44917 / 44897 / 44890 InDel FP 778 / 753 / 750 InDel GT 412 / 409 / 410 InDel FN 687 / 707 / 714 This is also a decrease to all of FP, GT and FN. In particular unfiltered FP has a huge drop, so naive usage of bcftools will be considerably more accurate. TODO: this clearly works well for indels. Maybe consider adding it for SNPs too?
I noticed a few issues while doing this, but they'll be resolved in subsequent commits as it'd be nigh on impossible to identify the changes if included here. This should give identical results to the previous commit, aiding validation that the refactoring hasn't changed anything.
The checking for tbeg/tend was incorrect as one is in ref coords and the other is local to the buffer, making the test basically always true. This just accepts that and removes the condition, as the end result appears to show the code works.
- Remove STRBZ and BQBZ as unuseful stats. - Only compute stats once per indel rather than per type. - Replace some pointless mallocs with fixed arrays as the maximum number of types was already hard-coded. - General code tidyups
It only needs to check first few and last few elements to obtain soft-clip data, as it's trying to turn absolute position / length into clip-adjusted position / length. This function can easily be 30%+ of the CPU time when using long read technologies.
These needs new parameters for BAQ (also analous ones in htslib realn.c) as the error models differ with most errors in indels instead of substitutions. Detection of platform is very crude currently - just length. TODO: use @rg PL field and DS containing "CCS". Protect against attempts to call BAQ on negatived sized regions. "tend" can become less than "tbeg" as tend is left-justified and tbeg right-justified when it comes to deletions. If the entire left..right portion is contained within a single cigar D op, then tend<tbeg. Protect against excessive BAQ band widths, which is both very costly in CPU time (with ONT I saw one single BAQ call take 30 mins), and uses vast amounts of memory. For now this is a hard cutoff, but we may consider making it adjustable. BAQ has minimal benefit to PacBio CCS and it's untested on ONT as it triggered the out-of-memory killer. Also add a maximum read length for BAQ (--max-read-len) so we can trivially simply disable it for all long-read data while keeping it enabled on short reads, eg mixed data sets with ONT, PacBio and Illumina.
This was already done for long reads, but it turns out the second call is detrimental for the short read data sets I tested too.
There is now a maximum quality value, defaulting to 60 so above Illumina values. This is useful or PacBio CCS which produces abnormally high qualities leading to over-certainty in calls, especially for incorrect genotype assignment. SNP base quality is now the minimum of this base qual plus the base quality either-side with an additional delta (0 to just use MIN of this / neighbour, but also permitting neighbour to only affect us if significantly lower). CCS data commonly has a neighbouring low qual value adjacent to in incorrect but high quality substitution. However it turns out this is also beneficial to the HG002 Illumina data. An example of the effect for Illumina 60x data (with adjustment on min QUAL to get the same FN rate as the scores are slightly suppressed now): Before: SNP Q>0 / Q>=20 / Filtered SNP TP 262541 / 262287 / 262283 SNP FP 2313 / 1442 / 1405 SNP GT 287 / 269 / 269 SNP FN 1769 / 2023 / 2027 After: SNP Q>0 / Q>=15 / Filtered SNP TP 262503 / 262298 / 262294 SNP FP 1787 / 1349 / 1312 -6.6% SNP GT 283 / 268 / 268 ~= SNP FN 1807 / 2012 / 2016 -0.5% For 31x PacBio CCS, the quality assignment suppression is more pronounced due to the excessively high QUALs in the bam records, but the FN/FP tradeoff is the same (again tuning to try and compare with ~= FN): Before: SNP Q>0 / Q>=47 / Filtered SNP TP 263847 / 263693 / 263693 SNP FP 4908 / 3089 / 3089 SNP GT 804 / 793 / 793 SNP FN 463 / 617 / 617 After: SNP Q>0 / Q>=15 / Filtered SNP TP 263779 / 263692 / 263692 SNP FP 3493 / 2973 / 2973 -10% SNP GT 509 / 503 / 503 -37% SNP FN 531 / 618 / 618 ~= Finally, added -X,--config STR option to specify sets of parameters tuned at specific platforms. Earlier experiments showed the indel caller on PacBio CCS data needs very different gap-open and gap-extend parameters. It also benefits a bit from raising minimum qual to 5 and because this is a new option we also enable partial realignment by default. The ONT configuration disables indels (they're *glacial* to compute at the moment for long-cigar reads and really don't give any real accuracy at present), disables BAQ, and tweaks min(filter) and max(clip) qualities. Man page updates for new options: --max-read-len (-M), --config (-X), --max-BQ and --delta-BQ.
The fixes in commit 3c1205c while still here, no longer worked in this branch due to the swapping of seqQ and baseQ. The fix however seemed inappropriate and caused other issues, such as samtools#1446. The new logic is to check p->indel to decide if the read needs b=0 (REF) setting rather than a naive qual check. This greatly reduces the false positive rate of indel calling, especially when deep, however it also increases GT errors a bit. I've tried to get some middle ground where the new seqQ used for REF calls isn't mostly the old qual, but with a partial mix in of the base qual. I don't like heuristics, but it works well on the data sets tested and fixes both the issue addressed by 3c1205c and also samtools#1446, while keeping the FP reduction. Beneficial at 30x, 60x and 150x both filtered and unfiltered (huge win). At 15x it's a small win for unfiltered and a moderate loss for filtered, almost entirely due to higher GT error rate. Possibly this is filtering rules no longer appropriate, or maybe it's simply not good on shallower data. Needs investigation.
f0f5891
to
e4e1610
Compare
For backwards compatibility, the MWU_ZSCORE #ifdef has been replaced by the -U option. The default is the new Z score, but we can now output the same INFO fields as before.
The -D option is now --full-BAQ and reinstates the previous mpileup default.
3e2105f
to
354b7b5
Compare
I tidied up a couple of those commits to remove old debug code or fix some no-longer relevant comments. I also added a couple new commits.
A question regarding that, should we undo the uni-directional tweak to MWU that I put in for mapping quality, so it is identical? (Also maybe should it call the old implementation, which still exists but does give slightly different scores I think?) |
I confirmed I got identical results to my more previous PR, so the mass juggling of commits didn't break things it appears. I decided to cross-validate my assessments using a different tool, this time RTG's vcf-eval by ploting the produced roc files (without Bcftools improved, albeit mostly due to fixing the default parameters for low coverage. On higher coverage the new code is doing much better, beating GATK even. This is vastly different. I couldn't zoom up to the more appropriate top-right corner as v1.12 doesn't get remotely close. It's clear indels are still poor in Bcftools. Compared to GATK it simply has too many false positives (reducing precision), however the recall is good. The old code had very poor recall and precision on this data set. |
This is a rebased, reordered / squashed version of #1363.
A summary of the changes are:
BAQ is now tri-state; fully on (still the default), off, and partial. Partial BAQ only runs BAQ on reads which overlap potentially problematic indels. This is both a significant speed increase, but also an improvement in calling rates as BAQ can sometimes be detrimental so we only want to use it when we have a significant chance of fixing something.
Added a new calc_mwu_biasZ function (Z-score) to replace calc_mwu_bias. These are currently used if
MWU_ZSCORE
is defined in bam2bcf.h (which it is). If we decide we prefer these metrics, then we can remove this ifdef and the old code, but for now there is a compilation choice to be made. (And no user way to configure.) Alternatively it could become a run-time choice with a bit more work. Please advise.Added MQBZ and RPBZ INFO fields to indels, for filtering on.
Added a soft-clip bias SCBZ for SNPs, based on length of soft-clips near the SNP site, indicating potential misalignments. (Although questionable how helpful this is.)
Very significant improvement to GT indel assignment accuracy. In some data sets it reduces incorrect assignment by an order of magnitude.
Adjusted default parameters to mpileup. -h has gone from 100 to 500, -Q from 13 to 1, -F from 0.002 to 0.05 and -m from 1 to 2.
Added a
--indel-bias
mpileup option to skew FN vs FP. This appears to be better than modifying call-P
as it's specific to indels.Changed the indel scoring algorithm, giving better FP vs FN discrimination.
Added the short tandem repeat finder from Crumble. This is used for correcting the indel score. Put simply, indels in low-complexity regions having a higher bar to pass than those in normal regions. This is a modest improvement to FN, FP and genotype assignment errors. (To consider - it may also be worth utilising this on SNPs too.)
A major refactoring of the extremely long
bcf_call_gap_prep
function. This was done part way through my work, meaning it's very hard to reorder / squash commits from both sides of that refactor commit. You may also wish to diff in two stages - up to that commit and beyond that commit, as otherwise it's hard to see what's changed.get_position()
has been rewritten, making it significantly faster on long read data. Previously it was quite easy for a 1/3rd of the entire CPU time to be spent here. Now it's negligible.Improved calling accuracy for long-read technologies. This has mainly been tested on PacBio CCS data. ONT SNP calling is significantly more accurate too, but indels are still a right-off and glacial. There is more work to be done on the indel caller's use of BAQ, which cripples speed on long-read technology.
The indel caller now only has one set of
probaln_glocal
parameters instead of two. I initially removed the second to speed up long-read calling, but it improved it too. It turned out fortuitously to also improve the Illumina data. I assume it helped at one point, but on the more modern data sets it served no purpose (was detrimental).Added
-X, --config
(although perhaps --platform is better? Feel free to suggest) to mpileup. This is simply a collection of parameter values, tuned to work on that specific name. It could perhaps auto-detect by platform RG fields, but I feel being explicit is likely the safer option.Added a
--max-BQ
option to clip (rather than discard) high quality bases. This is used to limit overly confident platforms from producing unrealistic SNP calls (I'm looking at you CCS!).Added a
--delta-BQ
option to permit the neighbouring base quality to be utilised, if lower by a specific delta to the SNP base itself. This helps CCS calling where often incorrect bases have the low confidence in the next base rather than the one in error.Replaced the heuristic for indel allele (REF vs ALT) with a newer, albeit more complex, one. This fixes bcftools call is detecting lots of REF reads that's not in the bam file (small test data set included) #1446 and significantly reduces indel false positive rates.
Some questions to consider:
Should we make
-D
the default? I think yes, but then how do we define full BAQ? I guess that can become -D or a similar option.I have no idea what impact this has on rare variants or on many-sample variant calling, as I don't have truth sets for either so have no way of validating this myself. I suggest interested parties offer up their services!
This will fail tests, as the output differs in so many ways. However I'm not really sure I'm the best person to judge which differences matter. I think that's a task for review perhaps. Obviously if we're happy, we can just generate new expected test results to silence it.