How is PctSeqSimilarity calculated? #163

Calliiii · 2023-08-01T08:22:46Z

Calliiii
Aug 1, 2023

Hello,

I have a VCF file that I'm comparing against a truth set.
I found that the results from truvari v1.3.4 were quite different from those of truvari v3.5.0 when using exactly the same comp and base VCFs
(precision 0.09 for v3 and 0.79 for v1).
I then compared the two tp-call.vcf files from v1 and v3 and looked into one call that is a TP in v1 but an FP in v3.
I found that PctSeqSimilarity was quite different, and that was the reason this call did not pass in v3 (I used the default pctsim threshold of 0.7).
Details are below.

v1, tp-call:
chr1 1598413 call_1325 A <DEL> 155 PASS END=1598580;CIEND=0,0;CIPOS=0,0;SVTYPE=DEL;SVLEN=-167;PS=.;HAP_ALLELIC_FRAC=.;ALLELIC_FRAC=.;PAIRS=.;SPLIT=.;WildCov=.;MolTtl=.;MolTtlNoR=.;MolDel=.;MolDelNoR=.;MolWild=.;MolWildNoR=.;PVAL=.;BGSize=.;BGImparityPval=.;BGTtlRCnt=.;BGHP1RCnt=.;BGHP2RCnt=.;BGBaseCov=.;BGPhaseFrac=.;NRead=.;SOURCE=LOCAL_ASM;TruScore=1.33333;NumNeighbors=1;NumThresholdNeighbors=1;MatchId=3080;PctSeqSimilarity=1;PctSizeSimilarity=1;PctRecOverlap=1;SizeDiff=0;StartDistance=0;EndDistance=0 GT 1/1

v3, fp:
chr1 1598413 call_1325 A <DEL> 155 PASS END=1598580;CIEND=0,0;CIPOS=0,0;SVTYPE=DEL;SVLEN=-167;PS=.;HAP_ALLELIC_FRAC=.;ALLELIC_FRAC=.;PAIRS=.;SPLIT=.;WildCov=.;MolTtl=.;MolTtlNoR=.;MolDel=.;MolDelNoR=.;MolWild=.;MolWildNoR=.;PVAL=.;BGSize=.;BGImparityPval=.;BGTtlRCnt=.;BGHP1RCnt=.;BGHP2RCnt=.;BGBaseCov=.;BGPhaseFrac=.;NRead=.;SOURCE=LOCAL_ASM;PctSeqSimilarity=0.0118;PctSizeSimilarity=1;PctRecOverlap=1;SizeDiff=0;StartDistance=0;EndDistance=0;GTMatch;TruScore=67;MatchId=7.0.0 GT 1|1

Thank you in advance for your reply.

best,
Calli

ACEnglish · 2023-08-01T16:19:36Z

ACEnglish
Aug 1, 2023
Maintainer

The main difference is that 1.3 used Levenshtein distance to calculate sequence similarity whereas 3.5 uses edlib.

Another difference may be that v1.3 replaced <DEL> with the reference sequence, whereas I believe 3.5 may have incorrectly compared the literal string "<DEL>" to whatever the alternate sequence is in the baseline set of calls. Presumably the baseline call is identical, since PctSizeSimilarity and PctRecOverlap are both 1. You can look in fn.vcf for "MatchId=7.0.0" in the 3.5 results to see the baseline call.
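To illustrate why comparing the literal string "<DEL>" against a real sequence scores so poorly, here is a minimal pure-Python sketch. The normalization (1 minus edit distance over the longer length) and the made-up 168 bp sequence are assumptions for illustration only; this is not the formula used by either Truvari v1.3 or v3.5, and the exact numbers will differ from the PctSeqSimilarity values shown in the thread.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (each edit costs 1)."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the rows short
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Illustrative normalization (an assumption): 1 - distance / longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


deleted = "ACGT" * 42  # made-up 168 bp stand-in for the deleted reference sequence

print(similarity(deleted, deleted))   # two sequence-resolved, identical alleles
print(similarity("<DEL>", deleted))   # literal symbolic allele vs. a real sequence
```

Under these assumptions the identical pair scores 1.0, while the symbolic allele shares essentially nothing with the nucleotide sequence and lands at or near zero, far below a 0.7 threshold.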

The short-term solution is that 'unresolved' SVs (those with <TYPE> annotations instead of sequences) should either use -p 0 to turn off sequence similarity comparison or have their alt sequences filled in beforehand with something like:

ref=GRCh38_1kg_mainchrs.fa
in_vcf=original.vcf.gz
out_vcf=resolved.vcf.gz

zcat "$in_vcf" \
    | python sequence_fixer.py "$ref" \
    | bcftools norm --check-ref s --fasta-ref "$ref" -N -m-any \
    | bgzip > "$out_vcf"
tabix "$out_vcf"

where sequence_fixer.py is:

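import sys
import pysam

# Usage: zcat in.vcf.gz | python sequence_fixer.py reference.fa > out.vcf
ref_fn = sys.argv[1]

vcf = pysam.VariantFile("/dev/stdin")
ref = pysam.FastaFile(ref_fn)

out = pysam.VariantFile("/dev/stdout", 'w', header=vcf.header)
for entry in vcf:
    # Records without an ALT are passed through unchanged
    alt = entry.alts[0] if entry.alts else None
    if alt == '<DEL>':
        # Put in DEL sequence: REF becomes the deleted span, ALT the anchor base
        entry.ref = ref.fetch(entry.chrom, entry.start, entry.stop)
        entry.alts = [entry.ref[0]]
    elif alt is not None and alt.startswith('<'):
        # remove INV/DUP/INS which aren't sequence resolved
        continue
    out.write(entry)
out.close()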
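The coordinate handling in the DEL branch can be checked without pysam or a real reference. The sketch below uses a hypothetical helper, `resolve_del`, and a made-up toy sequence; it only assumes the usual VCF convention that POS is 1-based and END is inclusive, which corresponds to a 0-based, half-open slice from POS-1 to END.

```python
from typing import Tuple


def resolve_del(chrom_seq: str, pos: int, end: int) -> Tuple[str, str]:
    """Return (REF, ALT) for a <DEL> at 1-based POS with inclusive END.

    REF covers the anchor base at POS plus the deleted bases through END;
    ALT is just the anchor base. Hypothetical helper for illustration.
    """
    if not 1 <= pos <= end <= len(chrom_seq):
        raise ValueError("POS/END fall outside the sequence")
    ref_allele = chrom_seq[pos - 1:end]  # 0-based, half-open slice
    return ref_allele, ref_allele[0]


toy_chrom = "GATTACAGATTACA"  # made-up sequence, not a real reference
ref_allele, alt_allele = resolve_del(toy_chrom, pos=3, end=7)
print(ref_allele, alt_allele)
print(len(ref_allele) - len(alt_allele))  # number of deleted bases, END - POS
```

With this convention the number of deleted bases equals END - POS, which matches the first record in the thread: 1598580 - 1598413 = 167, the absolute value of its SVLEN.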

Long-term, I'll need to do one or more of the following:

Figure out how to reconcile the new default sequence comparison approach with unresolved SVs, with an optional --reference to pull sequence and fix unresolved DELs for users.
Make a tool from the above scripts to let users fix their own VCFs.
Raise hard errors when sequence comparison is turned on and unresolved variants are present.
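As a rough illustration of the last option, a pre-flight check could scan a VCF for symbolic ALT alleles and refuse to proceed when sequence comparison is enabled. This is only a sketch under stated assumptions: `find_unresolved` and `check_resolved` are hypothetical names, not part of Truvari, and the check works on plain tab-separated VCF text rather than on Truvari's internals.

```python
def find_unresolved(vcf_lines):
    """Return (chrom, pos, alt) for every symbolic ALT such as <DEL> or <INS>."""
    hits = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, alts = fields[0], int(fields[1]), fields[4].split(",")
        hits.extend((chrom, pos, a) for a in alts
                    if a.startswith("<") and a.endswith(">"))
    return hits


def check_resolved(vcf_lines, pctsim):
    """Raise if sequence comparison is on (pctsim > 0) and symbolic ALTs exist."""
    if pctsim <= 0:
        return  # sequence comparison is off, nothing to check
    hits = find_unresolved(vcf_lines)
    if hits:
        chrom, pos, alt = hits[0]
        raise ValueError(
            f"{len(hits)} unresolved ALT(s), first at {chrom}:{pos} {alt}; "
            "resolve them or disable sequence comparison"
        )


records = [
    "##fileformat=VCFv4.2",
    "chr1\t1598413\tcall_1325\tA\t<DEL>\t155\tPASS\tSVTYPE=DEL",
    "chr1\t100\t.\tAT\tA\t.\tPASS\t.",
]

check_resolved(records, pctsim=0)  # passes: comparison disabled
try:
    check_resolved(records, pctsim=0.7)
except ValueError as err:
    print(err)
```

Failing early like this would surface the problem before any misleading PctSeqSimilarity values are written to the output.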

Calliiii Aug 2, 2023
Author

Thanks for the reply. Yes, I checked this record in the baseline VCF and the sequence is identical. I also found that the v3 result was exactly the same as v1 when I added -p 0.

But there are also some records that appear as TPs in both v1 and v3 even though their ALT field is also <DEL>. I don't know why pctsim was calculated correctly this time...

Details are below:
v1-tpcall:
chr1 3717002 call_1700 C <DEL> 140 PASS END=3717152;CIEND=0,0;CIPOS=0,0;SVTYPE=DEL;SVLEN=-150;PS=.;HAP_ALLELIC_FRAC=.;ALLELIC_FRAC=.;PAIRS=.;SPLIT=.;WildCov=.;MolTtl=.;MolTtlNoR=.;MolDel=.;MolDelNoR=.;MolWild=.;MolWildNoR=.;PVAL=.;BGSize=.;BGImparityPval=.;BGTtlRCnt=.;BGHP1RCnt=.;BGHP2RCnt=.;BGBaseCov=.;BGPhaseFrac=.;NRead=.;SOURCE=LOCAL_ASM;TruScore=1.19555;NumNeighbors=1;NumThresholdNeighbors=1;MatchId=3100;PctSeqSimilarity=0.995305;PctSizeSimilarity=1;PctRecOverlap=0.596026;SizeDiff=0;StartDistance=-61;EndDistance=-61 GT 0/1

v3-tpcall:
chr1 3717002 call_1700 C <DEL> 140 PASS END=3717152;CIEND=0,0;CIPOS=0,0;SVTYPE=DEL;SVLEN=-150;PS=.;HAP_ALLELIC_FRAC=.;ALLELIC_FRAC=.;PAIRS=.;SPLIT=.;WildCov=.;MolTtl=.;MolTtlNoR=.;MolDel=.;MolDelNoR=.;MolWild=.;MolWildNoR=.;PVAL=.;BGSize=.;BGImparityPval=.;BGTtlRCnt=.;BGHP1RCnt=.;BGHP2RCnt=.;BGBaseCov=.;BGPhaseFrac=.;NRead=.;SOURCE=LOCAL_ASM;PctSeqSimilarity=0.9453;PctSizeSimilarity=1;PctRecOverlap=0.596;SizeDiff=0;StartDistance=-61;EndDistance=-61;GTMatch;TruScore=84;MatchId=32.0.0 GT 0/1

ACEnglish · 2023-08-02T03:48:50Z

ACEnglish
Aug 2, 2023
Maintainer

I was looking through the v1 code from 4 years ago and the v3 code from a year ago, and I'm starting to remember all the details of exactly how it works.

But the real answer is that non-sequence-resolved calls (e.g. <DEL>) can't be relied upon for anything related to PctSeqSimilarity, so sequence comparison shouldn't be used for them (-p 0), as you've already seen. Additionally, I'd recommend moving to Truvari v4.0, which has full documentation in the wiki on exactly how sequences are compared. However, the same answer applies: non-sequence-resolved calls shouldn't have sequence similarity calculated.

I'm staging for a v4.1 to be released in a month or so and in it I'll try to put in the solutions listed previously.
