Hello, I have a VCF file and I'm comparing it with a truth set.
v1, tp-call:
v3, fp:
I appreciate your reply. Best,
The main difference is that v1.3 used Levenshtein distance to calculate sequence similarity, whereas v3.5 uses edlib. Another difference may be that v1.3 replaced

The short-term solution is 'unresolved' SVs (those with

ref=GRCh38_1kg_mainchrs.fa
in_vcf=original.vcf.gz
out_vcf=resolved.vcf.gz
zcat $in_vcf \
| python sequence_fixer.py $ref \
| bcftools norm --check-ref s --fasta-ref $ref -N -m-any \
| bgzip > $out_vcf
tabix $out_vcf

where sequence_fixer.py is:

import sys
import pysam
ref_fn = sys.argv[1]
vcf = pysam.VariantFile("/dev/stdin")
ref = pysam.FastaFile(ref_fn)
out = pysam.VariantFile("/dev/stdout", 'w', header=vcf.header)
for entry in vcf:
    # Put in DEL sequence: REF becomes the deleted reference bases,
    # ALT becomes the single anchor base
    if entry.alts[0] == '<DEL>':
        entry.ref = ref.fetch(entry.chrom, entry.start, entry.stop)
        entry.alts = [entry.ref[0]]
    elif entry.alts[0].startswith('<'):
        # remove INV/DUP/INS which aren't sequence resolved
        continue
    out.write(entry)

Long-term, I'll need to do one or more of the following:
I was looking through the v1 code from 4 years ago and the v3 code from a year ago, and I'm starting to remember all the details of exactly how it works.

But the real answer is that non-sequence-resolved calls (e.g. <DEL>) can't be relied upon for anything related to PctSeqSimilarity, so sequence similarity shouldn't be used with them (-p 0), as you've already seen. Additionally, I'd recommend moving to Truvari v4.0, which has full documentation in the wiki on how exactly sequences are compared. However, the same answer applies: non-sequence-resolved calls shouldn't have sequence similarity calculated.

I'm staging a v4.1 to be released in a month or so, and in it I'll try to put in the solutions listed previously.
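
To illustrate why symbolic alleles don't fit into a sequence similarity calculation, here is a minimal sketch in plain Python. It is not Truvari's implementation: the function names (`edit_distance`, `is_symbolic`, `seq_similarity`) are hypothetical, and normalizing the edit distance by the longer allele's length is just one possible choice assumed for this example. The point is only that a similarity score needs real bases on both sides, so the sketch declines to score an allele like <DEL>.

```python
from typing import Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # drop a character from a
                cur[j - 1] + 1,            # add a character to a
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def is_symbolic(alt: str) -> bool:
    """True for symbolic ALT alleles such as <DEL>, <INV> or <DUP>."""
    return alt.startswith("<") and alt.endswith(">")


def seq_similarity(a: str, b: str) -> Optional[float]:
    """Illustrative similarity in [0, 1] (assumed normalization, not Truvari's).

    Returns None when either allele is symbolic, because there is no
    sequence to compare.
    """
    if is_symbolic(a) or is_symbolic(b):
        return None
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty sequences are trivially identical
    return 1.0 - edit_distance(a, b) / longest
```

With this sketch, `seq_similarity("ACGT", "ACGA")` scores two resolved alleles, while `seq_similarity("<DEL>", "ACGTACGT")` returns `None` rather than a misleading number, which mirrors the advice above to skip similarity for unresolved calls.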