
📊 Long-read SV Benchmark CMRG/GIAB

This is a reproducible benchmark of structural variant (SV) calling, evaluated against two truth sets:

  1. Challenging Medically Relevant Genes (CMRG). The truth set is described in detail in: Wagner et al., 2022. Curated variation benchmarks for challenging medically relevant autosomal genes. Nature Biotechnology.

  2. The Genome in a Bottle (GIAB) truth set v1.1.


Requirements

  • Linux machine
  • ~200 GB of disk space, 32 GB RAM, 4 cores
  • Nextflow


ONT Results:

| caller   | platform | benchmark | TP    | FP   | FN    | precision | recall | F1     | GT concordance |
|----------|----------|-----------|-------|------|-------|-----------|--------|--------|----------------|
| sniffles | ont      | CMRG      | 194   | 18   | 23    | 0.9151    | 0.894  | 0.9044 | 0.8763         |
| cuteSV   | ont      | CMRG      | 197   | 10   | 20    | 0.9517    | 0.9078 | 0.9292 | 0.8934         |
| severus  | ont      | CMRG      | 194   | 10   | 23    | 0.951     | 0.894  | 0.9216 | 0.5722         |
| delly    | ont      | CMRG      | 194   | 18   | 23    | 0.9151    | 0.894  | 0.9044 | 0.8866         |
| dysgu    | ont      | CMRG      | 203   | 16   | 14    | 0.9269    | 0.9355 | 0.9312 | 0.9015         |
| sniffles | ont      | GIAB      | 18382 | 1227 | 10164 | 0.9374    | 0.6439 | 0.7635 | -              |
| cuteSV   | ont      | GIAB      | 19101 | 1477 | 9738  | 0.9282    | 0.6623 | 0.7731 | -              |
| severus  | ont      | GIAB      | 17101 | 1170 | 10605 | 0.936     | 0.6172 | 0.7439 | -              |
| delly    | ont      | GIAB      | 17695 | 2516 | 9798  | 0.8755    | 0.6436 | 0.7419 | -              |
| dysgu    | ont      | GIAB      | 20370 | 1980 | 8176  | 0.9114    | 0.7136 | 0.8005 | -              |

PacBio Results:

| caller   | platform | benchmark | TP    | FP   | FN    | precision | recall | F1     | GT concordance |
|----------|----------|-----------|-------|------|-------|-----------|--------|--------|----------------|
| sniffles | pacbio   | CMRG      | 191   | 9    | 26    | 0.955     | 0.8802 | 0.9161 | 0.9162         |
| cuteSV   | pacbio   | CMRG      | 186   | 7    | 31    | 0.9637    | 0.8571 | 0.9073 | 0.8871         |
| severus  | pacbio   | CMRG      | 182   | 0    | 35    | 1         | 0.8387 | 0.9123 | 0.5495         |
| delly    | pacbio   | CMRG      | 184   | 9    | 33    | 0.9534    | 0.8479 | 0.8976 | 0.8859         |
| sawfish  | pacbio   | CMRG      | 205   | 0    | 12    | 1         | 0.9447 | 0.9716 | 0.961          |
| dysgu    | pacbio   | CMRG      | 199   | 3    | 18    | 0.9851    | 0.9171 | 0.9499 | 0.9296         |
| sniffles | pacbio   | GIAB      | 19380 | 935  | 9014  | 0.954     | 0.6825 | 0.7957 | -              |
| cuteSV   | pacbio   | GIAB      | 19338 | 1164 | 9208  | 0.9432    | 0.6774 | 0.7885 | -              |
| severus  | pacbio   | GIAB      | 16107 | 954  | 11882 | 0.9441    | 0.5755 | 0.7151 | -              |
| delly    | pacbio   | GIAB      | 17173 | 1561 | 10741 | 0.9167    | 0.6152 | 0.7363 | -              |
| sawfish  | pacbio   | GIAB      | 20273 | 1766 | 7511  | 0.9199    | 0.7297 | 0.8138 | -              |
| dysgu    | pacbio   | GIAB      | 20895 | 1197 | 7715  | 0.9458    | 0.7303 | 0.8242 | -              |

Reads were from Oxford Nanopore kit14 (~40X coverage) and PacBio Vega HiFi (~30X coverage). The SV callers tested were sniffles, cuteSV, severus, delly, sawfish (PacBio only) and dysgu.

Truvari v4.2 was used for benchmarking.

To repeat these results, run:

```
bash fetch_data.sh
nextflow run pipeline.nf
```

Notes

Truvari refine was used on the GIAB benchmark. For the CMRG benchmark, refine was not run, as most VCFs triggered errors at this step.
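
For context, refine post-processes a finished truvari bench output directory; a minimal sketch of the step (placeholder paths, assuming truvari's standard arguments, not the pipeline's actual invocation) is:

```
# Placeholder paths; refine re-evaluates matches in an existing bench output directory
truvari refine -f reference.fa bench_output/
```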

For Severus, a script was used to convert BND-notation deletions into symbolic DEL calls; see convert_severus.py for details.
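
The gist of such a conversion, sketched in Python, is below. This is a simplified illustration only, not the repository's actual logic; a robust version would also check breakend orientation and consume the mate records.

```python
# bnd_to_del_sketch.py - illustrative only; see convert_severus.py for the real logic.
# Reads a VCF on stdin, rewrites intra-chromosomal deletion-like BNDs as <DEL>.
import re
import sys

# Matches the mate coordinate inside a BND ALT such as N[chr2:321682[ or ]13:123]T
MATE_RE = re.compile(r"[\[\]]([^\[\]:]+):(\d+)[\[\]]")

for line in sys.stdin:
    if line.startswith("#"):
        sys.stdout.write(line)
        continue
    fields = line.rstrip("\n").split("\t")
    chrom, pos, alt, info = fields[0], int(fields[1]), fields[4], fields[7]
    m = MATE_RE.search(alt)
    # Same chromosome with the mate downstream: treat the pair as one deletion
    if "SVTYPE=BND" in info and m and m.group(1) == chrom and int(m.group(2)) > pos:
        end = int(m.group(2))
        fields[4] = "<DEL>"
        fields[7] = (info.replace("SVTYPE=BND", "SVTYPE=DEL")
                     + f";END={end};SVLEN={pos - end}")  # negative SVLEN for deletions
        line = "\t".join(fields) + "\n"
    sys.stdout.write(line)
```

Run as, e.g., `python bnd_to_del_sketch.py < severus.vcf > severus_del.vcf`.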

Truvari parameters were:

```
--passonly -r 1000 --dup-to-ins -p 0
```
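
As a rough sketch, these flags slot into a truvari bench call along the following lines (file names here are placeholders; the real invocation is generated by the pipeline):

```
# Placeholder file names; compare a caller's VCF against the truth set
truvari bench -b truthset.vcf.gz -c caller.vcf.gz \
    --passonly -r 1000 --dup-to-ins -p 0 \
    -o bench_output/
```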

To modify parameters, edit the nextflow.config file.

Contributing

Happy to accept PRs adding other callers to this benchmark, so long as:

  • The caller works with the input data (BAM and CRAM files)
  • The caller requires < 5 hours to process a sample and consumes < 32 GB of memory
  • The caller can be reliably installed without editing the source code