
Ernesto Lowy edited this page Jun 21, 2021 · 8 revisions

This workflow normalizes the variants in a VCF file. Normalization is necessary, for example, before comparing a VCF against a gold-standard call set.

It can normalize two variant types, SNPs and INDELs, by running the following commands in order:

  • Split the multiallelic variants using bcftools norm:
bcftools norm -m -any $input.vcf.gz -o out.norm.vcf.gz -Oz
  • Run vcflib vcfallelicprimitives, which splits a VCF record into multiple lines when the record specifies multiple allelic primitives (gaps or mismatches):
vcflib vcfallelicprimitives -k -g out.norm.vcf.gz |bgzip -c > out.norm.decomp.vcf.gz
  • Select the type of variant to normalize ($vt):
bcftools view out.norm.decomp.vcf.gz -v $vt -o out.norm.decomp.$vt.vcf.gz -Oz
  • Sort the resulting VCF:
bcftools sort out.norm.decomp.$vt.vcf.gz -o out.norm.decomp.$vt.sort.vcf.gz -Oz
  • Drop duplicate variants using vt uniq, where $vcf is the sorted VCF from the previous step:
vt uniq $vcf | bgzip -c > out.norm.decomp.$vt.sort.uniq.vcf.gz
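
The steps above can be chained into a single shell function. The following is a minimal sketch, not the workflow's actual implementation: the names `run` and `normalize_vcf`, the `DRY_RUN` switch, and the optional output prefix are hypothetical, and it assumes that the sorted file from the fourth step is the input to `vt uniq` (the command above refers to it as `$vcf`). With `DRY_RUN=1` the commands are printed instead of executed, so the intermediate file names can be checked without the tools installed.

```shell
#!/usr/bin/env bash
# Sketch only: chains the five normalization commands for one variant type.
# run, normalize_vcf, DRY_RUN and the output prefix are illustrative names.
# File names must not contain spaces or shell metacharacters, because each
# command is passed around as a single string.

# Print the command when DRY_RUN=1, otherwise execute it with pipefail set
# so that a failure on the left-hand side of a pipe is not masked by bgzip.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$1"
    else
        bash -o pipefail -c "$1"
    fi
}

normalize_vcf() {
    local input="$1"          # bgzipped input VCF
    local vt="$2"             # variant type: snps or indels
    local prefix="${3:-out}"  # prefix for every output file

    # 1. Split multiallelic records.
    run "bcftools norm -m -any ${input} -o ${prefix}.norm.vcf.gz -Oz" &&
    # 2. Decompose records that contain several allelic primitives.
    run "vcflib vcfallelicprimitives -k -g ${prefix}.norm.vcf.gz | bgzip -c > ${prefix}.norm.decomp.vcf.gz" &&
    # 3. Keep only the requested variant type.
    run "bcftools view ${prefix}.norm.decomp.vcf.gz -v ${vt} -o ${prefix}.norm.decomp.${vt}.vcf.gz -Oz" &&
    # 4. Sort.
    run "bcftools sort ${prefix}.norm.decomp.${vt}.vcf.gz -o ${prefix}.norm.decomp.${vt}.sort.vcf.gz -Oz" &&
    # 5. Drop duplicate variants (assumed input: the sorted file from step 4).
    run "vt uniq ${prefix}.norm.decomp.${vt}.sort.vcf.gz | bgzip -c > ${prefix}.norm.decomp.${vt}.sort.uniq.vcf.gz"
}
```

For example, `DRY_RUN=1 normalize_vcf input.vcf.gz snps` prints the five commands with `snps` substituted for `$vt`. Because the steps are joined with `&&`, a failing step stops the chain and later steps do not run on a missing or partial file.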

Docker image

The best way of running this pipeline is by using the following Docker image:

Nextflow configuration file

The configuration file that can be used with this pipeline can be found here

Nextflow workflow

This workflow is implemented in the script named


nextflow -C normalization.config run --vcf file.vcf.gz --vt [snps|indels]



--vcf: Specifies the VCF file to be normalized.


--vt: Specifies the variant type to be normalized ('snps' or 'indels').
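
Since only these two values are documented, a wrapper script could reject anything else before launching Nextflow. This is a minimal sketch under that assumption; the function name `check_vt` is hypothetical and not part of the workflow.

```shell
# Sketch only: check_vt is an illustrative helper, not part of the workflow.
# Returns 0 for a documented variant type and 1 (with a message on stderr)
# for anything else.
check_vt() {
    case "$1" in
        snps|indels)
            return 0
            ;;
        *)
            echo "unsupported variant type: '$1' (expected snps or indels)" >&2
            return 1
            ;;
    esac
}
```

A caller might write `check_vt "$vt" || exit 1` ahead of the `nextflow` command so that a typo fails immediately rather than partway through the run.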


The normalized VCF file will be placed in a directory named:
