
normalization.nf

Ernesto Lowy edited this page Jun 21, 2021 · 8 revisions

This workflow normalizes the variants in a VCF file. Normalization is necessary, for example, when comparing a VCF against a gold standard call set.

The workflow can normalize two variant types, SNPs and INDELs, by running the following commands in order:

  • Split the multiallelic variants using bcftools norm:
bcftools norm -m -any $input.vcf.gz -o out.norm.vcf.gz -Oz
  • Run vcflib vcfallelicprimitives. If multiple allelic primitives (gaps or mismatches) are specified in a single VCF record, it will split the record into multiple lines:
vcflib vcfallelicprimitives -k -g out.norm.vcf.gz | bgzip -c > out.norm.decomp.vcf.gz
  • Select the type of variant to normalize ($vt):
bcftools view out.norm.decomp.vcf.gz -v $vt -o out.norm.decomp.$vt.vcf.gz -Oz
  • Sort the resulting VCF:
bcftools sort out.norm.decomp.$vt.vcf.gz -o out.norm.decomp.$vt.sort.vcf.gz -Oz
  • Drop duplicate variants using vt uniq:
vt uniq $vcf | bgzip -c > out.norm.decomp.$vt.sort.uniq.vcf.gz
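
To make the chaining of intermediate files explicit, the five steps above can be sketched as a small shell function that prints the commands for a given input and variant type without running them. This is only an illustration: the function name print_normalization_cmds is hypothetical, and feeding the sorted file to vt uniq is an assumption, since the step above only shows $vcf as its input.

```shell
# Sketch: print (not run) the five normalization commands in order.
# Assumptions: the function name is hypothetical, and the last step is
# assumed to read the sorted file produced by the fourth step.
print_normalization_cmds() {
    input="$1"   # bgzipped input VCF
    vt="$2"      # variant type: snps or indels
    echo "bcftools norm -m -any ${input} -o out.norm.vcf.gz -Oz"
    echo "vcflib vcfallelicprimitives -k -g out.norm.vcf.gz | bgzip -c > out.norm.decomp.vcf.gz"
    echo "bcftools view out.norm.decomp.vcf.gz -v ${vt} -o out.norm.decomp.${vt}.vcf.gz -Oz"
    echo "bcftools sort out.norm.decomp.${vt}.vcf.gz -o out.norm.decomp.${vt}.sort.vcf.gz -Oz"
    echo "vt uniq out.norm.decomp.${vt}.sort.vcf.gz | bgzip -c > out.norm.decomp.${vt}.sort.uniq.vcf.gz"
}

# Example: preview the commands for SNP normalization
print_normalization_cmds sample.vcf.gz snps
```

Each command reads the file written by the previous one, so the final file name records every step that was applied.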

Docker image

The recommended way to run this pipeline is with the following Docker image: https://hub.docker.com/repository/docker/elowy01/normalization

Nextflow configuration file

The configuration file that can be used with this pipeline can be found here

Nextflow workflow

This workflow is implemented in the script named normalization.nf

Usage

nextflow -C normalization.config run normalization.nf --vcf file.vcf.gz --vt [snps|indels]

Parameters

--vcf

Specifies the VCF file to normalize.

--vt

Specifies the variant type to normalize: 'snps' or 'indels'.
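
Because --vt accepts only two values, it can help to check both arguments before launching Nextflow. The wrapper below is a minimal sketch of such a check, not part of the workflow itself: the function name run_normalization and the NEXTFLOW override variable are hypothetical, and the Nextflow command line is the one shown under Usage.

```shell
# Sketch of a pre-flight check before launching the workflow.
# Assumptions: run_normalization and the NEXTFLOW variable are hypothetical;
# setting NEXTFLOW=echo previews the command instead of running it.
run_normalization() {
    vcf="$1"
    vt="$2"
    # The input VCF must exist
    if [ ! -f "$vcf" ]; then
        echo "VCF file not found: $vcf" >&2
        return 1
    fi
    # Only the two documented variant types are accepted
    case "$vt" in
        snps|indels) ;;
        *) echo "--vt must be 'snps' or 'indels', got: $vt" >&2
           return 1 ;;
    esac
    "${NEXTFLOW:-nextflow}" -C normalization.config run normalization.nf --vcf "$vcf" --vt "$vt"
}
```

For example, `run_normalization file.vcf.gz snps` launches the workflow only if file.vcf.gz exists, and an unsupported type such as `mnps` is rejected with an error message before Nextflow starts.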

Output

The normalized VCF file will be placed in a directory named:
results/
