normalization.nf

This workflow normalizes the variants in a VCF file. Normalization is needed, for example, before comparing a VCF against a gold standard call set.

The workflow normalizes one of two variant types, SNPs or INDELs, by running the following commands in order:

Split the multiallelic variants using bcftools norm:

bcftools norm -m -any $input.vcf.gz -o out.norm.vcf.gz -Oz
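
To make the effect of this step concrete, here is a toy sketch in plain shell and awk. It is not bcftools itself: the function name split_multiallelic and the sample record are made up for illustration, and unlike bcftools norm it only rewrites the ALT column, leaving the INFO and genotype fields untouched.

```shell
# Toy illustration of multiallelic splitting (the effect of -m -any on the ALT column).
# Hypothetical helper, not part of the workflow; INFO and genotype fields are not adjusted.
split_multiallelic() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { print; next }            # pass header lines through unchanged
       {
         n = split($5, alts, ",")      # column 5 is ALT; alleles are comma-separated
         for (i = 1; i <= n; i++) { $5 = alts[i]; print }
       }'
}

# One record with two ALT alleles (C and T) becomes two biallelic records.
printf '1\t100\t.\tA\tC,T\t50\tPASS\t.\n' | split_multiallelic
```

Each output record carries a single ALT allele, which is the form the later steps operate on.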

Decompose complex variants using vcflib vcfallelicprimitives. If a single VCF record specifies multiple allelic primitives (gaps or mismatches), the record is split into multiple lines; -k and -g keep the INFO and genotype fields:

vcflib vcfallelicprimitives -k -g out.norm.vcf.gz | bgzip -c > out.norm.decomp.vcf.gz

Select the type of variant to normalize ($vt):

bcftools view out.norm.decomp.vcf.gz -v $vt -o out.norm.decomp.$vt.vcf.gz -Oz

Sort the resulting VCF:

bcftools sort out.norm.decomp.$vt.vcf.gz -o out.norm.decomp.$vt.sort.vcf.gz -Oz

Drop duplicate variants using vt uniq:

vt uniq out.norm.decomp.$vt.sort.vcf.gz | bgzip -c > out.norm.decomp.$vt.sort.uniq.vcf.gz
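
The five steps can be chained as in the sketch below. It is a minimal illustration, not the workflow's actual process definitions: the function name normalization_commands is hypothetical, it assumes the input to vt uniq is the sorted file from the previous step, and it assumes file names without spaces. The function only prints the commands, so it can be inspected without any of the tools installed.

```shell
# Print the normalization commands for one input VCF and one variant type.
# normalization_commands is a hypothetical helper, not part of the workflow.
normalization_commands() {
  input=$1   # bgzipped input VCF
  vt=$2      # variant type given to "bcftools view -v": snps or indels
  cat <<EOF
bcftools norm -m -any ${input} -o out.norm.vcf.gz -Oz
vcflib vcfallelicprimitives -k -g out.norm.vcf.gz | bgzip -c > out.norm.decomp.vcf.gz
bcftools view out.norm.decomp.vcf.gz -v ${vt} -o out.norm.decomp.${vt}.vcf.gz -Oz
bcftools sort out.norm.decomp.${vt}.vcf.gz -o out.norm.decomp.${vt}.sort.vcf.gz -Oz
vt uniq out.norm.decomp.${vt}.sort.vcf.gz | bgzip -c > out.norm.decomp.${vt}.sort.uniq.vcf.gz
EOF
}

# Show the plan for SNPs.
normalization_commands input.vcf.gz snps

# To execute the plan (requires bcftools, vcflib, vt and bgzip on PATH):
# normalization_commands input.vcf.gz snps | sh -e
```

Each command reads the file written by the one before it, so the final file name records every step that was applied.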

Docker image

The best way to run this pipeline is with the following Docker image: https://hub.docker.com/repository/docker/elowy01/normalization

Nextflow configuration file

The configuration file for this pipeline is passed to Nextflow with the -C option; the usage example below calls it normalization.config.

Nextflow workflow

The workflow is implemented in the script normalization.nf.

Usage

nextflow -C normalization.config run normalization.nf --vcf file.vcf.gz --vt [snps|indels]

Parameters

--vcf

Path to the VCF file to normalize.

--vt

Variant type to normalize: 'snps' or 'indels'.
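
Since --vt accepts only two values, it can be convenient to check both parameters before launching Nextflow. The wrapper below is a hypothetical sketch: check_params is not part of the workflow, and whether normalization.nf performs similar checks itself is not stated here.

```shell
# Hypothetical pre-flight check for the two workflow parameters.
check_params() {
  vcf=$1   # value intended for --vcf
  vt=$2    # value intended for --vt
  case "$vt" in
    snps|indels) ;;   # the two documented variant types
    *) echo "error: --vt must be 'snps' or 'indels', got '$vt'" >&2; return 1 ;;
  esac
  [ -f "$vcf" ] || { echo "error: VCF file not found: $vcf" >&2; return 1; }
}

# Launch only when both checks pass:
# check_params file.vcf.gz snps && \
#   nextflow -C normalization.config run normalization.nf --vcf file.vcf.gz --vt snps
```

Failing early avoids starting a run that would stop at the variant selection step.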

Output

The normalized VCF file will be placed in a directory named:
results/
