normalization.nf
Ernesto Lowy edited this page Jun 21, 2021
This workflow normalizes the variants in a VCF file. Normalization is needed, for example, before comparing a VCF against a gold standard call set.
The workflow can normalize two variant types, SNPs and INDELs, and runs the following commands:
- Split the multiallelic variants using bcftools norm:
bcftools norm -m -any $input.vcf.gz -o out.norm.vcf.gz -Oz
- Run vcflib vcfallelicprimitives. If multiple allelic primitives (gaps or mismatches) are specified in a single VCF record, it will split the record into multiple lines:
vcflib vcfallelicprimitives -k -g out.norm.vcf.gz | bgzip -c > out.norm.decomp.vcf.gz
- Select the type of variant to normalize ($vt):
bcftools view out.norm.decomp.vcf.gz -v $vt -o out.norm.decomp.$vt.vcf.gz -Oz
- Sort the resulting VCF:
bcftools sort out.norm.decomp.$vt.vcf.gz -o out.norm.decomp.$vt.sort.vcf.gz -Oz
- Drop duplicate variants using vt uniq:
vt uniq $vcf | bgzip -c > out.norm.decomp.$vt.sort.uniq.vcf.gz
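The way the five steps feed into each other can be sketched as a dry run that only prints the commands, so the chain of intermediate file names is visible without running any tool. This is a minimal sketch, not the code in normalization.nf: `print_normalization_steps` is a hypothetical helper, and it assumes that the `$vcf` passed to vt uniq is the sorted file produced by the previous step.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the five commands for a given input VCF and variant type.
# print_normalization_steps is a hypothetical helper, not part of normalization.nf.
print_normalization_steps() {
    local input="$1"   # bgzipped input VCF
    local vt="$2"      # variant type: snps or indels
    local norm="out.norm.vcf.gz"
    local decomp="out.norm.decomp.vcf.gz"
    local typed="out.norm.decomp.${vt}.vcf.gz"
    local sorted="out.norm.decomp.${vt}.sort.vcf.gz"
    local uniq="out.norm.decomp.${vt}.sort.uniq.vcf.gz"

    echo "bcftools norm -m -any ${input} -o ${norm} -Oz"
    echo "vcflib vcfallelicprimitives -k -g ${norm} | bgzip -c > ${decomp}"
    echo "bcftools view ${decomp} -v ${vt} -o ${typed} -Oz"
    echo "bcftools sort ${typed} -o ${sorted} -Oz"
    # Assumption: the input to vt uniq is the sorted file from the previous step.
    echo "vt uniq ${sorted} | bgzip -c > ${uniq}"
}

print_normalization_steps file.vcf.gz snps
```

Each printed line uses the output of the previous step as its input, which is why the file names grow a suffix at every stage.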
The recommended way to run this pipeline is with the following Docker image: https://hub.docker.com/repository/docker/elowy01/normalization
The configuration file that can be used with this pipeline can be found here
This workflow is implemented in the script named normalization.nf
nextflow -C normalization.config run normalization.nf --vcf file.vcf.gz --vt [snps|indels]
--vcf: Specifies the VCF file that will be normalized.
--vt: Specifies the variant type that will be normalized ('snps' or 'indels').
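Because `--vt` accepts only two values, a wrapper script could check the value before launching the workflow. The following is a sketch under that assumption: `validate_vt` is a hypothetical helper that is not part of normalization.nf, and the nextflow command is only printed here rather than executed.

```shell
#!/usr/bin/env bash
# validate_vt is a hypothetical helper, not part of normalization.nf.
# It succeeds only for the two variant types the workflow accepts.
validate_vt() {
    case "$1" in
        snps|indels) return 0 ;;
        *) echo "Error: --vt must be 'snps' or 'indels', got '$1'" >&2
           return 1 ;;
    esac
}

vt="snps"
if validate_vt "$vt"; then
    # Printed rather than run, so the sketch works without nextflow installed.
    echo "nextflow -C normalization.config run normalization.nf --vcf file.vcf.gz --vt $vt"
fi
```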
The normalized VCF file will be placed in a directory named:
results/