Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bcftools concat --ligate drops variants when overlap between two VCFs is empty #1567

Closed
freeseek opened this issue Aug 25, 2021 · 4 comments
Closed

Comments

@freeseek
Copy link
Contributor

I have not looked at the vcfconcat.c code yet, but the following behavior seems puzzling.

Generate the following VCFs:

(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=chr1>"
echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE"
echo -e "chr1\t1\t.\tA\tC\t.\t.\t.\tGT\t0|1") | bcftools view -Oz -o A1.vcf.gz
bcftools index -t -f A1.vcf.gz

(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=chr1>"
echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE"
echo -e "chr1\t1\t.\tA\tC\t.\t.\t.\tGT\t0|1"
echo -e "chr1\t2\t.\tC\tG\t.\t.\t.\tGT\t0|1") | bcftools view -Oz -o A2.vcf.gz
bcftools index -t -f A2.vcf.gz

(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=chr1>"
echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE"
echo -e "chr1\t2\t.\tC\tG\t.\t.\t.\tGT\t1|0"
echo -e "chr1\t3\t.\tG\tT\t.\t.\t.\tGT\t0|1") | bcftools view -Oz -o B.vcf.gz
bcftools index -t -f B.vcf.gz

(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=chr1>"
echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE"
echo -e "chr1\t3\t.\tG\tT\t.\t.\t.\tGT\t1|0"
echo -e "chr1\t4\t.\tT\tA\t.\t.\t.\tGT\t0|1") | bcftools view -Oz -o C.vcf.gz
bcftools index -t -f C.vcf.gz

All four variants are output when performing a simple concatenation:

$ bcftools concat --allow-overlaps --rm-dups all A1.vcf.gz B.vcf.gz C.vcf.gz | bcftools view -H
chr1	1	.	A	C	.	.	.	GT	0|1
chr1	2	.	C	G	.	.	.	GT	1|0
chr1	3	.	G	T	.	.	.	GT	0|1
chr1	4	.	T	A	.	.	.	GT	0|1

One variant goes missing when the overlap between the first two VCFs is empty:

$ bcftools concat --ligate A1.vcf.gz B.vcf.gz C.vcf.gz | bcftools view -H
chr1	1	.	A	C	.	.	.	GT:PS	0|1:1
chr1	3	.	G	T	.	.	.	GT:PS	0|1:1
chr1	4	.	T	A	.	.	.	GT:PS	1|0:1

If the overlap between the first two VCFs is not empty, all variants are retained both with a simple concatenation:

$ bcftools concat --allow-overlaps --rm-dups all A2.vcf.gz B.vcf.gz C.vcf.gz | bcftools view -H
chr1	1	.	A	C	.	.	.	GT	0|1
chr1	2	.	C	G	.	.	.	GT	0|1
chr1	3	.	G	T	.	.	.	GT	0|1
chr1	4	.	T	A	.	.	.	GT	0|1

And with a concatenation using phase ligation:

$ bcftools concat --ligate A2.vcf.gz B.vcf.gz C.vcf.gz | bcftools view -H
chr1	1	.	A	C	.	.	.	GT:PS	0|1:1
chr1	2	.	C	G	.	.	.	GT:PS	0|1:1
chr1	3	.	G	T	.	.	.	GT:PS	1|0:1
chr1	4	.	T	A	.	.	.	GT:PS	0|1:1
@pd3
Copy link
Member

pd3 commented Aug 26, 2021

The -l option was designed for VCFs with perfect overlaps and drops sites that break this assumption. A warning should have been printed, adding it now. What was the real life situation this toy case is based on?

@freeseek
Copy link
Contributor Author

The use case was a phasing pipeline where the overlapping windows for the phasing tasks are decided ahead of time, regardless of the list of markers passing quality control. What happened is that in one case the overlap between two consecutive windows turned out to be empty, due to sparsity of the array in a given region of the genome.

I don't know how difficult it would be to make it possible to concatenate without dropping variants when two consecutive windows have no overlap reverting to the non-phase-aware concatenation method. I would not be worried that the phase would not be reliably ligated in that case. Even in the canonical case of a consistent window overlap, there is always the possibility that a given sample is homozygous across the overlap, so the user already has to accept that the ligation can fail in a subset of samples.

As I personally see the primary use of bcftools concat --ligate in phasing pipelines for large cohorts (and it is really a key ingredient) where numerous overlapping windows are separately phased for each chromosome, I think a warning might not be useful as users would be unlikely to check the output of each task (e.g. if you are running one ligation task per chromosome). If the outcome is that variants will be dropped, I would see an error as more appropriate.

@pd3 pd3 closed this as completed in ef5cf0a Aug 28, 2021
@pd3
Copy link
Member

pd3 commented Aug 28, 2021

OK, I just pushed a commit which changes the default behavior to throw an error when non-overlapping chunks or sites present in one chunk but absent in the other are encountered. To drop such sites and proceed, one can use the new --ligate-warn option (previously this was the default, but the warning was missing in certain situations). To keep all sites, use the new --ligate-force option.

Please let me know if you spot anything odd, handling arbitrary overlaps between arbitrary number of overlapping files can be tricky.

@BJWiley233
Copy link

@freeseek I just got this error running MoChA WDL. Should we be adding flag --ligate-force to the vcf_concat step?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants