Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

atomize option in bcftools norm introduces funy characters in the INFO column #1472

Closed
ramprasadn opened this issue Apr 22, 2021 · 8 comments

Comments

@ramprasadn
Copy link

I have noticed that bcftools norm ends up introducing strange characters in the INFO column when the atomize option is turned on.

Here is a sample of entries from the unprocessed vcf,

1       1284490 rs150789461     G       A       779.31  PASS    AC=6;AF=1.00;AN=6;DB;DP=20;ExcessHet=3.0103;FS=0.000;MLEAC=6;MLEAF=1.00;MQ=60.00;QD=25.36;SOR=5.892;VQSLOD=11.34;culprit=MQ     GT:AD:DP:GQ:PL  1/1:0,6:6:18:269,18,0   1/1:0,7:7:21:300,21,0   1/1:0,5:5:15:224,15,0
1       1284878 rs111643738     A       C       2170.73 PASS    AC=6;AF=1.00;AN=6;DB;DP=49;ExcessHet=3.0103;FS=0.000;MLEAC=6;MLEAF=1.00;MQ=60.00;QD=28.20;SOR=1.473;VQSLOD=6.47;culprit=MQ      GT:AD:DP:GQ:PGT:PID:PL:PS       1|1:0,19:19:57:1|1:1284841_A_G:855,57,0:1284841 1|1:0,11:11:33:1|1:1284841_A_G:474,33,0:1284841 1|1:0,19:19:57:1|1:1284841_A_G:855,57,0:1284841
1       1285358 rs11587525      A       G       1089.94 PASS    AC=4;AF=1.00;AN=4;BaseQRankSum=1.65;DB;DP=59;ExcessHet=3.0103;FS=0.000;MLEAC=4;MLEAF=1.00;MQ=58.97;MQRankSum=-1.260e-01;NEGATIVE_TRAIN_SITE;POSITIVE_TRAIN_SITE;QD=26.00;ReadPosRankSum=-1.645e+00;SOR=0.648;VQSLOD=-3.275e-01;culprit=MQ       GT:AD:DP:GQ:PGT:PID:PL:PS       1/1:0,21:21:63:.:.:851,63,0     ./.:27,0:27:.:.:.:0,0,0 1|1:1,8:10:1:1|1:1285358_A_G:250,1,0:1285358
1       1285381 .       G       C       178.23  PASS    AC=1;AF=0.167;AN=6;BaseQRankSum=0.157;DP=42;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.167;MQ=58.36;MQRankSum=-1.570e-01;NEGATIVE_TRAIN_SITE;QD=25.46;ReadPosRankSum=0.157;SOR=2.833;VQSLOD=-5.838e-01;culprit=MQ        GT:AD:DP:GQ:PGT:PID:PL:PS       0/0:13,0:13:4:.:.:0,4,392       0/0:20,0:20:0:.:.:0,0,279       0|1:1,6:8:5:0|1:1285358_A_G:185,0,5:1285358
1       1286821 rs61766214      G       C       585.35  PASS    AC=2;AF=1.00;AN=2;DB;DP=60;ExcessHet=3.0103;FS=0.000;MLEAC=3;MLEAF=1.00;MQ=60.00;POSITIVE_TRAIN_SITE;QD=27.51;SOR=0.693;VQSLOD=8.42;culprit=MQ  GT:AD:DP:GQ:PGT:PID:PL:PS       ./.:23,0:23:.:.:.:0,0,0 ./.:21,0:21:.:.:.:0,0,0 1|1:0,16:16:48:1|1:1286753_ACGGGGGGAG_A:596,48,0:1286753

Post-processing with bcftools, the same entries look like below (Note the fields DB, POSITIVE_TRAIN_SITE, and NEGATIVE_TRAIN_SITE)

bcftools norm --output ./justhusky_norm.vcf --output-type v --atomize --fasta-ref grch37_hs-.fasta ./justhusky.SNV.vcf

1       1284490 rs150789461     G       A       779.31  PASS    AC=6;AF=1;AN=6;DB=^F;DP=20;ExcessHet=3.0103;FS=0;MLEAC=6;MLEAF=1;MQ=60;QD=25.36;SOR=5.892;VQSLOD=11.34;culprit=MQ       GT:AD:DP:GQ:PL  1/1:0,0:6:18:269,18,0   1/1:0,0:7:21:300,21,0   1/1:0,0:5:15:224,15,0
1       1284878 rs111643738     A       C       2170.73 PASS    AC=6;AF=1;AN=6;DB=^F;DP=49;ExcessHet=3.0103;FS=0;MLEAC=6;MLEAF=1;MQ=60;QD=28.2;SOR=1.473;VQSLOD=6.47;culprit=MQ GT:AD:DP:GQ:PGT:PID:PL:PS       1|1:0,0:19:57:1|1:1284841_A_G:855,57,0:1284841  1|1:0,0:11:33:1|1:1284841_A_G:474,33,0:1284841  1|1:0,0:19:57:1|1:1284841_A_G:855,57,0:1284841
1       1285358 rs11587525      A       G       1089.94 PASS    AC=4;AF=1;AN=4;BaseQRankSum=1.65;DB=33<D3>?F<9B>^S;DP=59;ExcessHet=3.0103;FS=0;MLEAC=4;MLEAF=1;MQ=58.97;MQRankSum=-0.126;NEGATIVE_TRAIN_SITE=%^F^A<BE>F<9B>^S;POSITIVE_TRAIN_SITE=%^F^A<BE>F<9B>^S;QD=26;ReadPosRankSum=-1.645;SOR=0.648;VQSLOD=-0.3275;culprit=MQ      GT:AD:DP:GQ:PGT:PID:PL:PS       1/1:0,0:21:63:.:.:851,63,0:.    ./.:27,27:27:.:.:.:0,0,0:.
      1|1:1,1:10:1:1|1:1285358_A_G:250,1,0:1285358
1       1285381 .       G       C       178.23  PASS    AC=1;AF=0.167;AN=6;BaseQRankSum=0.157;DP=42;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.167;MQ=58.36;MQRankSum=-0.157;NEGATIVE_TRAIN_SITE=<9C><C4> <BE>;QD=25.46;ReadPosRankSum=0.157;SOR=2.833;VQSLOD=-0.5838;culprit=MQ     GT:AD:DP:GQ:PGT:PID:PL:PS       0/0:13,13:13:4:.:.:0,4,392:.    0/0:20,20:20:0:.:.:0,0,279:.    0|1:1,1:8:5:0|1:1285358_A_G:185,0,5:1285358
1       1286821 rs61766214      G       C       585.35  PASS    AC=2;AF=1;AN=2;DB=^B;DP=60;ExcessHet=3.0103;FS=0;MLEAC=3;MLEAF=1;MQ=60;POSITIVE_TRAIN_SITE;QD=27.51;SOR=0.693;VQSLOD=8.42;culprit=MQ    GT:AD:DP:GQ:PGT:PID:PL:PS       ./.:23,23:23:.:.:.:0,0,0:.      ./.:21,21:21:.:.:.:0,0,0:.      1|1:0,0:16:48:1|1:1286753_ACGGGGGGAG_A:596,48,0:1286753

I have also noticed two things about this behaviour

  1. It seems to only affect fields that are described as boolean in the header.
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=NEGATIVE_TRAIN_SITE,Number=0,Type=Flag,Description="This variant was used to build the negative training set of bad variants">
##INFO=<ID=POSITIVE_TRAIN_SITE,Number=0,Type=Flag,Description="This variant was used to build the positive training set of good variants">
  1. It is not consistent throughout the file as some of the boolean entries do look okay post processing. For instance, DB field at the position 1286821 looks funny, but POSITIVE_TRAIN_SITE at the same position looks okay.

Is there a workaround to fix this? Any help would be appreciated.

Thanks,

@pd3
Copy link
Member

pd3 commented Apr 26, 2021

Thank you for testing, this feature is still experimental. Can you please confirm this is still a problem with the latest github version? If yes, is there any chance you could provide a small test case to reproduce the problem? I.e. the full VCF, including the header.

@ramprasadn
Copy link
Author

Hi pd3,

Following your suggestion, I tried the atomize option in the the latest github version on develop, and it still doesn't work. Here is the original vcf including the header, justhusky-selected-regions_gent_vrecal.SNV.vcf.zip

Let me know if you need anything more from my side.

Thanks,

@pd3 pd3 closed this as completed in c376591 Apr 27, 2021
@pd3
Copy link
Member

pd3 commented Apr 27, 2021

Thank you for the test case, this is now fixed.

@ramprasadn
Copy link
Author

Hi pd3, thanks for your prompt response! So if I try the latest github version on develop branch, it should work? I am asking because I did that, and it doesn't work so I am wondering if I should be trying out a different branch? :)

@pd3
Copy link
Member

pd3 commented Apr 28, 2021

It's on the develop, the default github branch. Does the bcftools --version shows the updated version string?

@ramprasadn
Copy link
Author

I believe I am using the latest version on github
version

@pd3
Copy link
Member

pd3 commented Apr 28, 2021

No, that's not the latest, the commit c376591 came after. Please do git pull && make to update

@ramprasadn
Copy link
Author

Ah! Sorry about that. It works now. Thanks a ton, Petr!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants