Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drop duplicate FORMAT tags and fix an error with dropping large tags #1752

Merged
merged 2 commits into from
Mar 14, 2024

Conversation

pd3
Copy link
Member

@pd3 pd3 commented Feb 22, 2024

The removal of large tags introduced by b49eea4 and 9db565d could not work correctly, the memmove pointers were wrong!

Resolves #1733

This is a fix of the botched pull request #1736

fmt_aux_t *jfmt = &fmt[j];
if ( jfmt->size==-1 ) continue; // already marked for removal
if ( ifmt->key!=jfmt->key ) continue;
static int warned = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Htslib is meant to be thread safe, and also callable multiple times. We shouldn't be using static variables like this as they can lead to thread complaints with automated checking tools (such as two threads modifying the same variable without intervening locks) and also if we parse file A and file B then we'll never report warnings on file B if file A produced one.

Personally, I think we should warn for everything. If it gets too spammy, people can turn down the warning level with the appropriate command line options to set verbosity. (Or just fix their buggy inputs.)

We could however have it as a non-standard function level scope so we warn once for the function as a whole. This reduces spamming a little if the entire string is duplicated.

vcf.c Outdated
v->n_fmt--;
if ( i+1<v->n_fmt ) memmove(&fmt[i],&fmt[i+1],sizeof(*fmt)*v->n_fmt);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's worrying that this never worked before, but the new version also doesn't work and is now infact a security flaw:

$ ./test/test_view poc_simple.vcf
...
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	X1
[W::vcf_parse_format_alloc4] Duplicate FORMAT tag GT at chr1:10327
*** stack smashing detected ***: <unknown> terminated
Aborted

Note it also didn't work with a FORMAT string of GT:GT:X. That got replaced by GT (no X) because you're decrementing v->n_fmt too early so the if ( i+1<v->n_fmt ) means it doesn't do the copy of the remaining item.

To avoid the buffer overflow, you need to reduce the amount copied:

diff --git a/htscodecs b/htscodecs
--- a/htscodecs
+++ b/htscodecs
@@ -1 +1 @@
-Subproject commit ffda7310c4b3292955561d6c3b1743cb82bfe26b
+Subproject commit ffda7310c4b3292955561d6c3b1743cb82bfe26b-dirty
diff --git a/vcf.c b/vcf.c
index f08dbc07..1ed0f23d 100644
--- a/vcf.c
+++ b/vcf.c
@@ -3297,8 +3297,9 @@ static int vcf_parse_format_gt6(kstring_t *s, const bcf_hdr_t *h, bcf1_t *v,
         while ( i < v->n_fmt ) {
             if ( fmt[i].size==-1 )
             {
+                if ( i+1<v->n_fmt )
+                    memmove(&fmt[i],&fmt[i+1],sizeof(*fmt)*(v->n_fmt-(i+1)));
                 v->n_fmt--;
-                if ( i+1<v->n_fmt ) memmove(&fmt[i],&fmt[i+1],sizeof(*fmt)*v->n_fmt);
             }
             else
                 i++;

The reason this isn't spotted by the address sanitizer in the tests is that we're only testing with 2 format keys, and we always allocate MAX_N_FMT entries. So it's overflowing the array slots we think it's using, but as far as the CPU is concerned it's still keeping within the bounds of the memory we allocated. To exploit it you need to have an entry in slot x where 2x >= 255 (+/- a few perhaps).

My little test file is this:

##fileformat=VCFv4.1
##INFO=<ID=S,Number=1,dype=String,Description="blah">
##INFO=<ID=I,Number=1,Type=Integer,Description="blah">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=a0,Number=1,Type=String>
##FORMAT=<ID=a1,Number=1,Type=String>
##FORMAT=<ID=a2,Number=1,Type=String>
##FORMAT=<ID=a3,Number=1,Type=String>
##FORMAT=<ID=a4,Number=1,Type=String>
##FORMAT=<ID=a5,Number=1,Type=String>
##FORMAT=<ID=a6,Number=1,Type=String>
##FORMAT=<ID=a7,Number=1,Type=String>
##FORMAT=<ID=a8,Number=1,Type=String>
##FORMAT=<ID=a9,Number=1,Type=String>
##FORMAT=<ID=a10,Number=1,Type=String>
##FORMAT=<ID=a11,Number=1,Type=String>
##FORMAT=<ID=a12,Number=1,Type=String>
##FORMAT=<ID=a13,Number=1,Type=String>
##FORMAT=<ID=a14,Number=1,Type=String>
##FORMAT=<ID=a15,Number=1,Type=String>
##FORMAT=<ID=a16,Number=1,Type=String>
##FORMAT=<ID=a17,Number=1,Type=String>
##FORMAT=<ID=a18,Number=1,Type=String>
##FORMAT=<ID=a19,Number=1,Type=String>
##FORMAT=<ID=a20,Number=1,Type=String>
##FORMAT=<ID=a21,Number=1,Type=String>
##FORMAT=<ID=a22,Number=1,Type=String>
##FORMAT=<ID=a23,Number=1,Type=String>
##FORMAT=<ID=a24,Number=1,Type=String>
##FORMAT=<ID=a25,Number=1,Type=String>
##FORMAT=<ID=a26,Number=1,Type=String>
##FORMAT=<ID=a27,Number=1,Type=String>
##FORMAT=<ID=a28,Number=1,Type=String>
##FORMAT=<ID=a29,Number=1,Type=String>
##FORMAT=<ID=a30,Number=1,Type=String>
##FORMAT=<ID=a31,Number=1,Type=String>
##FORMAT=<ID=a32,Number=1,Type=String>
##FORMAT=<ID=a33,Number=1,Type=String>
##FORMAT=<ID=a34,Number=1,Type=String>
##FORMAT=<ID=a35,Number=1,Type=String>
##FORMAT=<ID=a36,Number=1,Type=String>
##FORMAT=<ID=a37,Number=1,Type=String>
##FORMAT=<ID=a38,Number=1,Type=String>
##FORMAT=<ID=a39,Number=1,Type=String>
##FORMAT=<ID=a40,Number=1,Type=String>
##FORMAT=<ID=a41,Number=1,Type=String>
##FORMAT=<ID=a42,Number=1,Type=String>
##FORMAT=<ID=a43,Number=1,Type=String>
##FORMAT=<ID=a44,Number=1,Type=String>
##FORMAT=<ID=a45,Number=1,Type=String>
##FORMAT=<ID=a46,Number=1,Type=String>
##FORMAT=<ID=a47,Number=1,Type=String>
##FORMAT=<ID=a48,Number=1,Type=String>
##FORMAT=<ID=a49,Number=1,Type=String>
##FORMAT=<ID=a50,Number=1,Type=String>
##FORMAT=<ID=a51,Number=1,Type=String>
##FORMAT=<ID=a52,Number=1,Type=String>
##FORMAT=<ID=a53,Number=1,Type=String>
##FORMAT=<ID=a54,Number=1,Type=String>
##FORMAT=<ID=a55,Number=1,Type=String>
##FORMAT=<ID=a56,Number=1,Type=String>
##FORMAT=<ID=a57,Number=1,Type=String>
##FORMAT=<ID=a58,Number=1,Type=String>
##FORMAT=<ID=a59,Number=1,Type=String>
##FORMAT=<ID=a60,Number=1,Type=String>
##FORMAT=<ID=a61,Number=1,Type=String>
##FORMAT=<ID=a62,Number=1,Type=String>
##FORMAT=<ID=a63,Number=1,Type=String>
##FORMAT=<ID=a64,Number=1,Type=String>
##FORMAT=<ID=a65,Number=1,Type=String>
##FORMAT=<ID=a66,Number=1,Type=String>
##FORMAT=<ID=a67,Number=1,Type=String>
##FORMAT=<ID=a68,Number=1,Type=String>
##FORMAT=<ID=a69,Number=1,Type=String>
##FORMAT=<ID=a70,Number=1,Type=String>
##FORMAT=<ID=a71,Number=1,Type=String>
##FORMAT=<ID=a72,Number=1,Type=String>
##FORMAT=<ID=a73,Number=1,Type=String>
##FORMAT=<ID=a74,Number=1,Type=String>
##FORMAT=<ID=a75,Number=1,Type=String>
##FORMAT=<ID=a76,Number=1,Type=String>
##FORMAT=<ID=a77,Number=1,Type=String>
##FORMAT=<ID=a78,Number=1,Type=String>
##FORMAT=<ID=a79,Number=1,Type=String>
##FORMAT=<ID=a80,Number=1,Type=String>
##FORMAT=<ID=a81,Number=1,Type=String>
##FORMAT=<ID=a82,Number=1,Type=String>
##FORMAT=<ID=a83,Number=1,Type=String>
##FORMAT=<ID=a84,Number=1,Type=String>
##FORMAT=<ID=a85,Number=1,Type=String>
##FORMAT=<ID=a86,Number=1,Type=String>
##FORMAT=<ID=a87,Number=1,Type=String>
##FORMAT=<ID=a88,Number=1,Type=String>
##FORMAT=<ID=a89,Number=1,Type=String>
##FORMAT=<ID=a90,Number=1,Type=String>
##FORMAT=<ID=a91,Number=1,Type=String>
##FORMAT=<ID=a92,Number=1,Type=String>
##FORMAT=<ID=a93,Number=1,Type=String>
##FORMAT=<ID=a94,Number=1,Type=String>
##FORMAT=<ID=a95,Number=1,Type=String>
##FORMAT=<ID=a96,Number=1,Type=String>
##FORMAT=<ID=a97,Number=1,Type=String>
##FORMAT=<ID=a98,Number=1,Type=String>
##FORMAT=<ID=a99,Number=1,Type=String>
##FORMAT=<ID=a100,Number=1,Type=String>
##FORMAT=<ID=a101,Number=1,Type=String>
##FORMAT=<ID=a102,Number=1,Type=String>
##FORMAT=<ID=a103,Number=1,Type=String>
##FORMAT=<ID=a104,Number=1,Type=String>
##FORMAT=<ID=a105,Number=1,Type=String>
##FORMAT=<ID=a106,Number=1,Type=String>
##FORMAT=<ID=a107,Number=1,Type=String>
##FORMAT=<ID=a108,Number=1,Type=String>
##FORMAT=<ID=a109,Number=1,Type=String>
##FORMAT=<ID=a110,Number=1,Type=String>
##FORMAT=<ID=a111,Number=1,Type=String>
##FORMAT=<ID=a112,Number=1,Type=String>
##FORMAT=<ID=a113,Number=1,Type=String>
##FORMAT=<ID=a114,Number=1,Type=String>
##FORMAT=<ID=a115,Number=1,Type=String>
##FORMAT=<ID=a116,Number=1,Type=String>
##FORMAT=<ID=a117,Number=1,Type=String>
##FORMAT=<ID=a118,Number=1,Type=String>
##FORMAT=<ID=a119,Number=1,Type=String>
##FORMAT=<ID=a120,Number=1,Type=String>
##FORMAT=<ID=a121,Number=1,Type=String>
##FORMAT=<ID=a122,Number=1,Type=String>
##FORMAT=<ID=a123,Number=1,Type=String>
##FORMAT=<ID=a124,Number=1,Type=String>
##FORMAT=<ID=a125,Number=1,Type=String>
##FORMAT=<ID=a126,Number=1,Type=String>
##FORMAT=<ID=a127,Number=1,Type=String>
##FORMAT=<ID=a128,Number=1,Type=String>
##FORMAT=<ID=a129,Number=1,Type=String>
##FORMAT=<ID=a130,Number=1,Type=String>
##FORMAT=<ID=a131,Number=1,Type=String>
##FORMAT=<ID=a132,Number=1,Type=String>
##FORMAT=<ID=a133,Number=1,Type=String>
##FORMAT=<ID=a134,Number=1,Type=String>
##FORMAT=<ID=a135,Number=1,Type=String>
##FORMAT=<ID=a136,Number=1,Type=String>
##FORMAT=<ID=a137,Number=1,Type=String>
##FORMAT=<ID=a138,Number=1,Type=String>
##FORMAT=<ID=a139,Number=1,Type=String>
##FORMAT=<ID=a140,Number=1,Type=String>
##FORMAT=<ID=a141,Number=1,Type=String>
##FORMAT=<ID=a142,Number=1,Type=String>
##FORMAT=<ID=a143,Number=1,Type=String>
##FORMAT=<ID=a144,Number=1,Type=String>
##FORMAT=<ID=a145,Number=1,Type=String>
##FORMAT=<ID=a146,Number=1,Type=String>
##FORMAT=<ID=a147,Number=1,Type=String>
##FORMAT=<ID=a148,Number=1,Type=String>
##FORMAT=<ID=a149,Number=1,Type=String>
##FORMAT=<ID=a150,Number=1,Type=String>
##FORMAT=<ID=a151,Number=1,Type=String>
##FORMAT=<ID=a152,Number=1,Type=String>
##FORMAT=<ID=a153,Number=1,Type=String>
##FORMAT=<ID=a154,Number=1,Type=String>
##FORMAT=<ID=a155,Number=1,Type=String>
##FORMAT=<ID=a156,Number=1,Type=String>
##FORMAT=<ID=a157,Number=1,Type=String>
##FORMAT=<ID=a158,Number=1,Type=String>
##FORMAT=<ID=a159,Number=1,Type=String>
##FORMAT=<ID=a160,Number=1,Type=String>
##FORMAT=<ID=a161,Number=1,Type=String>
##FORMAT=<ID=a162,Number=1,Type=String>
##FORMAT=<ID=a163,Number=1,Type=String>
##FORMAT=<ID=a164,Number=1,Type=String>
##FORMAT=<ID=a165,Number=1,Type=String>
##FORMAT=<ID=a166,Number=1,Type=String>
##FORMAT=<ID=a167,Number=1,Type=String>
##FORMAT=<ID=a168,Number=1,Type=String>
##FORMAT=<ID=a169,Number=1,Type=String>
##FORMAT=<ID=a170,Number=1,Type=String>
##FORMAT=<ID=a171,Number=1,Type=String>
##FORMAT=<ID=a172,Number=1,Type=String>
##FORMAT=<ID=a173,Number=1,Type=String>
##FORMAT=<ID=a174,Number=1,Type=String>
##FORMAT=<ID=a175,Number=1,Type=String>
##FORMAT=<ID=a176,Number=1,Type=String>
##FORMAT=<ID=a177,Number=1,Type=String>
##FORMAT=<ID=a178,Number=1,Type=String>
##FORMAT=<ID=a179,Number=1,Type=String>
##FORMAT=<ID=a180,Number=1,Type=String>
##FORMAT=<ID=a181,Number=1,Type=String>
##FORMAT=<ID=a182,Number=1,Type=String>
##FORMAT=<ID=a183,Number=1,Type=String>
##FORMAT=<ID=a184,Number=1,Type=String>
##FORMAT=<ID=a185,Number=1,Type=String>
##FORMAT=<ID=a186,Number=1,Type=String>
##FORMAT=<ID=a187,Number=1,Type=String>
##FORMAT=<ID=a188,Number=1,Type=String>
##FORMAT=<ID=a189,Number=1,Type=String>
##FORMAT=<ID=a190,Number=1,Type=String>
##FORMAT=<ID=a191,Number=1,Type=String>
##FORMAT=<ID=a192,Number=1,Type=String>
##FORMAT=<ID=a193,Number=1,Type=String>
##FORMAT=<ID=a194,Number=1,Type=String>
##FORMAT=<ID=a195,Number=1,Type=String>
##FORMAT=<ID=a196,Number=1,Type=String>
##FORMAT=<ID=a197,Number=1,Type=String>
##FORMAT=<ID=a198,Number=1,Type=String>
##FORMAT=<ID=a199,Number=1,Type=String>
##contig=<ID=chr1,length=249250621,assembly=b37>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	X1
chr1	10327	.	T	C	.	.	S=foo;I=12345	a0:a1:a2:a3:a4:a5:a6:a7:a8:a9:a10:a11:a12:a13:a14:a15:a16:a17:a18:a19:a20:a21:a22:a23:a24:a25:a26:a27:a28:a29:a30:a31:a32:a33:a34:a35:a36:a37:a38:a39:a40:a41:a42:a43:a44:a45:a46:a47:a48:a49:a50:a51:a52:a53:a54:a55:a56:a57:a58:a59:a60:a61:a62:a63:a64:a65:a66:a67:a68:a69:a70:a71:a72:a73:a74:a75:a76:a77:a78:a79:a80:a81:a82:a83:a84:a85:a86:a87:a88:a89:a90:a91:a92:a93:a94:a95:a96:a97:a98:a99:a100:a101:a102:a103:a104:a105:a106:a107:a108:a109:a110:a111:a112:a113:a114:a115:a116:a117:a118:a119:a120:a121:a122:a123:a124:a125:a126:a127:a128:a129:a130:a131:a132:a133:a134:a135:a136:a137:a138:a139:a140:a141:a142:a143:a144:a145:a146:a147:a148:a149:a150:a151:a152:a153:a154:a155:a156:a157:a158:a159:a160:a161:a162:a163:a164:a165:a166:a167:a168:a169:a170:a171:a172:a173:a174:a175:a176:a177:a178:a179:a180:a181:a182:a183:a184:a185:a186:a187:a188:a189:a190:a191:a192:a193:a194:a195:a196:a197:a198:GT:GT:a199	a0:a1:a2:a3:a4:a5:a6:a7:a8:a9:a10:a11:a12:a13:a14:a15:a16:a17:a18:a19:a20:a21:a22:a23:a24:a25:a26:a27:a28:a29:a30:a31:a32:a33:a34:a35:a36:a37:a38:a39:a40:a41:a42:a43:a44:a45:a46:a47:a48:a49:a50:a51:a52:a53:a54:a55:a56:a57:a58:a59:a60:a61:a62:a63:a64:a65:a66:a67:a68:a69:a70:a71:a72:a73:a74:a75:a76:a77:a78:a79:a80:a81:a82:a83:a84:a85:a86:a87:a88:a89:a90:a91:a92:a93:a94:a95:a96:a97:a98:a99:a100:a101:a102:a103:a104:a105:a106:a107:a108:a109:a110:a111:a112:a113:a114:a115:a116:a117:a118:a119:a120:a121:a122:a123:a124:a125:a126:a127:a128:a129:a130:a131:a132:a133:a134:a135:a136:a137:a138:a139:a140:a141:a142:a143:a144:a145:a146:a147:a148:a149:a150:a151:a152:a153:a154:a155:a156:a157:a158:a159:a160:a161:a162:a163:a164:a165:a166:a167:a168:a169:a170:a171:a172:a173:a174:a175:a176:a177:a178:a179:a180:a181:a182:a183:a184:a185:a186:a187:a188:a189:a190:a191:a192:a193:a194:a195:a196:a197:a198:0/1:0/1:a199

You should probably try it right up to the maximum supported of 255 to check there are no boundary cases.

@pd3
Copy link
Member Author

pd3 commented Mar 14, 2024

This is now fixed and tested extensively, should be ready to go.

@jkbonfield
Copy link
Contributor

I'll try rebasing and pushing back. Some of these build errors are due to you basing this on an old htslib, which used htscodecs 1.4 instead of the current release (1.6).

@jkbonfield jkbonfield force-pushed the dup-FMT-tags branch 2 times, most recently from 0fd148c to 03899cb Compare March 14, 2024 17:13
pd3 added 2 commits March 14, 2024 17:41
The removal of large tags introduced by b49eea4 and 9db565d could
not work correctly, the memmove pointers were wrong!

Resolves samtools#1733
@jkbonfield jkbonfield merged commit 55cafdc into samtools:develop Mar 14, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bcftools corrupts duplicate GT format fields
2 participants