Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple characters are not a string... or are they? #631

Open
h-2 opened this issue Feb 22, 2022 · 11 comments
Open

Multiple characters are not a string... or are they? #631

h-2 opened this issue Feb 22, 2022 · 11 comments
Labels

Comments

@h-2
Copy link

h-2 commented Feb 22, 2022

I have a VCF file with the following header line:

##FORMAT=<ID=RNC, Number=2,Type=Character,Description="...">

So this field is a "list of characters". Per my understanding of the spec, this means that its values shall be VCF-encoded as A,B -- note the comma. The file, however, encodes them as AB (no comma!). And it even encodes "missing values" as .. and not as .,. or just ..
And bcftools gladly accepts this and can also convert it back and forth through BCF preserving the ...

Based on what I think the spec is saying, my implementation treats "Character" as a datatype, and if it expects a list of them, it looks for/introduces the proper separator. Should I instead treat "list of character" as "fixed size string" and is an encoding with separator invalid?

@d-cameron
Copy link
Contributor

The specifications are indeed unclear.

bcftools gladly accepts this and can also convert it back and forth through BCF preserving the ...

Internally, the BCF format encodes a Character field as a String so if your tooling ignores the Type field, or just treats Type=Character as a single String (since that's how it's encoded in BCF) then yes, Character lists will be treated as Number=1 Strings.

The following should be clarified:

  • 1.6.1.8. "literal commas (‘,’) are permitted only as delimiters for lists of values"

The specs do not explicitly state that commas must be used as delimiters. Are they required or optional for Type=Character fields? This is only an issue for Type=Character fields as the lack of commas causes ambiguity for all other list types.

  • 1.2 "The character encoding of VCF files is UTF-8" / 6.3.3 "ASCII encoded in 8 bits"

VCF should specify whether Type=Character values correspond to a single UTF-8 code unit, code point, or glyph. It should be code point but I suspect some existing tools will use code unit thus not actually support non-ASCII characters.

@h-2
Copy link
Author

h-2 commented Feb 23, 2022

The specs do not explicitly state that commas must be used as delimiters.

But the spec states:

Each subfield consists of a short key with optional values in the format:
key[=value[, . . . ,value]]

This clearly mandates that multiple values are comma-separated or not? It does not say key[=value[[,] . . . [,]value]] ...
But there is no point arguing this as all implementations need to follow what htslib does and not what the spec says to be compatible.
I would still really like if this could be clarified in some simple way, a la Multiple values are delimited by "," except if the Type is Character. Multiple Characters are not delimited and may be treated as a single String by implementations..

Character lists will be treated as Number=1 Strings.

This does have implications though: the .. I mentioned above will not be treated as "two missing values" or "one missing value", it will instead be treated as "a string with the value '..'".

VCF should specify whether Type=Character values correspond to a single UTF-8 code unit

Yeah, I don't even want to go down the UTF8 rabbit hole. Maybe it would just be easier to deprecate Type=Character and tell people to use String instead?

@jmarshall jmarshall added the vcf label Feb 23, 2022
@jkbonfield
Copy link
Contributor

jkbonfield commented Feb 24, 2022

I think what's confusing here is mixing up VCF with BCF. The bulk of the specification states how the VCF format works. BCF is then a binarisation of that text document. Things which are stated as part of VCF for purposes of readability do not necessarily translate over to BCF. PL fields for example are stored as the array size followed by the encoding numbers. There's obviously no comma separating them in the binary structure.

Given an array of characters is different from a string in the VCF parlance, I think it's not too unexpected for them to be treated similar to an array of 8-bit integers, but it is perhaps a little opaque. The BCF section needs to be explicit here.

It's rather ambiguous as it states that there are types for MISSING, 8, 16, and 32 bit integers, 32 bit floats, and ASCII characters. But just a few sentences later on it then boldly claims there characters are not explicitly types, despite clearly having a numeric type code associated to them. I suspect this is wrong, but I haven't dived into the history to see if one statement appeared later.

@d-cameron
Copy link
Contributor

I see three options:

  1. Leave specs as-is, add test cases indicating that character fields should be separated by commas. Breaks htslib
  2. Allow commas to be optional and leave it to the parser to handle char fields with and without commas. Breaks tools using htslib on files with commas.
  3. State that character fields must not include comma separators. Compatible with htslib.

@lbergelson do you know what htsjdk does for Type=Character fields?

@jkbonfield
Copy link
Contributor

jkbonfield commented Mar 29, 2022

I had this totally back to front. I thought from reading this post the problem was that in VCF it's A,B while in BCF it's AB, but it's not:

@ seq4c[samtools.../bcftools]; cat /tmp/view.vcf 
##fileformat=VCFv4.3
##contig=<ID=20,length=99999999>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##FORMAT=<ID=RNC, Number=2,Type=Character,Description="...">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001NA00002	NA00003
20	12345	.	T	C	999	PASS	DP=99	RNC	A,B	P,Q	.,.
@ seq4c[samtools.../bcftools]; bcftools view -Ou /tmp/view.vcf |hd
00000000  42 43 46 02 02 df 01 00  00 23 23 66 69 6c 65 66  |BCF......##filef|
00000010  6f 72 6d 61 74 3d 56 43  46 76 34 2e 33 0a 23 23  |ormat=VCFv4.3.##|
00000020  46 49 4c 54 45 52 3d 3c  49 44 3d 50 41 53 53 2c  |FILTER=<ID=PASS,|
...
000001f0  00 00 00 00 38 30 00 00  01 00 00 00 00 c0 79 44  |....80........yD|
00000200  01 00 02 00 03 00 00 01  07 17 54 17 43 11 00 11  |..........T.C...|
00000210  01 11 63 11 02 37 41 2c  42 50 2c 51 2e 2c 2e     |..c..7A,BP,Q.,.|

So BCF is keeping the commas. This does indeed seem at odds with the description, as it's treating it as a string instead of a list of objects.

I tried the latest version of Picard, and it also produced "A,B" and "P,Q", although the "." values changed:

@ seq4c[samtools.../bcftools]; \java -jar picard-2.26.11.jar VcfFormatConverter I=/tmp/view.vcf O=/tmp/view.bcf REQUIRE_INDEX=false 2>/dev/null;hd /tmp/view.bcf
00000000  42 43 46 02 01 fb 00 00  00 23 23 66 69 6c 65 66  |BCF......##filef|
00000010  6f 72 6d 61 74 3d 56 43  46 76 34 2e 32 0a 23 23  |ormat=VCFv4.2.##|
00000020  46 4f 52 4d 41 54 3d 3c  49 44 3d 52 4e 43 2c 4e  |FORMAT=<ID=RNC,N|
00000030  75 6d 62 65 72 3d 32 2c  54 79 70 65 3d 43 68 61  |umber=2,Type=Cha|
00000040  72 61 63 74 65 72 2c 44  65 73 63 72 69 70 74 69  |racter,Descripti|
00000050  6f 6e 3d 22 2e 2e 2e 22  3e 0a 23 23 49 4e 46 4f  |on="...">.##INFO|
00000060  3d 3c 49 44 3d 44 50 2c  4e 75 6d 62 65 72 3d 31  |=<ID=DP,Number=1|
00000070  2c 54 79 70 65 3d 49 6e  74 65 67 65 72 2c 44 65  |,Type=Integer,De|
00000080  73 63 72 69 70 74 69 6f  6e 3d 22 52 61 77 20 72  |scription="Raw r|
00000090  65 61 64 20 64 65 70 74  68 22 3e 0a 23 23 63 6f  |ead depth">.##co|
000000a0  6e 74 69 67 3d 3c 49 44  3d 32 30 2c 6c 65 6e 67  |ntig=<ID=20,leng|
000000b0  74 68 3d 39 39 39 39 39  39 39 39 3e 0a 23 43 48  |th=99999999>.#CH|
000000c0  52 4f 4d 09 50 4f 53 09  49 44 09 52 45 46 09 41  |ROM.POS.ID.REF.A|
000000d0  4c 54 09 51 55 41 4c 09  46 49 4c 54 45 52 09 49  |LT.QUAL.FILTER.I|
000000e0  4e 46 4f 09 46 4f 52 4d  41 54 09 4e 41 30 30 30  |NFO.FORMAT.NA000|
000000f0  30 31 09 4e 41 30 30 30  30 32 09 4e 41 30 30 30  |01.NA00002.NA000|
00000100  30 33 0a 00 24 00 00 00  0f 00 00 00 00 00 00 00  |03..$...........|
00000110  38 30 00 00 01 00 00 00  00 c0 79 44 01 00 02 00  |80........yD....|
00000120  03 00 00 01 17 2e 17 54  17 43 11 00 11 02 11 63  |.......T.C.....c|
00000130  11 01 47 2c 41 2c 42 2c  50 2c 51 00 00 00 00     |..G,A,B,P,Q....|

Sadly I can't do cross-validation between tools as Picard doesn't support modern BCF and Bcftools doesn't support ancient BCF. Sigh. However it appears both tools take the approach of multiple characters are stored as a string. Picard doesn't round-trip properly and just discards the .,. part, leading to an incorrect number of FORMAT fields.

I tried changing the strings to nonsensical things: more than 2 items, or separated by spaces, semicolons, even brackets, dollars, etc. They're just stored verbatim by both tools.

So I think we need to make the spec match the common usage in the wild.

Edit: I tried bcftools 0.1.19 which ought to cope with the old BCF format, or so I thought, but it can't read either the VCF nor the BCF, giving totally meaningless errors:

[bcf_sync] incorrect number of fields (0 != 5) at 0:0

I think I'm not qualified to comment much more on this rat's nest as clearly am out of my depth with the shifting-sands.

Edit: So I did some more experiments.

  • Bcftools (current) cannot read Picard (current) BCFs.
  • Picard cannot read Bcftools (current) BCFs
  • Bcftools (current) cannot read Bcftools-0.1.19 BCFs.
  • Picard cannot read Bcftools-0.1.19 BCFs
  • Bcftools-0.1.19 cannot read Bcftools (current) BCFs
  • Bcftools-0.1.19 cannot read Picard BCFs.
  • Bcftools-0.1.19 cannot read Bcftools-0.1.19 BCFs! (It just SEGVs, but maybe I sacrificed the wrong chicken or something).

Seriously, can we just nuke the BCF part of the spec and pretend it never existed? :-) What we have now is basically unworkable as an interchange. Each tool and even each version of each tool is essentially using a tool specific format with no thought to interchange. :(

@h-2
Copy link
Author

h-2 commented Mar 31, 2022

State that character fields must not include comma separators. Compatible with htslib.

Yeah, at this point, I think the only viable option is just standardise what htslib is doing. That way, at least we get a concise description of what files in the wild look like, and what most implementations expect. I would also recommend deprecating CHARACTER in favour of STRING.

@h-2
Copy link
Author

h-2 commented Mar 31, 2022

I thought from reading this post the problem was that in VCF it's A,B while in BCF it's AB, but it's not:

That's what I have been trying to say from the beginning, but maybe I wasn't clear about it.

@dennishendriksen
Copy link

Ran into the same issue as @h-2. I'm also wondering about the htsjdk behavior.
@lbergelson Could you take a look at the question from @d-cameron? Thanks in advance.
Would be great if the VCF specification could be clarify the accepted values for type CHARACTER.

@zaeleus
Copy link

zaeleus commented Feb 21, 2024

I'd like to give this issue a push again, as it's fundamental to the format and breaks interoperability. I believe most of the ambiguity has already been discussed, but a decision needs to be made to move this forward.

@jkbonfield
Copy link
Contributor

It feels to me that we just need to document how the modern tools interpret it. We declare the format for arrays to be comma separated, but arrays of characters have no separation. This is useful, as otherwise we couldn't encode the character ','. Besides that, implementations need to lead the specifications here given it's always been this way.

So my suggestion as a minimal fix would simply be the add a string caveat as mentioned above: #631 (comment)

@zaeleus
Copy link

zaeleus commented Apr 10, 2024

To answer the question of what HTSJDK 4.1.0 does, I tested with this input (test.vcf):

##fileformat=VCFv4.2
##contig=<ID=sq0>
##FORMAT=<ID=RNC,Number=2,Type=Character,Description="...">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	sample1	sample2	sample3	sample4	sample5
sq0	1	.	N	.	.	.	.	RNC	a,b	.,c	.,.	de	.

For VCF, it returns the value as a string (java.lang.String) or null when the value matches ..

a,b
.,c
.,.
de
null

I converted the above input using Picard 3.1.1 (htsjdk 4.0.2; BCF 2.1).

java -jar picard.jar VcfFormatConverter --REQUIRE_INDEX false --INPUT test.vcf --OUTPUT test.htsjdk.bcf

For BCF, HTSJDK returns the value as an array (java.util.Arrays$ArrayList) if there are many comma-separated values, a string (java.lang.String) if there is a single value, or null when all comma-separated values are missing. (Edit: Also note how it drops inner missing values.)

[a, b]
c
null
de
null

This shows, particularly with .,., that HTSJDK expects BCF character values to be comma-separated.

When encoding, HTSJDK prepends a comma to each character value, e.g., [a, b] becomes ,a,b, which is quite weird.

$ xxd --seek 223 test.htsjdk.bcf
000000df: 472c 612c 622c 6300 0000 0000 0000 0000  G,a,b,c.........
000000ef: 00

This behavior can also be seen in #631 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: To do (backlog)
Development

No branches or pull requests

6 participants