Multiple characters are not a string... or are they? #631
The specifications are indeed unclear.
Internally, the BCF format encodes a Character field as a String so if your tooling ignores the The following should be clarified:
The specs do not explicitly state that commas must be used as delimiters. Are they required or optional for Type=Character fields? This is only an issue for
VCF should specify whether Type=Character values correspond to a single UTF-8 code unit, code point, or glyph. It should be a code point, but I suspect some existing tools use a code unit and thus do not actually support non-ASCII characters.
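To make the difference between those candidate meanings concrete, here is a small Python sketch (standard library only; the sample strings are illustrative and not taken from any VCF file). It counts UTF-8 code units and code points for the same visible character. Counting glyphs (grapheme clusters) needs a text-segmentation library, so it is left out.

```python
import unicodedata

def utf8_code_units(s: str) -> int:
    """Number of UTF-8 code units (bytes) needed to encode s."""
    return len(s.encode("utf-8"))

def code_points(s: str) -> int:
    """Number of Unicode code points in s (what len() counts for a Python str)."""
    return len(s)

samples = {
    "ASCII A": "A",
    "precomposed e-acute": "\u00e9",
    "e + combining acute": "e\u0301",
}

for label, s in samples.items():
    print(f"{label}: {utf8_code_units(s)} code unit(s), {code_points(s)} code point(s)")

# Both e-acute spellings render as one glyph, yet only the precomposed
# form is a single code point; NFC normalization maps one to the other.
print(unicodedata.normalize("NFC", "e\u0301") == "\u00e9")
```

A tool that reads exactly one byte per Character value would split the precomposed e-acute in half, which is the failure mode suspected above.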
But the spec states:
This clearly mandates that multiple values are comma-separated or not? It does not say
This does have implications though: the
Yeah, I don't even want to go down the UTF-8 rabbit hole. Maybe it would just be easier to deprecate
I think what's confusing here is mixing up VCF with BCF. The bulk of the specification states how the VCF format works. BCF is then a binarisation of that text document. Things which are stated as part of VCF for purposes of readability do not necessarily translate over to BCF. PL fields, for example, are stored as the array size followed by the encoded numbers. There's obviously no comma separating them in the binary structure. Given that an array of characters is different from a string in VCF parlance, I think it's not too unexpected for them to be treated similarly to an array of 8-bit integers, but it is perhaps a little opaque.

The BCF section needs to be explicit here. It's rather ambiguous, as it states that there are types for MISSING, 8, 16, and 32 bit integers, 32 bit floats, and ASCII characters. But just a few sentences later it boldly claims that characters are not explicitly typed, despite clearly having a numeric type code associated with them. I suspect this is wrong, but I haven't dived into the history to see if one statement appeared later.
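As a rough illustration of the binary layout being discussed: as I read the BCF2 section, every typed value starts with a descriptor byte that holds the element count in the high four bits and the type code in the low four bits (1, 2, 3 for 8/16/32-bit integers, 5 for float, 7 for character). The sketch below follows that reading and only handles vectors shorter than 15 elements; the function names are my own and do not come from htslib or htsjdk.

```python
import struct

BCF_INT8 = 1  # type code for 8-bit integers
BCF_CHAR = 7  # type code for characters

def typed_descriptor(n: int, type_code: int) -> bytes:
    """Descriptor byte: element count in the high nibble, type code in the low nibble."""
    if not 0 <= n < 15:
        raise ValueError("this sketch only handles lengths below 15")
    return bytes([(n << 4) | type_code])

def encode_int8_vector(values: list[int]) -> bytes:
    """Vector of 8-bit integers: descriptor, then one signed byte per value."""
    return typed_descriptor(len(values), BCF_INT8) + struct.pack(f"<{len(values)}b", *values)

def encode_char_vector(text: str) -> bytes:
    """Character data: descriptor, then the raw ASCII bytes, no separators added."""
    data = text.encode("ascii")
    return typed_descriptor(len(data), BCF_CHAR) + data

print(encode_int8_vector([1, 2]).hex())  # 210102
print(encode_char_vector("AB").hex())    # 274142
print(encode_char_vector("A,B").hex())   # 37412c42
```

Under this layout, storing "A,B" verbatim yields a three-element character vector whose middle byte is the comma (0x2c), whereas a true two-element array of characters would be the two-byte form.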
I see three options:
@lbergelson do you know what htsjdk does for
I had this totally back to front. I thought from reading this post the problem was that in VCF it's
So BCF is keeping the commas. This does indeed seem at odds with the description, as it's treating it as a string instead of a list of objects. I tried the latest version of Picard, and it also produced "A,B" and "P,Q", although the "." values changed:
Sadly I can't do cross-validation between tools, as Picard doesn't support modern BCF and bcftools doesn't support ancient BCF. Sigh. However, it appears both tools take the approach that multiple characters are stored as a string. Picard doesn't round-trip properly and just discards the

I tried changing the strings to nonsensical things: more than 2 items, or separated by spaces, semicolons, even brackets, dollars, etc. They're just stored verbatim by both tools. So I think we need to make the spec match the common usage in the wild.

Edit: I tried bcftools 0.1.19, which ought to cope with the old BCF format, or so I thought, but it can read neither the VCF nor the BCF, giving totally meaningless errors:
I think I'm not qualified to comment much more on this rat's nest, as I am clearly out of my depth with the shifting sands.

Edit: So I did some more experiments.
Seriously, can we just nuke the BCF part of the spec and pretend it never existed? :-) What we have now is basically unworkable as an interchange format. Each tool, and even each version of each tool, is essentially using a tool-specific format with no thought to interchange. :(
Yeah, at this point, I think the only viable option is to just standardise what htslib is doing. That way, at least we get a concise description of what files in the wild look like, and what most implementations expect. I would also recommend deprecating CHARACTER in favour of STRING.
That's what I have been trying to say from the beginning, but maybe I wasn't clear about it.
Ran into the same issue as @h-2. I'm also wondering about the htsjdk behavior.
I'd like to give this issue a push again, as it's fundamental to the format and breaks interoperability. I believe most of the ambiguity has already been discussed, but a decision needs to be made to move this forward.
It feels to me that we just need to document how the modern tools interpret it. We declare the format for arrays to be comma-separated, but arrays of characters have no separation. This is useful, as otherwise we couldn't encode the character ','. Besides that, implementations need to lead the specification here, given it's always been this way. So my suggestion as a minimal fix would simply be to add a string caveat as mentioned above: #631 (comment)
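A rough sketch of the two readings of a multi-valued Type=Character field in VCF text, to show where they diverge. The helper names are hypothetical and not taken from bcftools or htsjdk: `parse_strict` treats the value like any other comma-separated array, while `parse_verbatim` follows the in-the-wild behaviour described in this thread, where the value is just a string of characters.

```python
def parse_strict(raw: str) -> list[str]:
    """Array reading: comma-separated elements, each exactly one character."""
    items = raw.split(",")
    for item in items:
        if len(item) != 1:
            raise ValueError(f"not a single character: {item!r}")
    return items

def parse_verbatim(raw: str) -> list[str]:
    """String reading: every character of the value is an element, commas included."""
    return list(raw)

print(parse_strict("A,B"))    # ['A', 'B']
print(parse_verbatim("AB"))   # ['A', 'B']
print(parse_verbatim("A,B"))  # ['A', ',', 'B']
print(parse_verbatim(".."))   # ['.', '.']

try:
    parse_strict("AB")
except ValueError as err:
    print("strict reading rejects AB:", err)
```

Only the verbatim reading can represent ',' as a value, which is the advantage noted above; the cost is that "A,B" silently becomes three characters rather than two.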
To answer the question of what HTSJDK 4.1.0 does, I tested with this input (
For VCF, it returns the value as a string (
I converted the above input using Picard 3.1.1 (htsjdk 4.0.2; BCF 2.1).

    java -jar picard.jar VcfFormatConverter --REQUIRE_INDEX false --INPUT test.vcf --OUTPUT test.htsjdk.bcf

For BCF, HTSJDK returns the value as an array (
This shows, particularly with When encoding, HTSJDK prepends a comma to each character value, e.g.,
This behavior can also be seen in #631 (comment).
I have a VCF file with the following header line:
So this field is a "list of characters". Per my understanding of the spec, this means that its values shall be VCF-encoded as `A,B` (note the comma). The file, however, encodes them as `AB` (no comma!). And it even encodes "missing values" as `..` and not as `.,.` or just `.`. And bcftools gladly accepts this and can also convert it back and forth through BCF, preserving the `..`.

Based on what I think the spec is saying, my implementation treats "Character" as a datatype, and if it expects a list of them, it looks for/introduces the proper separator. Should I instead treat "list of characters" as a "fixed-size string", and is an encoding with a separator invalid?