Extend VCF API to distinguish between INS and DEL variant types #1467
The change is largely backward API and ABI compatible unless the var_type flag is queried for equality in the user program. API alternatives for querying these flags are provided.
…eparately Resolves #1704 and depends on samtools/htslib#1467 to be merged
vcf.c
@@ -4261,6 +4261,30 @@ int bcf_get_variant_type(bcf1_t *rec, int ith_allele)
    if ( rec->d.var_type==-1 ) bcf_set_variant_types(rec);
    return rec->d.var[ith_allele].type;
If this was amended to be `return rec->d.var[ith_allele].type & 63;` (or some equivalent with a new macro to set the maximum legacy value of 63) then the old API would be 100% backwards compatible.
Obviously that would also need changes to `bcf_has_variant_type` to get the value direct instead of calling `bcf_get_variant_type`.
Yes, at the moment if you try linking bcftools against this branch you get six test failures. It's also arguable that calling the old API shouldn't store values beyond 63 in `d.var[].type` or `d.var_type` in case anyone tries to access them directly. This does happen a few times in bcftools, although luckily they all seem to use bitwise operators instead of equality checks so would probably still work.
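To make the compatibility concern concrete, here is a minimal sketch of why equality checks break once extra bits are stored, while bitwise checks keep working. The `EX_`-prefixed constants are local stand-ins assumed to mirror the `VCF_*` flag values; the helper names are hypothetical and are not part of htslib.

```c
#include <assert.h>

/* Assumption: local stand-ins mirroring the VCF_* flag values. */
enum {
    EX_VCF_REF     = 0,
    EX_VCF_SNP     = 1,
    EX_VCF_INDEL   = 4,
    EX_VCF_INS     = 64,
    EX_VCF_DEL     = 128,
    EX_LEGACY_MASK = 63   /* largest value the old API could return */
};

/* Fragile check: only true when no other bit is set. */
int ex_is_indel_by_equality(int type) { return type == EX_VCF_INDEL; }

/* Robust check: true whenever the INDEL bit is set. */
int ex_is_indel_by_mask(int type) { return (type & EX_VCF_INDEL) != 0; }

/* Hypothetical demo: an insertion carries both the INDEL and INS bits. */
void ex_demo_equality_vs_mask(void)
{
    int new_type = EX_VCF_INDEL | EX_VCF_INS;

    /* The extended value defeats the equality check but not the mask. */
    assert(!ex_is_indel_by_equality(new_type));
    assert(ex_is_indel_by_mask(new_type));

    /* Truncating to the legacy range restores the old behaviour. */
    int legacy = new_type & EX_LEGACY_MASK;
    assert(ex_is_indel_by_equality(legacy));
    assert(ex_is_indel_by_mask(legacy));
}
```

Masking with 63 in the old getters is what lets existing equality-based callers keep seeing the values they were written against.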
I just force pushed this change
vcf.c
inline static int _has_variant_type(int type, int bitmask, enum bcf_variant_match mode)
{
    if ( mode==overlap ) return type & bitmask;

    // VCF_INDEL is always set with VCF_INS and VCF_DEL by bcf_set_variant_type[s], but the bitmask may
    // ask for say `VCF_INS` or `VCF_INDEL` only
    if ( bitmask&(VCF_INS|VCF_DEL) && !(bitmask&VCF_INDEL) ) type &= ~VCF_INDEL;
    else if ( bitmask&VCF_INDEL && !(bitmask&(VCF_INS|VCF_DEL)) ) type &= ~(VCF_INS|VCF_DEL);

    if ( mode==subset )
    {
        if ( ~bitmask & type ) return 0;
        else return bitmask & type;
    }
    return type==bitmask ? type : 0;
}
Firstly, ideally we shouldn't use functions starting with underscore as they are meant to be system specific (although I'd be very surprised if anything actually checked).
My main concern though is this seems very complex. I don't understand what's going on with the `VCF_INS|VCF_DEL` stuff. What's the purpose of treating these fields in a special way before doing the AND checks? It'll also be a nightmare to document all these special cases, and indeed they're not documented in the public header file.
We could just add a 2 line `bcf_get_variant_type2` function which returns the full range of bits rather than the legacy truncated-range, and then let the calling code do whatever bit-checks it needs to do.
As an example:
int bcf_get_variant_type2(bcf1_t *rec, int ith_allele)
{
if ( rec->d.var_type==-1 ) bcf_set_variant_types(rec);
return rec->d.var[ith_allele].type;
}
int bcf_get_variant_type(bcf1_t *rec, int ith_allele)
{
return bcf_get_variant_type2(rec, ith_allele) & 63;
}
The `bcf_has_variant_type()` interface has one advantage, which is that it's less likely to be caught out by future changes which add more variant types. Defining a replacement `bcf_get_variant_type2()` in exactly the same way as the original would hit the problem that we have with the old one - you can't easily extend it. Adding a bitmask parameter would help with this, as callers would be able to set an expectation as to the types returned.
Whatever interface we end up with, it would also be useful to optionally return the `d.var[].n` or `d.n_var` values, as currently they can only be reached by accessing the data structure directly. This could be done by adding an `int *` parameter (which could be NULL if you're not interested in the value).
Another issue that would be worth considering is that neither the old nor the proposed interfaces can return an error code. It would be useful to be able to do that so we can eventually propagate errors up from `bcf_set_variant_types()`, which attempts to allocate memory and so could fail.
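The three suggestions above (a caller-supplied bitmask, an optional `int *` output, and an error return) could be combined roughly as follows. This is only a sketch on a mock record: `ex_rec_t`, `ex_variant_t` and `ex_get_variant_type()` are hypothetical names invented for illustration, and the real `bcf1_t` layout is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the per-allele data held in rec->d. */
typedef struct { int type; int n; } ex_variant_t;
typedef struct { ex_variant_t *var; int n_var; } ex_rec_t;

/*
 * Returns the allele's type bits restricted to `bitmask` (always >= 0),
 * or -1 if the record or allele index is invalid. If `length` is not
 * NULL, the allele's `n` value is stored through it on success.
 */
int ex_get_variant_type(const ex_rec_t *rec, int ith_allele,
                        unsigned bitmask, int *length)
{
    if (!rec || !rec->var || ith_allele < 0 || ith_allele >= rec->n_var)
        return -1;
    if (length) *length = rec->var[ith_allele].n;
    return (int)((unsigned)rec->var[ith_allele].type & bitmask);
}

/* Hypothetical demo with a REF allele, a SNP and an insertion (4|64). */
void ex_demo_get_variant_type(void)
{
    ex_variant_t v[3] = { {0, 0}, {1, 1}, {4 | 64, 1} };
    ex_rec_t rec = { v, 3 };
    int len = -99;

    assert(ex_get_variant_type(&rec, 2, 63u, &len) == 4);    /* legacy view */
    assert(len == 1);
    assert(ex_get_variant_type(&rec, 2, 255u, NULL) == 68);  /* full view */
    assert(ex_get_variant_type(&rec, 3, 255u, NULL) == -1);  /* bad index */
}
```

Because the caller states which bits it understands, adding further variant types later would not change what existing callers see.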
The purpose is explained in the comment. Please re-read, I really don't know how to explain this better :-(
The comment explains what it does, but not why it does it.
I also disagree that this is necessary in order to protect against future changes. We don't have a similar API for BAM flags for example. Instead it's documented that it is a clear bit-field and not something you should be doing naive equality checks against.
The correct way of handling this is via better documentation of the structure and API, instead of inventing a completely new API to do "and" and "or" checks when the language has them built in anyway.
It says "why". The user may want to ask for "is this an indel" or "is this an insertion". The rest is just following simple logic. This would of course not be required if you did not insist on non-breaking change. But if VCF_INDEL has to be preserved as a specific value rather than a combination of VCF_INS|VCF_DEL bits, this is needed.
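A small worked example may help show what the special-casing buys. The sketch below restates the logic of the helper under review with local names (`ex_`-prefixed, all hypothetical; the flag values are assumed to mirror `VCF_*`) and checks how an insertion, stored as `INDEL|INS`, answers different queries.

```c
#include <assert.h>

/* Assumption: local stand-ins mirroring the VCF_* flag values. */
enum { EX_VCF_SNP = 1, EX_VCF_INDEL = 4, EX_VCF_INS = 64, EX_VCF_DEL = 128 };
enum ex_match { ex_overlap, ex_subset, ex_exact };

int ex_match_variant_type(int type, int bitmask, enum ex_match mode)
{
    if (mode == ex_overlap) return type & bitmask;

    /* INDEL is always stored together with INS or DEL. Drop whichever
       side of that pair the caller did not ask about, so that it cannot
       spoil the subset and exact comparisons below. */
    if ((bitmask & (EX_VCF_INS | EX_VCF_DEL)) && !(bitmask & EX_VCF_INDEL))
        type &= ~EX_VCF_INDEL;
    else if ((bitmask & EX_VCF_INDEL) && !(bitmask & (EX_VCF_INS | EX_VCF_DEL)))
        type &= ~(EX_VCF_INS | EX_VCF_DEL);

    if (mode == ex_subset)
        return (~bitmask & type) ? 0 : (bitmask & type);
    return type == bitmask ? type : 0;
}

void ex_demo_match_modes(void)
{
    int ins = EX_VCF_INDEL | EX_VCF_INS;   /* how an insertion is stored */

    /* "Is this exactly an insertion?" and "is this exactly an indel?"
       both succeed, although the stored value equals neither mask. */
    assert(ex_match_variant_type(ins, EX_VCF_INS, ex_exact) == EX_VCF_INS);
    assert(ex_match_variant_type(ins, EX_VCF_INDEL, ex_exact) == EX_VCF_INDEL);
    assert(ex_match_variant_type(ins, EX_VCF_DEL, ex_exact) == 0);

    /* Subset: every remaining type bit must be covered by the mask. */
    assert(ex_match_variant_type(ins, EX_VCF_SNP | EX_VCF_INS, ex_subset) == EX_VCF_INS);
    assert(ex_match_variant_type(ins | EX_VCF_SNP, EX_VCF_INS, ex_subset) == 0);
}
```

Without the normalisation step, an exact query for `VCF_INS` alone would compare 68 against 64 and fail, which is the case the code comment is guarding against.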
Ok I understand.
I'm not saying your solution is wrong, but the usual way I see of doing this is to return the bit-field as-is and then have macros or static inline functions for interpreting it in a friendly fashion. Eg see the inode(7) man page.
The `stat.st_mode` field can be manually checked with AND/OR ops (eg `if ((sb.st_mode & S_IFMT) == S_IFBLK) {...}`), but there are also function-a-likes such as `if (S_ISBLK(sb.st_mode)) {...}` to do the same thing. If we wanted something compound, such as either character device or block device, then the logic could be `(sb.st_mode & (S_IFBLK|S_IFCHR)) == (S_IFBLK|S_IFCHR)` which can be made into e.g. an `S_IFDEV()` function.
The analogue here would be `VCF_IS_INS()`, `VCF_IS_DEL()` and `VCF_IS_INDEL()` functions which take the type field and apply appropriate boolean logic to return true or false. That's sort of what I was expecting, mainly due to familiarity with POSIX interfaces, but I can see that combining the match type (overlap, subset, exact) and bitmask into a single API achieves the same goal. So as I say it's neither right nor wrong, but something to consider. It's possible the POSIX style may be more in keeping with existing htslib APIs (eg `bcf_gt_is_missing`, `bcf_gt_is_phased`, `bcf_float_is_vector_end`, `bam_is_rev` etc).
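For comparison, the POSIX-style alternative could look something like the sketch below. The helper names follow the `VCF_IS_*` idea from the comment but are purely hypothetical, as are the `EX_`-prefixed constants (assumed to mirror the `VCF_*` values); nothing here is existing htslib API.

```c
#include <assert.h>

/* Assumption: local stand-ins mirroring the VCF_* flag values. */
enum { EX_VCF_SNP = 1, EX_VCF_INDEL = 4, EX_VCF_INS = 64, EX_VCF_DEL = 128 };

/* Predicates on the raw bit-field, in the spirit of S_ISBLK() and friends. */
static inline int ex_vcf_is_ins(int type)   { return (type & EX_VCF_INS) != 0; }
static inline int ex_vcf_is_del(int type)   { return (type & EX_VCF_DEL) != 0; }
static inline int ex_vcf_is_indel(int type) { return (type & EX_VCF_INDEL) != 0; }

void ex_demo_is_helpers(void)
{
    int ins = EX_VCF_INDEL | EX_VCF_INS;
    int del = EX_VCF_INDEL | EX_VCF_DEL;

    assert(ex_vcf_is_ins(ins) && !ex_vcf_is_del(ins) && ex_vcf_is_indel(ins));
    assert(ex_vcf_is_del(del) && !ex_vcf_is_ins(del) && ex_vcf_is_indel(del));
    assert(!ex_vcf_is_indel(EX_VCF_SNP));
}
```

The trade-off is that each new variant type needs its own predicate, whereas the bitmask interface takes new types as new mask values.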
I am happy for this to be changed to whatever you think will be most convenient
When the old bcf_get_variant_type[s] functions are used, the values stored and returned are identical to the old interface.
I've been looking at potentially changing this to use VCF_IS_XYZ style macros, like the POSIX stat(2) system call uses (aka inode(7)), or our I've reviewed the bcftools code that is using these fields to see how many honour their bit-field status and how many are (wrongly?) doing direct equality checks. My summary is:
So we see quite a few do direct equality checks while others can cope with bit-masking. Most peculiar is vcfview.c: https://github.com/samtools/bcftools/blob/develop/vcfview.c#L188-L195 All of these are e.g. I initially thought maybe it's because of VCF_REF==0, but then that's not going to work as 0 can't be something you ever compare against. VCF_REF is simply the absence of everything else. Furthermore, REF is permitted in the filter language, but it does the wrong thing:
REF => OTHER. Is that a bug? What does OTHER normally mean? (I see one case of setting OTHER is when ALT is "NON_REF", so clearly there is something not quite right here.)
As for the various EQUAL checks, some seem ok (plugins/fixref.c), while others are more dubious (https://github.com/samtools/bcftools/blob/develop/plugins/smpl-stats.c#L399).
To review this PR needs knowledge of what types can coexist, I guess. If VCF has "A G,AT" would type get set to both SNP and INS? It feels like it should, as both exist. Is it justified for tools to discard such lines as non-SNPs? Maybe fixref as it's hard to work, but for stats maybe not. What about vcfnorm.c, which rejects VCF_BND when it's the only type, but not if it is combined with anything else. Is that an error?
I suspect many of the equality checks could work as well, if not better, in a more appropriate bit-mask fashion.
More thoughts on this. We have For I'm thinking therefore:
I could be convinced also of |
I see a bit more here now. Note there are also other situations where With hindsight it was probably a mistake to use OTHER as the gVCF matches-ref state given the catch-all nature of it. Just as we've split indel up into INS and DEL, I could imagine in future splitting OTHER into the various categories that produce it, only one of which is the <*> symbolic allele. So we need to be cognisant of this and protect ourselves from future changes here.
I reviewed all the occurrences and they all are fine except the "unseen" gVCF symbolic star allele ( Unrelated to that, it was suggested that when the old
I am not sure if it's worth it. What do people think? |
My point raised in the meeting wasn't that the equality checks were incorrect, but that some of them could have been bit checks instead. Basically anything working on a single allele. Eg smpl-stats.c:
This is a single allele (type and not types) so only ever has one bit set. Hence As To this end, it's worth adding such commentary into the function documentation so people reading the header files know not to do equality checking unless there is absolutely no other alternative. |
@jkbonfield That's code in bcftools, and it is already dealt with by switching to the new API in pull request samtools/bcftools#1747
So, I think the conclusion on this was that we're going to go with the
I've had a go at implementing the suggested changes here, if anyone wants a look - the updates I made are in commit 7c3c29703. The corresponding changes to samtools/bcftools#1747 are in this branch. Looking at what I did there, I think the updated |
* Remove `mode` from bcf_has_variant_type() interface. Individual alleles only have a single variant type, so the only useful mode is the overlap one (bitwise-and).
* Put `bcf_match_` prefix on enumerated values, to avoid name clashes
* Make `bitmask` unsigned for more predictable bitwise operations (the return value still has to be signed, though).
* Return -1 if bcf_set_variant_types() fails, or if the requested allele is not valid.
* Add a bcf_variant_length() function, to more easily access the rec->d.var[].n field.
* Be more specific on specifying the mask used to restrict the types the old functions return, in case more are added later.
* Improve documentation in the header.
Comments added to commit 7c3c297 directly so they're alongside the relevant code.
Regarding @daviesrob's bcftools branch, I see lots of:
being replaced with
If that's how we're using it in practice, and replacing one API with the other, then it begs the question of whether we're correct to deprecate the old Edit: Oops I recall the difference now: the old one is actually equivalent to |
* Remove `mode` from bcf_has_variant_type() interface, and add a special case for `VCF_REF`. Individual alleles only have a single variant type, so the only useful mode is the overlap one (bitwise-and). The exception is VCF_REF, which is encoded as 0, so has to be tested for by equality.
* Put `bcf_match_` prefix on enumerated values, to avoid name clashes
* Make `bitmask` unsigned for more predictable bitwise operations (the return value still has to be signed, though).
* Return -1 if bcf_set_variant_types() fails, or if the requested allele is not valid. As callers using the legacy API won't be checking for a -1 return, these unfortunately need to be made to call exit(1) on failure. This is however an improvement on what would have happened under the same conditions before, which would most likely have been a NULL pointer dereference.
* Add a bcf_variant_length() function, to more easily access the rec->d.var[].n field.
* Be more specific on specifying the mask used to restrict the types the old functions return, in case more are added later.
* Improve documentation in the header.
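The behaviour described in this commit message can be sketched on a mock record as follows. This is an illustration of the stated semantics only, not the merged htslib code: the `ex_` names, the mock structures and the constant values are assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumption: local stand-ins mirroring the VCF_* flag values. */
enum { EX_VCF_REF = 0, EX_VCF_SNP = 1, EX_VCF_INDEL = 4,
       EX_VCF_INS = 64, EX_VCF_DEL = 128, EX_LEGACY_MASK = 63 };

/* Hypothetical stand-ins for the per-allele data held in rec->d. */
typedef struct { int type; int n; } ex_variant_t;
typedef struct { ex_variant_t *var; int n_var; } ex_rec_t;

/* New-style query: -1 on error, otherwise zero or non-zero for no match or match. */
int ex_allele_has_type(const ex_rec_t *rec, int ith_allele, unsigned bitmask)
{
    if (!rec || !rec->var || ith_allele < 0 || ith_allele >= rec->n_var)
        return -1;
    unsigned type = (unsigned)rec->var[ith_allele].type;
    /* REF is encoded as 0, so it cannot be found with a bitwise AND. */
    if (bitmask == (unsigned)EX_VCF_REF) return type == (unsigned)EX_VCF_REF;
    return (int)(type & bitmask);
}

/* Legacy-style getter: callers never check for -1, so fail hard instead. */
int ex_legacy_get_type(const ex_rec_t *rec, int ith_allele)
{
    if (!rec || !rec->var || ith_allele < 0 || ith_allele >= rec->n_var)
        exit(1);
    return rec->var[ith_allele].type & EX_LEGACY_MASK;
}

void ex_demo_final_interface(void)
{
    ex_variant_t v[3] = { {EX_VCF_REF, 0}, {EX_VCF_SNP, 1},
                          {EX_VCF_INDEL | EX_VCF_INS, 1} };
    ex_rec_t rec = { v, 3 };

    assert(ex_allele_has_type(&rec, 0, EX_VCF_REF) == 1);
    assert(ex_allele_has_type(&rec, 1, EX_VCF_REF) == 0);
    assert(ex_allele_has_type(&rec, 2, EX_VCF_INS) == EX_VCF_INS);
    assert(ex_allele_has_type(&rec, 1, EX_VCF_INS | EX_VCF_DEL) == 0);
    assert(ex_allele_has_type(&rec, 5, EX_VCF_SNP) == -1);
    assert(ex_legacy_get_type(&rec, 2) == EX_VCF_INDEL);
}
```

Callers of the new-style function are expected to test for a negative return first, and then treat any non-zero value as a match.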
Change rockylinux docker image to rockylinux:9 following deprecation of rockylinux:latest. Add perl-FindBin to installation list, as it's now in its own package.
I've pushed my suggested updates, with adjustments following comments, as a new commit. I've also made a small change to |
Rebased and merged as 8d91938
The change is largely backward API and ABI compatible unless the var_type flag is queried for equality in the user program, rather than used as a bitmask.
API alternatives for querying these flags in a robust way are provided.
This is linked to #1454 and samtools/bcftools#1704