Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend VCF API to distinguish between INS and DEL variant types #1467

Closed
wants to merge 4 commits into from

Conversation

pd3
Copy link
Member

@pd3 pd3 commented Jul 6, 2022

The change is largely backward API and ABI compatible unless the var_type flag is queried for equality in the user program, rather than used as a bitmask.

API alternatives for querying these flags in a robust way are provided.

This is linked to #1454 and samtools/bcftools#1704

The change is largerly backward API and ABI compatible unless
the var_type flag is queried for equality in the user program.
API alternatives for querying these flags is provided.
pd3 added a commit to samtools/bcftools that referenced this pull request Jul 6, 2022
vcf.c Outdated
@@ -4261,6 +4261,30 @@ int bcf_get_variant_type(bcf1_t *rec, int ith_allele)
if ( rec->d.var_type==-1 ) bcf_set_variant_types(rec);
return rec->d.var[ith_allele].type;
Copy link
Contributor

@jkbonfield jkbonfield Jul 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this was amended to be return rec->d.var[ith_allele].type & 63; (or some equivalent with a new macro to set the maximum legacy value of 63) then the old API would be 100% backwards compatible.

Obviously that would also need changes to bcf_has_variant_type to get the value direct instead of calling bcf_get_variant_type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, at the moment if you try linking bcftools against this branch you get six test failures. It's also arguable that calling the old API shouldn't store values beyond 63 in d.var[].type or d.var_type in case anyone tries to access them directly. This does happen a few times in bcftools, although luckily they all seem to use bitwise operators instead of equality checks so would probably still work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just force pushed this change

vcf.c Outdated
Comment on lines 4264 to 4279
inline static int _has_variant_type(int type, int bitmask, enum bcf_variant_match mode)
{
if ( mode==overlap ) return type & bitmask;

// VCF_INDEL is always set with VCF_INS and VCF_DEL by bcf_set_variant_type[s], but the bitmask may
// ask for say `VCF_INS` or `VCF_INDEL` only
if ( bitmask&(VCF_INS|VCF_DEL) && !(bitmask&VCF_INDEL) ) type &= ~VCF_INDEL;
else if ( bitmask&VCF_INDEL && !(bitmask&(VCF_INS|VCF_DEL)) ) type &= ~(VCF_INS|VCF_DEL);

if ( mode==subset )
{
if ( ~bitmask & type ) return 0;
else return bitmask & type;
}
return type==bitmask ? type : 0;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Firstly, ideally we shouldn't use functions starting with underscore as they are meant to be system specific (although I'd be very surprised if anything actually checked).

My main concern though is this seems very complex. I don't understand what's going on with the VCF_INS|VCF_DEL stuff. What's the purpose of treating these fields in a special way before doing the AND checks? It'll also be a nightmare to document all these special cases, and indeed they're not documented in the public header file.

We could just add a 2 line bcf_get_variant_type2 function which returns the full range of bits rather than the legacy truncated-range, and then let the calling code do whatever bit-checks it needs to do.

As an eample:

int bcf_get_variant_type2(bcf1_t *rec, int ith_allele)
{
    if ( rec->d.var_type==-1 ) bcf_set_variant_types(rec);
    return rec->d.var[ith_allele].type;
 }

int bcf_get_variant_type(bcf1_t *rec, int ith_allele)
{
    return bcf_get_variant_type2(rec, ith_allele) & 63;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bcf_has_variant_type() interface has one advantage, which is that it's less likely to be caught out by future changes which add more variant types. Defining a replacement bcf_get_variant_type2() in exactly the same way as the original would hit the problem that we have with the old one - you can't easily extend it. Adding a bitmask parameter would help with this, as callers would be able to set an expectation as to the types returned.

Whatever interface we end up with, it would also be useful to optionally return the d.var[].n or d.n_var values, as currently they can only be reached by accessing the data structure directly. This could be done by adding an int * parameter (which could be NULL if you're not interested in the value).

Another issue that would be worth considering is that neither the old or proposed interfaces can return an error code. It would be useful to be able to do that so we can eventually propagate errors up from bcf_set_variant_types() which attempts to allocate memory, which could fail.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose is explained in the comment. Please re-read, I really don't know how to explain this better :-(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment explains what it does, but not why it does it.

I also disagree that this is necessary in order to protect against future changes. We don't have a similar API for BAM flags for example. Instead it's documented that it is a clear bit-field and not something you should be doing naive equality checks against.

The correct way of handling this is via better documentation of the structure and API, instead of inventing a completely new API to do "and" and "or" checks when the language has them built in anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It says "why". The user may want to ask for "is this an indel" or "is this an insertion". The rest is just following simple logic. This would of course not be required if you did not insist on non-breaking change. But if VCF_INDEL has to be preserved as a specific value rather than a combination of VCF_INS|VCF_DEL bits, this is needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I understand.

I'm not saying your solution is wrong, but the usual way I see of doing this is to return the bit-field as-is and then have macros or static inline functions for interpreting it in a friendly fashion. Eg see the inode(7) man page.

The stat.st_mode field can be manually checked with AND/OR ops (eg if ((sb.st_mode & S_IFMT) == S_IFBLK) {...}, but there are also function-a-likes such as if (S_IFBLK(st_mode)) {...} to do the same thing. If we wanted something compound, such as either character device or block device, then the logic could be (sb_st.mode & (S_IFBLK|S_IFCHR)) == (S_IFBLK|S_IFCHR) which can be made into e.g. S_IFDEV() function.

The analogue here would be a VCF_IS_INS(), VCF_IS_DEL() and VCF_IS_INDEL() functions which take the type field and apply appropriate boolean logic to return true or false. That's sort of what I was expecting mainly due to familiarity with POSIX interfaces, but I can see combined the match type (overlap, subset, exact) and bitmask into a single API achieves the same goal. So as I say it's neither right nor wrong, but something to consider. It's possible the POSIX style may be more in keeping with existing htslib APIs (eg bcf_gt_is_missing, bcf_gt_is_phased, bcf_float_is_vector_end, bam_is_rev etc).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am happy for this to be changed to whatever you think will be most convenient

When the old bcf_get_variant_type[s] functions are used, the values
stored and returned are identical to the old interface.
@jkbonfield
Copy link
Contributor

jkbonfield commented Jul 15, 2022

I've been looking at potentially changing this to use VCF_IS_XYZ style macros, like the posix stat(2) system call uses (aka inode(7)), or our bam_is_rev() function.

I've reviewed the bcftools code that is using these fields to see how many honour their bit-field status and how many are (wrongly?) doing direct equality checks. My summary is:

plugins/counts.c:       AND     type & VCF_SNP / MNP / OTHER                    
plugins/fill-tags:      AND     & SNP, MNP, OTHER, BND / OVERLAP                
plugins/fixref.c:       EQUAL   != SNP                                          
"/parental-origin.c:    EQUAL   != SNP                                          
"/smpl-stats.c:         EQUAL   ==SNP || ==MNP                                  
"/guess-ploidy.c:       AND     & SNP                                           
consensus.c             AND     & OTHER                                         
convert.c               AND     & SNP / MNP / OTHER / BND / OVERLAP             
filter.c                ??                                                      
vcffilter.c             AND     & SNP, & SNP|MNP|BND|OTHER|OVERLAP combo        
vcf_merge.c             AND     & SNP, & SNP|MNP, & INDEL                       
vcf_norm.c              EQUAL   == VCF_BND                                      
vcf_norm.c              AND     & SNP, & (SNP|MNP)                              
vcfstats.c              EQUAL   == SNP                                          
vcfstats.c              AND     & SNP / MNP / OTHER                             
vcfview.c               AND     & (SNP|INDEL|MNP|OTHER|BND)                     
                                NB "REF" => "OTHER". Bug?                       

So we see quite a few do direct equality checks while other can cope with bit-masking. Most perculiar is vcfview.c:

https://github.com/samtools/bcftools/blob/develop/vcfview.c#L188-L195

All of these are e.g. args->include |= VCF_INDEL<<1. When they're used it's with (line_type<<1) & args->include. So why do we <<1 when setting and <<1 when comparing? I cannot see this used in any other context.

I initially thought maybe it's because of VCF_REF==0, but then that's not going to work as 0 can't be something you ever compare against. VCF_REF is simply the absence of everything else.

Furthermore, REF is permitted in the filter language, but it does the wrong thing:

    else if (strcmp(type_list[i], "ref") == 0) args->include |= VCF_OTHER<<1;

REF => OTHER. Is that a bug? What does OTHER normally mean? (I see one case of setting OTHER is when ALT is "NON_REF", so clearly there is something not quite right here)

As for the various EQUAL checks, some seem ok (plugins/fixref.c), while others are more dubious (https://github.com/samtools/bcftools/blob/develop/plugins/smpl-stats.c#L399). To review this PR needs knowledge of what types can coexist I guess.

If VCF has "A G,AT" would type get set to both SNP and INS? It feels like it should as both exist. Is it justified for tools to discard such lines as non-SNPs? Maybe fixref as it's hard to work, but for stats maybe not. What about vcfnorm.c which rejects VCF_BND when it's the only type, but not if it is combined with anything else. Is that an error? I suspect many of the equality checks could work as well, if not better, in a more appropriate bit-mask fashion.

@jkbonfield
Copy link
Contributor

More thoughts on this.

We have bcf_set_variant_type (singular) which traditionally only set type to one value (one bit). For this, type == VCF_SNP and type & VCF_SNP are identical as only one bit is set. The exception of course is VCF_INDEL which now occupies multiple bits, but had we been doing equality checking with & or the single allele query then we'd have been immune to this change anyway. So, with hindsight, we should probably never do equality checks for single allele type.

For bcf_set_variant_types, multi-allelic, it ORs together all the single allele type values. That basically means all combinations are possible. It also sadly means it's impossible to know if one allele is REF and another is SNP as REF is 0 and cannot be ORed together in any meaningful way. (Perhaps in future we can add an additional REF to permit this check to work and avoid the bizarreness of REF == OTHER (or "NON_REF"!) in vcfview.c.) So for types plural, equality checking makes little sense unless it is explicitly to say all alleles are of the same class. So it has a use still, but is very isolated.

I'm thinking therefore:

#define VCF_IS_REF(t)     ( (t) == 0)                                           
#define VCF_HAS_SNP(t)     ((t) & VCF_SNP)                                      
#define VCF_HAS_MNP(t)     ((t) & VCF_MNP)                                      
#define VCF_HAS_INDEL(t)   ((t) & (VCF_INDEL|VCF_INS|VCF_DEL))                  
#define VCF_HAS_INS(t)     ((t) & VCF_INS)                                      
#define VCF_HAS_DEL(t)     ((t) & VCF_DEL)                                      
#define VCF_HAS_OTHER(t)   ((t) & VCF_OTHER)                                    
#define VCF_HAS_BND(t)     ((t) & VCF_BND)                                      
#define VCF_HAS_OVERLAP(t) ((t) & VCF_OVERLAP)                                  
#define VCF_HAS_ANY(t)     ((t) & VCF_ANY)                                      
                                                                     

I could be convinced also of VCF_IS_DEL if we need actual equality checking, but I remain unconvinced and it would limit our potential expansion in the same way we see already with INDEL. If we did go down that road though, then #define VCF_IS_BND(t) (((t) & VCF_ANY) == VCF_BND) gives us the best way of protecting against additional values being added at a later stage, such as expanded up the different break types. It would ensure code at least doesn't break when linked against newer libraries, but wouldn't solve the case of compiling against newer libraries.

@jkbonfield
Copy link
Contributor

Furthermore, REF is permitted in the filter language, but it does the wrong thing:

    else if (strcmp(type_list[i], "ref") == 0) args->include |= VCF_OTHER<<1;

REF => OTHER. Is that a bug? What does OTHER normally mean? (I see one case of setting OTHER is when ALT is "NON_REF", so clearly there is something not quite right here)

I see a bit more here now. <*> is a REF-only block in gVCF, which is categorised by VCF_OTHER and not REF. Presumably this is because VCF_REF isn't actually a bit so cannot be ORed onto the list of event types. There's text in the spec stating that <*> is preferred over the symbolic allele <NON_REF>, but as far as I can see that's the only place NON_REF occurs anywhere in the spec. I find it baffling that it's implying <NON_REF> and <*> are synonymous given our interpretation is they mean "ref". NON_REF appears to be a GATK invention, but they state it as being used to mark the possibility of an unknown non-reference allele. Matching this with type "ref" seems very counter-intuitive. Both the specification and bcftools documentation need to be much clearer here.

Note there are also other situations where VCF_OTHER can be the variant type, such as here. At a cursory glance this looks to be another form of indel, where the REF and ALT strings are differing in length. Maybe it's meant to be a catch all for ALT being non-ACGT, but I don't quite understand why it's checking the last char (ae/re) matches for some of those checks.

With hindsight it was probably a mistake to use OTHER as the gVCF matches-ref state given the catch-all nature of it. Just as we've split indel up into INS and DEL, I could imagine in future splitting OTHER into the various categories that produce it, only one of which is the <*> smbollic allele. So we need to be cogniscent of this and protect ourselves from future changes here.

@pd3
Copy link
Member Author

pd3 commented Jul 20, 2022

I reviewed all the occurrences and they all are fine except the "unseen" gVCF symbolic star allele (<*> also known as <NON_REF>) in vcfview.c. This makes bcftools view --types ref unable to distinguish between the star allele and other symbolic alleles, such as <CNV>, <INS>, etc. I will open an separate issue for this in bcftools as this behavior is not changed by this pull request which was not intended to make improvements in symbolic alleles.

Unrelated to that, it was suggested that when the old bcf_get_variant_type[s] functions are used, the new bits should not be set. The code would look like this in such case:

@@ -4251,14 +4251,29 @@ static int bcf_set_variant_types(bcf1_t *b)
     return 0;
 }
 
+// For backward compatibility when old bcf_get_variant_type[s] functions were used
+static void delete_new_variant_flags(bcf1_t *rec)
+{
+    int i;
+    for (i=1; i<rec->n_allele; i++) rec->d.var[i].type &= ~(VCF_INS|VCF_DEL);
+    rec->d.var_type &= ~(VCF_INS|VCF_DEL);
+}
 int bcf_get_variant_types(bcf1_t *rec)
 {
-    if ( rec->d.var_type==-1 ) bcf_set_variant_types(rec);
+    if ( rec->d.var_type==-1 )
+    {
+        bcf_set_variant_types(rec);
+        delete_new_variant_flags(rec);
+    }
     return rec->d.var_type & ~(VCF_INS|VCF_DEL);
 }
 int bcf_get_variant_type(bcf1_t *rec, int ith_allele)
 {
-    if ( rec->d.var_type==-1 ) bcf_set_variant_types(rec);
+    if ( rec->d.var_type==-1 )
+    {
+        bcf_set_variant_types(rec);
+        delete_new_variant_flags(rec);
+    }
     return rec->d.var[ith_allele].type & ~(VCF_INS|VCF_DEL);
 }
 inline static int has_variant_type(int type, int bitmask, enum bcf_variant_match mode)

I am not sure if it's worth it. What do people think?

@jkbonfield
Copy link
Contributor

I reviewed all the occurrences and they all are fine except the "unseen" gVCF symbolic star allele (<*> also known as <NON_REF>) in vcfview.c. This makes bcftools view --types ref unable to distinguish between the star allele and other symbolic alleles, such as <CNV>, <INS>, etc. I will open an separate issue for this in bcftools as this behavior is not changed by this pull request which was not intended to make improvements in symbolic alleles.

My point raised in the meeting wasn't that the equality checks were incorrect, but that some of them could have been bit checks instead. Basically anything working on a single allele.

Eg smpl-stats.c:

            int var_type = bcf_get_variant_type(rec, als[j]);
            if ( var_type==VCF_SNP || var_type==VCF_MNP )

This is a single allele (type and not types) so only ever has one bit set. Hence if (var_type & (VCF_SNP | VCF_MNP)) is equivalent. More importantly the if (var_type==VCF_INDEL) below could have been written as if (var_type & VCF_INDEL) also.

As == and & are equivalent for the single allele, and it was the equality check which caused much of the problems with adding INS and DEL type, we should be defensive in our approach and use bit arithmetic in preference whenever possible. Especially when it comes to OTHER as we already know it's likely to get expanded up to support SNVs and the like.

To this end, it's worth adding such commentary into the function documentation so people reading the header files know not to do equality checking unless there is absolutely no other alternative.

@pd3
Copy link
Member Author

pd3 commented Jul 20, 2022

@jkbonfield That's a code in bcftools and it already is dealt with by switching to the new API in the pull request samtools/bcftools#1747

if ( bcf_has_variant_type(rec, als[j], VCF_SNP|VCF_MNP, subset) ) ...

@daviesrob
Copy link
Member

So, I think the conclusion on this was that we're going to go with the bcf_has_variant_type style interface, but it could do with a bit of adjustment:

  • The mode parameter isn't useful for the single-allele bcf_has_variant_type() function, because with a single bit set in the bitmask all modes give the same answer; and with more than one bit set (apart from the INDEL ones) the subset and exact modes will always return false.
  • The symbols exact, overlap, subset risk name clashes. They could do with a prefix, e.g. hts_var_exact etc.
  • bcf_has_variant_type() and bcf_has_variant_types() could fail due to running out of memory, so they need a way of reporting possible errors. They could maybe return -1, or take an int *error on the end which could be set if they fail.
  • Getting at bcf_variant_t::n still involves going directly into the BCF data structures and knowing a lot about how they work. It would be useful to have a bcf_variant_num_bases() function to make this easier.

@daviesrob
Copy link
Member

I've had a go at implementing the suggested changes here, if anyone wants a look - the updates I made are in commit 7c3c29703. The corresponding changes to samtools/bcftools#1747 are in this branch. Looking at what I did there, I think the updated bcf_has_variant_type() API works reasonably well. I'm not quite so sure about bcf_has_variant_types() as the different comparison modes aren't really being used, but that may just be down to the particular use cases that are hit in the code being updated.

jkbonfield referenced this pull request Aug 9, 2022
* Remove `mode` from bcf_has_variant_type() interface

Individual alleles only have a single variant type, so the only
useful mode is the overlap one (bitwise-and).

* Put `bcf_match_` prefix on enumerated values, to avoid name
  clashes

* Make `bitmask` unsigned for more predictable bitwise operations
  (the return value still has to be signed, though).

* Return -1 if bcf_set_variant_types() fails, of if the requested
  allele is not valid.

* Add a bcf_variant_length() function, to more easily access the
  rec->d.var[].n field.

* Be more specific on specifying the mask used to restrict the
  types the old functions return, in case more are added later.

* Improve documentation in the header.
@jkbonfield
Copy link
Contributor

Comments added to commit 7c3c297 directly so they're along side the relevant code.

@jkbonfield
Copy link
Contributor

jkbonfield commented Aug 9, 2022

Regarding @daviesrob's bcftools branch, I see lots of:

int line_type = bcf_get_variant_types(buf->lines[j]);

being replaced with

int line_type = bcf_has_variant_types(buf->lines[j], VCF_ANY, bcf_match_overlap);

If that's how we're using it in practice, and replacing one API with the other, then it begs the question of whether we're correct to deprecate the old bcf_get_variant_types function. If we reimplemented it purely as a static inline to do that call, would it help? It feels like we should be saying the bcf_has_variant_types is the new recommended API, but a more limited bcf_get_variant_types is still supported for convenience, rather than making people jump through hoops.

Edit: Oops I recall the difference now: the old one is actually equivalent to bcf_has_variant_types(buf->lines[j], VCF_INS-1, bcf_match_overlap); as it automatically masks off the two new methods.

* Remove `mode` from bcf_has_variant_type() interface, and add
  a special case for `VCF_REF`

Individual alleles only have a single variant type, so the only
useful mode is the overlap one (bitwise-and).  The exception
is VCF_REF, which is encoded as 0, so has to be tested for by
equality.

* Put `bcf_match_` prefix on enumerated values, to avoid name
  clashes

* Make `bitmask` unsigned for more predictable bitwise operations
  (the return value still has to be signed, though).

* Return -1 if bcf_set_variant_types() fails, of if the requested
  allele is not valid.

As callers using the legacy API won't be checking for a -1 return,
these unfortunately need to be made to call exit(1) on failure.
This is however an improvement on what would have happened under
the same conditions before, which would most likely have been a
NULL pointer dereference.

* Add a bcf_variant_length() function, to more easily access the
  rec->d.var[].n field.

* Be more specific on specifying the mask used to restrict the
  types the old functions return, in case more are added later.

* Improve documentation in the header.
Change rockylinux docker image to rockylinux:9 following
deprecation of rockylinux:latest.

Add perl-FindBin to installation list, as it's now in its own
package.
@daviesrob
Copy link
Member

I've pushed my suggested updates, with adjustments following comments, as a new commit. I've also made a small change to bcf_has_variant_type() - if you set the bitmap argument to 0 (a.k.a. VCF_REF) then it switches to an equality test. That makes it a bit easier to test for the specific VCF_REF case.

@jkbonfield
Copy link
Contributor

Rebased and merged as 8d91938

@jkbonfield jkbonfield closed this Aug 11, 2022
jkbonfield pushed a commit to samtools/bcftools that referenced this pull request Aug 11, 2022
@pd3 pd3 deleted the bt1704-merge-ins-del branch August 15, 2022 07:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants