Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up of VCF parsing / writing with FORMAT fields. #1081

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Commits on May 28, 2020

  1. Speed up of VCF parsing / writing with FORMAT fields.

    If we're reading and writing VCF without intervening change to BCF or
    use of bcf_unpack with BCF_UN_FMT (so no querying and/or modification
    of the per-sample FORMAT columns), then we keep the data in string
    form instead of binary bcf form.
    
    The bcf binary form is held in kstring b->indiv.  We also hold the
    string format there too, but use b->unpacked & VCF_UN_FMT as a
    bit-flag to indicate whether this is encoding as VCF or BCF.
    
    The benefit is reading is considerably faster for many-sample files
    (~4 fold for a 2.5k sample 1000 genomes test) and similarly (~3 fold)
    for a read/write loop to VCF as we don't have to re-encode the data
    from uBCF to VCF either.
    
    If we're converting to BCF, or doing a query, we still need the call
    to vcf_parse_format as before, which reformats b->indiv from VCF to
    BCF internally.  This is done on the fly.
    
    There is some nastiness to how this works sadly as bcf_unpack just
    takes the bcf record and has no access to the header, but this is
    needed for parsing.  Therefore we stash a copy of it in the kstring
    along with the text itself, but this assumes that no one is doing
    header oddities such as reading the file, freeing the header, and
    then attempting to unpack the contents.  It feels like a safe
    assumption, although it's not something that has ever been explicitly
    spelt out before.
    
    Note this causes a few bcftools test failures due to fixups that
    happen when doing a full conversion to BCF.  A similar case happens
    here in test/noroundtrip.vcf, which has an underspecified FORMAT of
    "GT:GT 0/1".  The test checks it comes out as "GT:GT 2,4:.".  This
    won't happen with pure VCF as we're not parsing that data, but the
    test fix is to force conversion to BCF and back again to test that
    code path.  A similar fix would need to be made to the failing
    bcftools tests.
    jkbonfield committed May 28, 2020
    Configuration menu
    Copy the full SHA
    a482b91 View commit details
    Browse the repository at this point in the history