-
Notifications
You must be signed in to change notification settings - Fork 444
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Speed up of VCF parsing / writing with FORMAT fields.
If we're reading and writing VCF without intervening change to BCF or use of bcf_unpack with BCF_UN_FMT (so no querying and/or modification of the per-sample FORMAT columns), then we keep the data in string form instead of binary bcf form. The bcf binary form is held in kstring b->indiv. We also hold the string format there too, but use b->unpacked & VCF_UN_FMT as a bit-flag to indicate whether this is encoding as VCF or BCF. The benefit is reading is considerably faster for many-sample files (~4 fold for a 2.5k sample 1000 genomes test) and similarly (~3 fold) for a read/write loop to VCF as we don't have to re-encode the data from uBCF to VCF either. If we're converting to BCF, or doing a query, we still need the call to vcf_parse_format as before, which reformats b->indiv from VCF to BCF internally. This is done on the fly. There is some nastiness to how this works sadly as bcf_unpack just takes the bcf record and has no access to the header, but this is needed for parsing. Therefore we stash a copy of it in the kstring along with the text itself, but this assumes that no one is doing header oddities such as reading the file, freeing the header, and then attempting to unpack the contents. It feels like a safe assumption, although it's not something that has ever been explicitly spelt out before. Note this causes a few bcftools test failures due to fixups that happen when doing a full conversion to BCF. A similar case happens here in test/noroundtrip.vcf, which has an underspecified FORMAT of "GT:GT 0/1". The test checks it comes out as "GT:GT 2,4:.". This won't happen with pure VCF as we're not parsing that data, but the test fix is to force conversion to BCF and back again to test that code path. A similar fix would need to be made to the failing bcftools tests.
- Loading branch information
1 parent
9de45b7
commit a482b91
Showing
3 changed files
with
136 additions
and
41 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters