Speed up of VCF parsing / writing with FORMAT fields. #1081

jkbonfield · 2020-05-28T10:28:58Z

If we're reading and writing VCF without intervening change to BCF or
use of bcf_unpack with BCF_UN_FMT (so no querying and/or modification
of the per-sample FORMAT columns), then we keep the data in string
form instead of binary bcf form.

The bcf binary form is held in kstring b->indiv. We also hold the
string format there too, but use b->unpacked & VCF_UN_FMT as a
bit-flag to indicate whether this is encoding as VCF or BCF.

The benefit is reading is considerably faster for many-sample files
(~4 fold for a 2.5k sample 1000 genomes test) and similarly (~3 fold)
for a read/write loop to VCF as we don't have to re-encode the data
from uBCF to VCF either.

If we're converting to BCF, or doing a query, we still need the call
to vcf_parse_format as before, which reformats b->indiv from VCF to
BCF internally. This is done on the fly.

There is some nastiness to how this works sadly as bcf_unpack just
takes the bcf record and has no access to the header, but this is
needed for parsing. Therefore we stash a copy of it in the kstring
along with the text itself, but this assumes that no one is doing
header oddities such as reading the file, freeing the header, and
then attempting to unpack the contents. It feels like a safe
assumption, although it's not something that has ever been explicitly
spelt out before.

Note this causes a few bcftools test failures due to fixups that
happen when doing a full conversion to BCF. For example
However this now causes a test failure. Eg test/noroundtrip.vcf has
FORMAT of "GT:GT 0/1" but if we force into BCF and back out again we
get "GT:GT 2,4:.".

jmarshall · 2020-05-28T11:00:28Z

The last paragraph of the commit message got a bit mixed up.

What is the reason for the test/test.pl change? If it's to add a test case involving going via an actual .bcf file, perhaps better to add it as an additional test case alongside rather than removing the existing test case?

jkbonfield · 2020-05-28T11:15:44Z

Oops. I've been hitting my head against getting this PR working given the nightmare of a PR against a PR that kept changing (not something I could avoid sadly), until Rob reminded me about cherry-pick, so my brain was probably fried :-)

The test change is because (I think) that test is there specifically to check the impact of the changes made to partially specified FORMAT fields. They're fixed up with missing values. That doesn't happen any more if you leave things purely in VCF format as it doesn't even parse that data. Hence the change is to force it to go via BCF, in order to still trigger and exercise that part of the code.

(Ie if the existing test was left in place, we'd have to label it as a known failure, which feels unpleasant.)

If we're reading and writing VCF without intervening change to BCF or use of bcf_unpack with BCF_UN_FMT (so no querying and/or modification of the per-sample FORMAT columns), then we keep the data in string form instead of binary bcf form. The bcf binary form is held in kstring b->indiv. We also hold the string format there too, but use b->unpacked & VCF_UN_FMT as a bit-flag to indicate whether this is encoding as VCF or BCF. The benefit is reading is considerably faster for many-sample files (~4 fold for a 2.5k sample 1000 genomes test) and similarly (~3 fold) for a read/write loop to VCF as we don't have to re-encode the data from uBCF to VCF either. If we're converting to BCF, or doing a query, we still need the call to vcf_parse_format as before, which reformats b->indiv from VCF to BCF internally. This is done on the fly. There is some nastiness to how this works sadly as bcf_unpack just takes the bcf record and has no access to the header, but this is needed for parsing. Therefore we stash a copy of it in the kstring along with the text itself, but this assumes that no one is doing header oddities such as reading the file, freeing the header, and then attempting to unpack the contents. It feels like a safe assumption, although it's not something that has ever been explicitly spelt out before. Note this causes a few bcftools test failures due to fixups that happen when doing a full conversion to BCF. A similar case happens here in test/noroundtrip.vcf, which has an underspecified FORMAT of "GT:GT 0/1". The test checks it comes out as "GT:GT 2,4:.". This won't happen with pure VCF as we're not parsing that data, but the test fix is to force conversion to BCF and back again to test that code path. A similar fix would need to be made to the failing bcftools tests.

jmarshall · 2020-05-28T11:21:23Z

Fair enough, and my confusion is now fended off by the fixed up commit message paragraph 😄

jkbonfield · 2020-05-28T11:24:39Z

No problem and thanks for pointing out the errant commit message.

jkbonfield force-pushed the vcf_parser_jkb_rebased branch from 41a4c87 to a482b91 Compare May 28, 2020 11:19

jkbonfield mentioned this pull request Jul 12, 2023

Speed up the VCF parser #1644

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up of VCF parsing / writing with FORMAT fields. #1081

Speed up of VCF parsing / writing with FORMAT fields. #1081

jkbonfield commented May 28, 2020

jmarshall commented May 28, 2020

jkbonfield commented May 28, 2020 •

edited

Loading

jmarshall commented May 28, 2020

jkbonfield commented May 28, 2020

Speed up of VCF parsing / writing with FORMAT fields. #1081

Are you sure you want to change the base?

Speed up of VCF parsing / writing with FORMAT fields. #1081

Conversation

jkbonfield commented May 28, 2020

jmarshall commented May 28, 2020

jkbonfield commented May 28, 2020 • edited Loading

jmarshall commented May 28, 2020

jkbonfield commented May 28, 2020

jkbonfield commented May 28, 2020 •

edited

Loading