Make tabix support CSI indices with large positions. #1506
Conversation
htslib/tbx.h
Outdated
@@ -44,6 +44,7 @@ typedef struct tbx_conf_t {
     int32_t preset;
     int32_t sc, bc, ec; // seq col., beg col. and end col.
     int32_t meta_char, line_skip;
+    int64_t max_ref_len;
Unfortunately this likely breaks the ABI, especially as tbx_conf_t is directly bunged into tbx_t below, so expanding it will cause all the other tbx_t members to move.

The simplest solution here might be to remove the part that lets you set the maximum length, and just go with the hard-coding to 100G for now. (The much more mind-boggling solution is to work out how to make hts_idx_push() expand n_lvls itself when the new item doesn't fit. This is probably a twenty pipe problem.)
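For context, an abridged sketch of the two structs involved (simplified from tbx.h; not the full headers, and member details may differ between versions). Because tbx_conf_t is embedded by value at the start of tbx_t, appending a field to it shifts the offsets of every later tbx_t member:

```c
/* Abridged sketch only: tbx_conf_t is embedded by value at the start of
 * tbx_t, so growing the former moves the latter's members. */
typedef struct tbx_conf_t {
    int32_t preset;
    int32_t sc, bc, ec;           // seq col., beg col. and end col.
    int32_t meta_char, line_skip;
    /* adding "int64_t max_ref_len;" here grows sizeof(tbx_conf_t)... */
} tbx_conf_t;

typedef struct tbx_t {
    tbx_conf_t conf;              // embedded by value, not via pointer
    struct hts_idx_t *idx;        // ...so these offsets change, breaking
    void *dict;                   // binaries built against the old layout
} tbx_t;
```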
Is this 100G the maximum length of a sequence or the total assembly? The largest genome known so far is around 150 Gbp (Paris japonica).
It's the largest length of an individual chromosome.

I thought about the ABI, but concluded it's just extending a structure which isn't likely to be used in an array. Clearly I didn't look very hard, as it's embedded at the start of another struct all of two lines lower!

It feels a bit poor to replace one arbitrary limit with another arbitrary limit, but I guess it's acceptable if it's large enough and we don't believe there are negative impacts of having it too large. What does it actually do? Maybe nothing, if the extra levels never get used because the actual length doesn't require them, but I'm hazy on this, which worries me (the level arithmetic is sketched below).

I can cull it back to a more minimal PR though if you don't wish to bump the ABI.
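For reference, a minimal sketch of the CSI level arithmetic being discussed. The helper name is made up for illustration and is not an htslib function; the point is simply that the largest addressable position is roughly 1 << (min_shift + 3*n_lvls), so covering ~100 Gbp from the usual min_shift of 14 takes 8 levels:

```c
#include <stdint.h>

/* Hypothetical helper (not htslib API): each extra level widens the
 * top-level bin span by a factor of 8 (2^3), so the largest position an
 * index can address is about 1 << (min_shift + 3*n_lvls). */
static int levels_needed(int64_t max_ref_len, int min_shift)
{
    int n_lvls = 0;
    int64_t span = (int64_t)1 << min_shift;   // span of one bottom-level bin
    while (span < max_ref_len) {
        span <<= 3;                           // one more level: 8x the span
        ++n_lvls;
    }
    return n_lvls;
}

/* Examples:
 *   levels_needed(1LL << 29, 14)      == 5   // the classic 512 Mbp limit
 *   levels_needed(100000000000LL, 14) == 8   // ~100 Gbp needs 8 levels
 */
```

Presumably bins that never receive a record are simply not written, which would match the later observation that the index size did not change on a small BED file.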
I think I'd prefer not to break the ABI over this.
This already worked for SAM and VCF, where the SQ and contig lines indicate the maximum length of a reference sequence. However, for BED files this was left as zero, which had the effect of fighting against the user by decreasing n_lvls as we increase min_shift. When unknown, max_ref_len is now an arbitrary large size (100G), but this may produce more levels than are strictly necessary, although this doesn't appear to have negative consequences.

Also fixed the misleading error message about CSI being unable to index data. This was perhaps intended for mis-specified VCF data, where a contig was listed as small but the records were at larger offsets; however, it simply led me up the garden path by categorically stating that CSI cannot store such large values.
Force-pushed from c94bc6f to 6366029.
Following review, I've culled the configuration code and it's now hard-coded at 100GB max ref len.
I think it'll do for now. It may be possible to come up with something more sophisticated later.
Happy with that way of thinking. This solves an immediate problem at least and (hopefully) doesn't create new ones. I tested both old and new on a small bed file and the index was the same size, so the extra levels don't appear to matter in practice.
Thank you very much!
This already worked for SAM and VCF, where the SQ and contig lines indicate the maximum length of a reference sequence. However, for BED files this was left as zero, which had the effect of fighting against the user by decreasing n_lvls as we increase min_shift.

When unknown, max_ref_len is now an arbitrary large size (100G), but this may produce more levels than are strictly necessary, although this doesn't appear to have negative consequences. For completeness, however, the tabix command line now also permits this to be specified explicitly with --max-ref-len.

Also fixed the misleading error message about CSI being unable to index data. This was perhaps intended for mis-specified VCF data, where a contig was listed as small but the records were at larger offsets; however, it simply led me up the garden path by categorically stating that CSI cannot store such large values.
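In outline, the change amounts to substituting a very large default when the maximum reference length is unknown, before deriving the number of levels handed to hts_idx_init(). A hedged sketch of that idea follows; the helper name and the way the fallback is wired in are illustrative rather than the actual htslib code:

```c
#include <stdint.h>
#include "htslib/hts.h"

/* Illustrative only: turn a possibly-unknown maximum reference length into
 * the n_lvls argument of hts_idx_init(). The ~100 Gbp fallback mirrors the
 * hard-coded default described above; the helper itself is not htslib API. */
static hts_idx_t *init_csi_index(int nseq, int min_shift, int64_t max_ref_len)
{
    if (max_ref_len <= 0)
        max_ref_len = 100000000000LL;       // unknown (e.g. BED): assume ~100 Gbp

    int n_lvls = 0;
    for (int64_t span = (int64_t)1 << min_shift; span < max_ref_len; span <<= 3)
        ++n_lvls;                           // enough levels to cover max_ref_len

    // offset0 is the virtual file offset where indexed data starts; 0 for brevity.
    return hts_idx_init(nseq, HTS_FMT_CSI, 0, min_shift, n_lvls);
}
```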