
Make tabix support CSI indices with large positions. #1506

Merged — 1 commit, Sep 9, 2022

Conversation

jkbonfield (Contributor)
This already worked for SAM and VCF where the SQ and Contig lines indicate the maximum length of a reference sequence. However for BED files this was left as zero, which had the effect of fighting against the user by decreasing n_lvls as we increase min_shift.

When unknown, max_ref_len is now an arbitrary large size (100G), but this may produce more levels than are strictly necessary, although this doesn't appear to have negative consequences. For completeness however, the tabix command line now also permits this to be specified explicitly with --max-ref-len.

Also fixed the misleading error message about CSI being unable to index the data. This was perhaps intended for mis-specified VCF data where a contig was listed as small but the records were at larger offsets; however, it simply led me up the garden path by categorically stating that CSI cannot store such large values.

htslib/tbx.h Outdated
@@ -44,6 +44,7 @@ typedef struct tbx_conf_t {
int32_t preset;
int32_t sc, bc, ec; // seq col., beg col. and end col.
int32_t meta_char, line_skip;
int64_t max_ref_len;
Member

Unfortunately this likely breaks the ABI, especially as tbx_conf_t is directly bunged into tbx_t below so expanding it will cause all the other tbx_t members to move.

The simplest solution here might be to remove the part that lets you set the maximum length, and just go with the hard-coding to 100G for now. (The much more mind-boggling solution is to work out how to make hts_idx_push() expand n_lvls itself when the new item doesn't fit. This is probably a twenty pipe problem.)


Is this 100G the maximum length of a single sequence, or of the total assembly? The largest genome known so far is around 150 Gbp (Paris japonica).

jkbonfield (Author)

It's the largest length of an individual chromosome.

I thought about the ABI, but concluded it's just extending a structure that isn't likely to be used in an array. Clearly I didn't look very hard, as it's embedded at the start of another struct all of two lines lower!

It feels a bit poor to replace one arbitrary limit with another arbitrary limit, but I suppose it's fine if it's large enough and we don't believe there are negative impacts from having it too large. What does it actually do? Maybe nothing, if the extra levels never get used because the actual length doesn't require them, but I'm hazy on this, which worries me.

I can cull it back to a more minimal PR though if you don't wish to bump the ABI.

Member

I think I'd prefer not to break the ABI over this.

@jkbonfield (Author)

Following review, I've culled the configuration code; the maximum reference length is now hard-coded at 100G.

@daviesrob daviesrob merged commit 6366029 into samtools:develop Sep 9, 2022
@daviesrob (Member)

I think it'll do for now. It may be possible to come up with something more sophisticated later.

@jkbonfield (Author)

Happy with that way of thinking. This solves an immediate problem at least and (hopefully) doesn't create new ones.

I tested both old and new on a small BED file, and the index was the same size, so the extra levels don't appear to matter in practice.

@muffato commented Sep 9, 2022

Thank you very much!
