
Make tabix support CSI indices with large positions. #1506

Merged — 1 commit, Sep 9, 2022

Conversation

jkbonfield (Contributor)
This already worked for SAM and VCF where the SQ and Contig lines indicate the maximum length of a reference sequence. However for BED files this was left as zero, which had the effect of fighting against the user by decreasing n_lvls as we increase min_shift.

When unknown, max_ref_len is now an arbitrary large size (100G), but this may produce more levels than are strictly necessary, although this doesn't appear to have negative consequences. For completeness however, the tabix command line now also permits this to be specified explicitly with --max-ref-len.

Also fixed the misleading error message about CSI being unable to index the data. This was perhaps intended for mis-specified VCF data where a contig was listed as small but the records were at larger offsets; however, it simply led me up the garden path by categorically stating that CSI cannot store such large values.

htslib/tbx.h Outdated
@@ -44,6 +44,7 @@ typedef struct tbx_conf_t {
int32_t preset;
int32_t sc, bc, ec; // seq col., beg col. and end col.
int32_t meta_char, line_skip;
int64_t max_ref_len;
Member

Unfortunately this likely breaks the ABI, especially as tbx_conf_t is directly bunged into tbx_t below so expanding it will cause all the other tbx_t members to move.

The simplest solution here might be to remove the part that lets you set the maximum length, and just go with the hard-coding to 100G for now. (The much more mind-boggling solution is to work out how to make hts_idx_push() expand n_lvls itself when the new item doesn't fit. This is probably a twenty pipe problem.)


Is this 100G the maximum length of a single sequence, or of the total assembly? The largest genome known so far is around 150 Gbp (Paris japonica).

jkbonfield (Author)

It's the largest length of an individual chromosome.

I thought about the ABI, but concluded it's just extending a structure that isn't likely to be used in an array. Clearly I didn't look very hard, as it's embedded at the start of another struct all of two lines lower!

It feels a bit poor to replace one arbitrary limit with another arbitrary limit, but I suppose it's fine if it's large enough and we don't believe there are negative impacts from having it too large. What does it actually do? Maybe nothing, if the extra levels never get used because the actual length doesn't require them, but I'm hazy on this, which worries me.

I can cull it back to a more minimal PR though if you don't wish to bump the ABI.

Member

I think I'd prefer not to break the ABI over this.

@jkbonfield (Author)

Following review, I've culled the configuration code; the maximum reference length is now hard-coded at 100G.

@daviesrob daviesrob merged commit 6366029 into samtools:develop Sep 9, 2022
@daviesrob (Member)

I think it'll do for now. It may be possible to come up with something more sophisticated later.

@jkbonfield (Author)

Happy with that way of thinking. This solves an immediate problem at least and (hopefully) doesn't create new ones.

I tested both old and new on a small BED file, and the index was the same size, so the extra levels don't appear to matter in practice.

@muffato commented Sep 9, 2022

Thank you very much!
