[FEA] Read Tabix indexed files in read_text #10466

cwharris · 2022-03-20T02:56:37Z

Tabix is used in genomics as a way to index into large files of deflated blocks, such that individual blocks can be inflated without inflating the entire file.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3042176

To support the Tabix use case, we need a way to provide the offsets of deflated blocks within a larger file, which is simply a series of concatenated deflated blocks. We do not need to support anything Tabix-specific, just the ability to inflate specific blocks from within that file. Tabix will provide the offsets; read_text will consume them and inflate the block at each offset. We should also provide the ability to specify a series of blocks to be inflated one after another.
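As a rough host-side illustration of inflating individual blocks by offset, here is a minimal Python sketch. It assumes each block is a complete gzip member (as in BGZF); `inflate_block_at` and `inflate_blocks` are hypothetical names made up for this example and are not part of cuDF or Tabix.

```python
import gzip
import zlib
from typing import Iterable


def inflate_block_at(data: bytes, offset: int) -> bytes:
    """Inflate the single gzip member that starts at `offset` in `data`."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # Decompression stops at the end of the first member; any following
    # blocks are left untouched in decompressor.unused_data.
    inflated = decompressor.decompress(data[offset:])
    if not decompressor.eof:
        raise ValueError(f"truncated block at offset {offset}")
    return inflated


def inflate_blocks(data: bytes, offsets: Iterable[int]) -> bytes:
    """Inflate a series of blocks one after another and concatenate them."""
    return b"".join(inflate_block_at(data, offset) for offset in offsets)


# Build a small file of concatenated deflated blocks (illustrative data).
blocks = [b"chr1\t100\tA\n", b"chr1\t200\tC\n", b"chr2\t300\tG\n"]
compressed = [gzip.compress(block) for block in blocks]
file_bytes = b"".join(compressed)
offsets = [0, len(compressed[0]), len(compressed[0]) + len(compressed[1])]

# Inflate only the second block, without touching the others.
second = inflate_block_at(file_bytes, offsets[1])
```

Only the bytes from the requested offset onward are handed to the decompressor, which is the property the Tabix use case relies on.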

From multibyte_split's perspective, nothing changes. It will continue to call get_next_chunk, which will provide uncompressed bytes. Under the hood, get_next_chunk will be implemented as a data_chunk_source and a new data chunk reader decorator. The data_chunk_source will read the compressed file as usual, but will need to be updated to support random access. The data chunk reader decorator will be given the series of chunk offsets and iterate through them one at a time as required to provide multibyte_split with the requested number of uncompressed bytes. To do this, it will need to inflate and cache each chunk in succession. For inflation, we can probably use the existing gpuinflate code we use for ORC, Avro, and Parquet. For those readers, the blocks are wrapped in metadata. In this case, the metadata comes from the Tabix block offsets. Because we are inflating individual blocks, it is likely not worth waiting for nvcomp's full-file inflate algorithm.
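The decorator idea can be sketched on the host in Python. This is only an illustration under stated assumptions: `InflatingChunkReader` is a hypothetical class, blocks are assumed to be complete gzip members, and the real reader would work on device memory in libcudf rather than on Python `bytes`.

```python
import gzip
import zlib
from typing import Sequence


class InflatingChunkReader:
    """Hypothetical host-side analogue of the proposed reader decorator."""

    def __init__(self, data: bytes, block_offsets: Sequence[int]) -> None:
        self._data = data
        self._offsets = list(block_offsets)
        self._next_block = 0  # index of the next block to inflate
        self._cache = b""     # inflated bytes not yet handed out

    def _inflate_next_block(self) -> None:
        offset = self._offsets[self._next_block]
        self._next_block += 1
        # 16 + MAX_WBITS selects gzip framing; decompression stops at the
        # end of the block that starts at `offset`.
        decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
        self._cache += decompressor.decompress(self._data[offset:])

    def get_next_chunk(self, size: int) -> bytes:
        """Return up to `size` uncompressed bytes; fewer only at end of input."""
        while len(self._cache) < size and self._next_block < len(self._offsets):
            self._inflate_next_block()
        chunk, self._cache = self._cache[:size], self._cache[size:]
        return chunk


# Illustrative data: three blocks whose sizes do not match the chunk size.
blocks = [b"aaaa\n", b"bbbbbb\n", b"cc\n"]
compressed = [gzip.compress(block) for block in blocks]
file_bytes = b"".join(compressed)
offsets = [0, len(compressed[0]), len(compressed[0]) + len(compressed[1])]

reader = InflatingChunkReader(file_bytes, offsets)
chunks = []
while True:
    chunk = reader.get_next_chunk(8)
    if not chunk:
        break
    chunks.append(chunk)
```

The caller only ever sees uncompressed bytes in the sizes it asked for; block boundaries and the leftover cache stay hidden inside the reader, which is why nothing changes from the consumer's side.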


cwharris · 2022-03-20T03:00:58Z

@randerzander fyi

github-actions · 2022-04-21T03:30:51Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

This adds a BGZIP `data_chunk_reader` usable with `multibyte_split`. The BGZIP format is a modified GZIP format that consists of multiple blocks of at most 65536 bytes of compressed data describing at most 65536 bytes of uncompressed data. The data can be accessed with record offsets provided by Tabix index files, which contain so-called virtual offsets (unsigned 64-bit) of the following form:

```
 63                   16      0
+----------------------+-------+
|     block offset     | local |
+----------------------+-------+
```

The lower 16 bits describe the offset inside the uncompressed data belonging to a single compressed block; the upper 48 bits describe the offset of the compressed block inside the BGZIP file. The interface allows two modes: reading a full compressed file, and reading between the locations described by two Tabix virtual offsets. For a description of the BGZIP format, check section 4 in the [SAM specification](https://github.com/samtools/hts-specs/blob/master/SAMv1.pdf).

Closes #10466

## TODO

- [x] Use events to avoid clobbering data that is still in use
- [x] stricter handling of local_begin (currently it may overflow into subsequent blocks)
- [x] add tests where local_begin and local_end are in the same chunk or even block
- [x] ~~add cudf deflate fallback if nvComp doesn't support it~~ this should not be necessary, since we only test with compatible nvcomp versions

Authors:
- Tobias Ribizel (https://github.com/upsj)

Approvers:
- Michael Wang (https://github.com/isVoid)
- Yunsong Wang (https://github.com/PointKernel)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #11652
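For readers unfamiliar with Tabix virtual offsets, the 48/16-bit layout described in the BGZIP reader change above can be unpacked with plain bit operations. This is a minimal Python sketch; the function names `split_virtual_offset` and `make_virtual_offset` are hypothetical and not taken from libcudf or htslib.

```python
from typing import Tuple


def split_virtual_offset(virtual_offset: int) -> Tuple[int, int]:
    """Split a 64-bit virtual offset into (block offset, local offset)."""
    if not 0 <= virtual_offset < (1 << 64):
        raise ValueError("virtual offset must fit in an unsigned 64-bit integer")
    block_offset = virtual_offset >> 16    # upper 48 bits: position in the file
    local_offset = virtual_offset & 0xFFFF  # lower 16 bits: position in the block
    return block_offset, local_offset


def make_virtual_offset(block_offset: int, local_offset: int) -> int:
    """Pack a block offset and a local offset back into a virtual offset."""
    if not 0 <= block_offset < (1 << 48):
        raise ValueError("block offset must fit in 48 bits")
    if not 0 <= local_offset < (1 << 16):
        raise ValueError("local offset must fit in 16 bits")
    return (block_offset << 16) | local_offset


# Illustrative values: a block starting at byte 1234 of the compressed file,
# and a record starting 56 bytes into that block's uncompressed data.
virtual = make_virtual_offset(1234, 56)
block_offset, local_offset = split_virtual_offset(virtual)
```

Reading between two virtual offsets then amounts to inflating the blocks from the first block offset through the last, and trimming the result using the two local offsets.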

cwharris added feature request New feature or request Needs Triage Need team to review and classify labels Mar 20, 2022

cwharris added pandas libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue Python Affects Python cuDF API. and removed pandas libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. labels Mar 20, 2022

cwharris changed the title ~~[FEA] Need ability to read Tabix indexed files in read_text~~ [FEA] Read Tabix indexed files in read_text Mar 22, 2022

github-actions bot added the inactive-30d label Apr 21, 2022

GregoryKimball removed the Needs Triage Need team to review and classify label Jun 28, 2022

GregoryKimball added this to the Genomics read_text support milestone Jul 2, 2022

upsj self-assigned this Aug 17, 2022

upsj mentioned this issue Sep 5, 2022

Add BGZIP data_chunk_reader #11652

Merged

7 tasks

rapids-bot bot closed this as completed in #11652 Sep 27, 2022
