[FEA] Read Tabix indexed files in read_text #10466
@randerzander fyi
This adds a BGZIP `data_chunk_reader` usable with `multibyte_split`. The BGZIP format is a modified GZIP format consisting of multiple blocks of at most 65536 bytes of compressed data, each describing at most 65536 bytes of uncompressed data. The data can be accessed with record offsets provided by Tabix index files, which contain so-called virtual offsets (unsigned 64 bit) of the following form:

```
63                   16       0
+----------------------+-------+
|     block offset     | local |
+----------------------+-------+
```

The lower 16 bits describe the offset inside the uncompressed data belonging to a single compressed block; the upper 48 bits describe the offset of the compressed block inside the BGZIP file. The interface allows two modes: reading a full compressed file, and reading between the locations described by two Tabix virtual offsets. For a description of the BGZIP format, see section 4 of the [SAM specification](https://github.com/samtools/hts-specs/blob/master/SAMv1.pdf).

Closes #10466

## TODO
- [x] Use events to avoid clobbering data that is still in use
- [x] Stricter handling of `local_begin` (currently it may overflow into subsequent blocks)
- [x] Add tests where `local_begin` and `local_end` are in the same chunk or even block
- [x] ~~Add cudf deflate fallback if nvComp doesn't support it~~ this should not be necessary, since we only test with compatible nvcomp versions

Authors:
- Tobias Ribizel (https://github.com/upsj)

Approvers:
- Michael Wang (https://github.com/isVoid)
- Yunsong Wang (https://github.com/PointKernel)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #11652
Tabix is used in genomics as a way to index into large files of deflated blocks, such that individual blocks can be inflated without inflating the entire file.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3042176
To support the Tabix use case, we need a way to provide the offsets of deflated blocks within the context of a larger file. The larger file is simply a series of concatenated deflated blocks. We do not need to support anything Tabix specific, just the ability to inflate specific blocks from within a larger file of concatenated blocks. Tabix will provide the offsets, and `read_text` will consume those offsets and inflate the block at each offset. We should also provide the ability to specify a series of blocks to be inflated one after another.
From `multibyte_split`'s perspective, nothing changes. It will continue to call `get_next_chunk`, which will provide uncompressed bytes. Under the hood, `get_next_chunk` will be implemented as a `data_chunk_source` and a new data chunk reader decorator. The `data_chunk_source` will read the compressed file as usual, but will need to be updated to support random access. The data chunk reader decorator will be given the series of chunk offsets and iterate through them one at a time as required to provide `multibyte_split` with the requested number of uncompressed bytes. To do this, it will need to inflate and cache each chunk in succession. For inflation, we can probably use the existing `gpuinflate` code we use for ORC, Avro, and Parquet. For those readers, the blocks are wrapped in metadata; in this case, the metadata comes from the Tabix block offsets. Because we are inflating individual blocks, it is likely not worth waiting for nvcomp's full-file inflate algorithm.