[FEA] Read Tabix indexed files in read_text #10466

Closed
cwharris opened this issue Mar 20, 2022 · 2 comments · Fixed by #11652
Labels
cuIO cuIO issue feature request New feature or request

Comments

@cwharris
Contributor

Tabix is used in genomics as a way to index into large files of deflated blocks, such that individual blocks can be inflated without inflating the entire file.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3042176

To support the Tabix use case, we need a way to provide the offsets of deflated blocks within the context of a larger file. The larger file is simply a series of concatenated deflated blocks. We do not need to support anything Tabix specific, just the ability to inflate specific blocks from within a larger file of concatenated blocks. Tabix will provide the offsets; read_text will consume those offsets and inflate the block at each offset. We should also provide the ability to specify a series of blocks to be inflated one after another.
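As a rough CPU-side illustration of the requirement above (not the libcudf implementation), the sketch below builds a file of concatenated gzip members and inflates only the members at chosen offsets. It assumes each block is a complete gzip member, as in BGZIP; the helper names `build_concatenated_blocks`, `inflate_block_at`, and `inflate_blocks` are hypothetical.

```python
import gzip
import zlib


def build_concatenated_blocks(payloads):
    """Compress each payload as its own gzip member and record each member's start offset."""
    blob = bytearray()
    offsets = []
    for payload in payloads:
        offsets.append(len(blob))
        blob += gzip.compress(payload)
    return bytes(blob), offsets


def inflate_block_at(blob, offset):
    """Inflate exactly one gzip member starting at `offset`; later members are ignored."""
    # wbits=31 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(wbits=31)
    # Decompression stops at the end of the first member; the rest lands in unused_data.
    return decompressor.decompress(blob[offset:])


def inflate_blocks(blob, offsets):
    """Inflate a series of blocks one after another, in the order given."""
    return b"".join(inflate_block_at(blob, offset) for offset in offsets)
```

In this sketch the offsets come from the helper that wrote the file; in the Tabix use case they would come from the index instead.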

From multibyte_split's perspective, nothing changes. It will continue to call get_next_chunk, which will provide uncompressed bytes. Under the hood, get_next_chunk will be implemented as a data_chunk_source and a new data chunk reader decorator. The data_chunk_source will read the compressed file as usual, but will need to be updated to support random access. The data chunk reader decorator will be given the series of chunk offsets and iterate through them one at a time as required to provide multibyte_split with the requested number of uncompressed bytes. To do this, it will need to inflate and cache each chunk in succession. For inflation, we can probably use the existing gpuinflate code we use for orc, avro, and parquet. For those readers, the blocks are wrapped in metadata. In this case, the metadata comes from the Tabix block offsets. Because we are inflating individual blocks, it is likely not worth waiting for nvcomp's full-file inflate algorithm.
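The decorator idea can be sketched on the CPU as follows. This is only an illustration under stated assumptions: the class name `InflatingChunkReader`, its constructor arguments, and the `make_demo_source` helper are hypothetical, the source is any seekable binary stream, and `zlib` stands in for the GPU inflate path. The only name borrowed from the proposal is `get_next_chunk`.

```python
import gzip
import io
import zlib


class InflatingChunkReader:
    """Hypothetical decorator: wraps a seekable compressed source, serves uncompressed bytes."""

    def __init__(self, source, block_offsets):
        self._source = source                # seekable binary file-like object
        self._offsets = list(block_offsets)  # compressed block offsets, in read order
        self._next_block = 0                 # index of the next block to inflate
        self._cache = b""                    # inflated bytes not yet handed out

    def _inflate_next_block(self):
        """Seek to the next block offset and inflate that single gzip member."""
        self._source.seek(self._offsets[self._next_block])
        self._next_block += 1
        decompressor = zlib.decompressobj(wbits=31)  # 31 = gzip header and trailer
        inflated = bytearray()
        while not decompressor.eof:
            piece = self._source.read(4096)
            if not piece:
                raise ValueError("compressed block is truncated")
            inflated += decompressor.decompress(piece)
        return bytes(inflated)

    def get_next_chunk(self, size):
        """Return up to `size` uncompressed bytes, inflating more blocks only as needed."""
        while len(self._cache) < size and self._next_block < len(self._offsets):
            self._cache += self._inflate_next_block()
        chunk, self._cache = self._cache[:size], self._cache[size:]
        return chunk


def make_demo_source(payloads):
    """Build an in-memory file of concatenated gzip members plus their offsets."""
    members = [gzip.compress(payload) for payload in payloads]
    offsets = []
    position = 0
    for member in members:
        offsets.append(position)
        position += len(member)
    return io.BytesIO(b"".join(members)), offsets
```

The caller never sees block boundaries: a request that spans two blocks is served from the cache after both have been inflated, which is the property multibyte_split relies on.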

@cwharris cwharris added feature request New feature or request Needs Triage Need team to review and classify labels Mar 20, 2022
@cwharris cwharris added pandas libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue Python Affects Python cuDF API. and removed pandas libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. labels Mar 20, 2022
@cwharris
Contributor Author

@randerzander fyi

@cwharris cwharris changed the title [FEA] Need ability to read Tabix indexed files in read_text [FEA] Read Tabix indexed files in read_text Mar 22, 2022
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@GregoryKimball GregoryKimball removed the Needs Triage Need team to review and classify label Jun 28, 2022
@upsj upsj self-assigned this Aug 17, 2022
@upsj upsj mentioned this issue Sep 5, 2022
7 tasks
rapids-bot bot pushed a commit that referenced this issue Sep 27, 2022
This adds a BGZIP `data_chunk_reader` usable with `multibyte_split`. The BGZIP format is a modified GZIP format that consists of multiple blocks, each holding at most 65536 bytes of compressed data that describe at most 65536 bytes of uncompressed data. The data can be accessed with record offsets provided by Tabix index files, which contain so-called virtual offsets (unsigned 64 bit) of the following form:
```
63                    16       0
+----------------------+-------+
|      block offset    | local |
+----------------------+-------+
```
The lower 16 bits describe the offset inside the uncompressed data belonging to a single compressed block; the upper 48 bits describe the offset of the compressed block inside the BGZIP file. The interface allows two modes: reading a full compressed file, and reading between the locations described by two Tabix virtual offsets.
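The bit layout above can be unpacked with a shift and a mask. The following is a minimal sketch of that arithmetic; the function names `split_virtual_offset` and `make_virtual_offset` are hypothetical and are not part of the cuDF API.

```python
def split_virtual_offset(virtual_offset):
    """Split a 64-bit Tabix virtual offset into (block_offset, local_offset)."""
    block_offset = virtual_offset >> 16     # upper 48 bits: position of the compressed block
    local_offset = virtual_offset & 0xFFFF  # lower 16 bits: position inside the inflated block
    return block_offset, local_offset


def make_virtual_offset(block_offset, local_offset):
    """Pack a compressed block offset and a local offset into one virtual offset."""
    if not 0 <= block_offset < (1 << 48):
        raise ValueError("block offset must fit in 48 bits")
    if not 0 <= local_offset < (1 << 16):
        raise ValueError("local offset must fit in 16 bits")
    return (block_offset << 16) | local_offset
```

For example, a record that starts 42 bytes into the uncompressed data of the block at file offset 1000 has the virtual offset `1000 * 65536 + 42`.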

For a description of the BGZIP format, check section 4 in the [SAM specification](https://github.com/samtools/hts-specs/blob/master/SAMv1.pdf).

Closes #10466 

## TODO
- [x] Use events to avoid clobbering data that is still in use
- [x] Stricter handling of local_begin (currently it may overflow into subsequent blocks)
- [x] Add tests where local_begin and local_end are in the same chunk or even block
- [x] ~~add cudf deflate fallback if nvComp doesn't support it~~ this should not be necessary, since we only test with compatible nvcomp versions

Authors:
  - Tobias Ribizel (https://github.com/upsj)

Approvers:
  - Michael Wang (https://github.com/isVoid)
  - Yunsong Wang (https://github.com/PointKernel)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #11652