[FEA] Support multi-character newline delimiters with a read_text API #8557
cc @elstehle, @davidwendt

How large are the files we'd expect to see with these file formats?

"It depends" :) Genomes are a few GB raw, and can compress pretty well, down to a few MB. But I'd say a great many analyses call for work across multiple files, so data volumes can get large quickly. This implies the need for: cc @VibhuJawa
Adds `multibyte_split` API, part of #8557. Takes one large text input and splits it into a single strings column.

- Features:
  - [x] split on multi-byte delimiters
  - [x] split on multiple delimiters simultaneously
  - [ ] erase delimiters from output (will implement later)
  - [ ] replace delimiters with alternate text (will implement later)
- Supported input types:
  - [x] `cudf::io::text::data_chunk_source`
  - [x] `cudf::string_scalar` via `cudf::device_span`
  - [x] `std::string` via `std::istream`
  - [x] files via `std::istream`
- Supported delimiter type:
  - [x] `std::string`
- Performance goals:
  - [x] ~2 GB/s from file, ~4 GB/s on-device. There is room for improvement, but performance is good enough for now.
- Additional goals:
  - [x] add reusable block-level pattern-matching utility
  - [ ] add reusable block-level utility to "peek" at "future" scan states (will implement with delimiter erasure)

Authors:
- Christopher Harris (https://github.com/cwharris)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- AJ Schmidt (https://github.com/ajschmidt8)
- Elias Stehle (https://github.com/elstehle)
- Vukasin Milovanovic (https://github.com/vuule)
- Devavret Makkar (https://github.com/devavret)
- Jake Hemstad (https://github.com/jrhemstad)

URL: #8702
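The splitting semantics described in the PR (multiple multi-byte delimiters, with delimiters retained in the output since erasure is listed as a later feature) can be sketched as a CPU reference in plain Python. This is an illustrative sketch only, not the CUDA implementation; the function name `multibyte_split_ref` is hypothetical.

```python
def multibyte_split_ref(text, delimiters):
    """CPU reference sketch: split `text` into records, where each record
    ends with whichever delimiter terminated it. Delimiters are kept in
    the output, mirroring the PR's note that erasure comes later."""
    records = []
    start = i = 0
    while i < len(text):
        for d in delimiters:
            if d and text.startswith(d, i):
                i += len(d)                    # consume the delimiter
                records.append(text[start:i])  # record includes delimiter
                start = i
                break
        else:
            i += 1
    if start < len(text):
        records.append(text[start:])  # trailing record with no delimiter
    return records
```

For example, `multibyte_split_ref("a:;b::c", [":;", "::"])` returns `['a:;', 'b::', 'c']`, splitting on two different delimiters simultaneously.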
Update: With #8702 merged in, I believe we now need Python bindings.
Yes, I will update those bindings right now if they need it, and then change the PR from WIP to ready for review. A few things, such as multiple inputs, need to be removed from the bindings: @cwharris made modifications that forbid multiple inputs, so the bindings need to change accordingly.
Provides the Python/Cython bindings for #8702 `multibyte_split`. This PR depends on #8702 being merged first. Closes #8557.

Authors:
- Jeremy Dyer (https://github.com/jdye64)
- Christopher Harris (https://github.com/cwharris)

Approvers:
- https://github.com/nvdbaranec
- Vyas Ramasubramani (https://github.com/vyasr)
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #8998
There are many semi-structured text file formats which require domain specific parsing logic to load into a DataFrame before analysis can be performed. For example, in genomics there are FASTA and VCF formats, each with various Python libraries that ease parsing into Pandas DataFrames.
In cuDF, we have quite a few string functions we can use to implement custom parsing logic for text formats, but there's an initial difficulty in loading the raw text into rows in a DataFrame. It's possible to use `read_csv` with a custom newline delimiter, but that's limited to a single character, and there's significant overhead in using `read_csv` when the files aren't actually CSVs.
It would be useful to have a `read_text` API which loads text into rows but supports splitting records on a multi-character delimiter. Borrowing from the FASTA file example, something like

```python
cudf.read_text('fasta.example', delim='*\n')
```

would result in a single string column with 3 records. It's possible to do this parsing on the CPU and construct a `cudf.DataFrame` from objects in host memory, but that also has performance penalties.
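The proposed behavior can be approximated on the CPU with plain Python, which also illustrates the performance concern: everything happens in host memory before any device transfer. This is an illustrative sketch only; `read_text_ref` and the sample contents are hypothetical, not the cuDF implementation.

```python
import io

def read_text_ref(buf, delim):
    # CPU sketch of the proposed semantics: read all the text and split it
    # on a (possibly multi-character) delimiter, yielding one record per row.
    text = buf.read()
    records = text.split(delim)
    if records and records[-1] == '':
        records.pop()  # drop the empty record after a trailing delimiter
    return records

# A FASTA-like buffer where '*\n' terminates each record:
sample = io.StringIO(">seq1\nACGT\n*\n>seq2\nTTGA\n*\n>seq3\nGGCC\n*\n")
print(read_text_ref(sample, '*\n'))  # three records, one per sequence
```

Whether a trailing delimiter should produce an empty final record is one of the semantic details the API design would need to settle; the sketch drops it.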