[FEA] Support multi-character newline delimiters with a read_text API #8557

randerzander · 2021-06-18T18:20:49Z

There are many semi-structured text file formats which require domain-specific parsing logic to load into a DataFrame before analysis can be performed. For example, in genomics there are the FASTA and VCF formats, each with various Python libraries that ease parsing into Pandas DataFrames.

In cuDF, we have quite a few string functions we can use to implement custom parsing logic for text formats, but there's an initial difficulty in loading the raw text into rows in a DataFrame. It's possible to use read_csv with a newline delimiter, but that's limited to a single character, and there's significant overhead using read_csv when the files aren't actually CSVs.

It would be useful to have a read_text API which loads text into rows, but supports splitting records on a multi-character delimiter.

Borrowing from the FASTA file example, something like cudf.read_text('fasta.example', delim='*\n') would result in a single string column with 3 records.

;LCBO - Prolactin precursor - Bovine
; a sample sequence in FASTA format
MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS
EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL
VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED
ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*

>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

It's possible to do this parsing on the CPU and construct a cudf.DataFrame from objects in host memory, but that also has performance penalties.
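To make the requested semantics concrete, here is a minimal host-side sketch in plain Python of the CPU approach mentioned above. `split_records` is a hypothetical helper for illustration only, not a cuDF API, and the sample is a shortened stand-in for the FASTA example (sequences truncated). Unlike what a GPU reader might do, this sketch drops the delimiter from each record.

```python
def split_records(text, delim):
    """Split text into records on a multi-character delimiter (host-side)."""
    if not delim:
        raise ValueError("delim must be a non-empty string")
    records = text.split(delim)
    # A trailing delimiter leaves an empty final piece; drop it.
    if records and records[-1] == "":
        records.pop()
    return records


# Shortened stand-in for the FASTA example above (sequences truncated).
sample = (
    ";LCBO - Prolactin precursor - Bovine\n"
    "MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS\n"
    "ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*\n"
    "\n"
    ">MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken\n"
    "DIDGDGQVNYEEFVQMMTAK*\n"
    "\n"
    ">gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]\n"
    "IENY\n"
)

records = split_records(sample, "*\n")
print(len(records))  # 3
```

Each element of `records` corresponds to one row of the single string column described above.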


randerzander · 2021-06-18T18:24:44Z

cc @elstehle, @davidwendt

cwharris · 2021-06-18T18:59:25Z

How large are the files we'd expect to see with these file formats?

randerzander · 2021-06-18T19:31:18Z

> How large are the files we'd expect to see with these file formats?

"It depends" :) Genomes are a few GB raw and compress pretty well, down to a few MB. But a great many analyses call for work across multiple files, so the total size can get large quickly.

This implies the need for:
a. Compression support
b. Reading lists of multiple files into a single DataFrame like we can do with existing readers
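The two requirements can be illustrated on the host with the standard library alone. This is only a sketch of the desired behavior, not a proposal for the cuDF implementation: `read_records` is a hypothetical name, and treating a `.gz` suffix as the marker for compressed input is an assumption made for the example.

```python
import gzip
from pathlib import Path


def read_records(paths, delim):
    """Read gzip or plain text files and return all records in one list."""
    records = []
    for path in map(Path, paths):
        # Assumption: a ".gz" suffix marks gzip-compressed input.
        opener = gzip.open if path.suffix == ".gz" else open
        # newline="" disables newline translation so "*\n" matches as written.
        with opener(path, "rt", encoding="utf-8", newline="") as handle:
            pieces = handle.read().split(delim)
        # Drop empty pieces, e.g. the one left by a trailing delimiter.
        records.extend(piece for piece in pieces if piece)
    return records
```

Calling `read_records(["a.fasta.gz", "b.fasta"], "*\n")` (hypothetical file names) would return the records of both files in order, which could then be loaded into a single column.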

cc @VibhuJawa

Adds `multibyte_split` API, part of #8557. Takes one large text input and splits it into a single strings column.

- Features:
  - [x] split on multi-byte delimiters
  - [x] split on multiple delimiters simultaneously
  - [ ] erase delimiters from output (will implement later)
  - [ ] replace delimiters with alternate text (will implement later)
- Supported input types
  - [x] `cudf::io::text::data_chunk_source`
  - [x] `cudf::string_scalar` via `cudf::device_span`
  - [x] `std::string` via `std::istream`
  - [x] files via `std::istream`
- Supported delimiter type
  - [x] `std::string`
- Performance Goals
  - [x] ~2G/s from file, ~4G/s on-device. There is room for improvement, but perf is good enough for now.
- Additional goals:
  - [x] add reusable block-level pattern-matching utility.
  - [ ] add reusable block-level utility to "peek" at "future" scan states (will implement with delimiter erasure).

Authors:
- Christopher Harris (https://github.com/cwharris)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- AJ Schmidt (https://github.com/ajschmidt8)
- Elias Stehle (https://github.com/elstehle)
- Vukasin Milovanovic (https://github.com/vuule)
- Devavret Makkar (https://github.com/devavret)
- Jake Hemstad (https://github.com/jrhemstad)

URL: #8702
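As a rough illustration of the feature list, the following plain-Python sketch splits on several multi-byte delimiters at once. It assumes that delimiters stay attached to the end of each row, which is what the unchecked "erase delimiters from output" item suggests. `multibyte_split_sketch` is a hypothetical host-side helper and says nothing about how the GPU implementation works. When several delimiters match at the same position, the sketch simply takes the first one listed.

```python
def multibyte_split_sketch(text, delimiters):
    """Split text after each occurrence of any delimiter, keeping the delimiter."""
    if not delimiters or any(not d for d in delimiters):
        raise ValueError("delimiters must be non-empty strings")
    rows = []
    start = 0  # start of the current row
    pos = 0    # current scan position
    while pos < len(text):
        # Pick the first listed delimiter that matches at this position, if any.
        match = next((d for d in delimiters if text.startswith(d, pos)), None)
        if match is None:
            pos += 1
            continue
        # Close the current row just after the delimiter.
        pos += len(match)
        rows.append(text[start:pos])
        start = pos
    # Whatever follows the last delimiter forms the final row.
    if start < len(text):
        rows.append(text[start:])
    return rows


rows = multibyte_split_sketch("a*\nb::c*\nd", ["*\n", "::"])
print(rows)  # ['a*\n', 'b::', 'c*\n', 'd']
```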

quasiben · 2021-08-31T14:35:29Z

Update:

With #8702 merged in, I believe we now need Python bindings

elstehle · 2021-08-31T14:57:49Z

> Update:
>
> With #8702 merged in, I believe we now need Python bindings

@jdye64 is working on a PR for those in #8998

jdye64 · 2021-08-31T15:22:42Z

> > Update:
> > With #8702 merged in, I believe we now need Python bindings
>
> @jdye64 is working on a PR for those in #8998

Yes, I will update those bindings right now, if they need it, and then change the PR from WIP to ready for review. I know a few things need to change: for example, multiple inputs need to be removed from the bindings, since @cwharris made some modifications that forbid multiple inputs.

Provides the Python/Cython bindings for #8702 multibyte_split. This PR depends on #8702 being merged first. Closes #8557

Authors:
- Jeremy Dyer (https://github.com/jdye64)
- Christopher Harris (https://github.com/cwharris)

Approvers:
- https://github.com/nvdbaranec
- Vyas Ramasubramani (https://github.com/vyasr)
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #8998

randerzander added the feature request, Python, and cuIO labels Jun 18, 2021

randerzander assigned cwharris Jun 18, 2021

cwharris mentioned this issue Jul 9, 2021: multibyte_split #8702 (merged, 12 tasks)

vuule assigned jdye64 Aug 31, 2021

vuule mentioned this issue Sep 15, 2021: Python/Cython bindings for multibyte_split #8998 (merged)

rapids-bot bot closed this as completed in #8998 Sep 17, 2021
