[FEA] Support multi-character newline delimiters with a read_text API #8557

Closed
randerzander opened this issue Jun 18, 2021 · 6 comments · Fixed by #8998
Labels: cuIO, feature request, Python

Comments

@randerzander
Contributor

There are many semi-structured text file formats which require domain specific parsing logic to load into a DataFrame before analysis can be performed. For example, in genomics there are FASTA and VCF formats, each with various Python libraries that ease parsing into Pandas DataFrames.

In cuDF, we have quite a few string functions we can use to implement custom parsing logic for text formats, but there's an initial difficulty in loading the raw text into rows in a DataFrame. It's possible to use read_csv with a newline delimiter, but that's limited to a single character, and there's significant overhead using read_csv when the files aren't actually CSVs.

It would be useful to have a read_text API which loads text into rows, but supports splitting records on a multi-character delimiter.

Borrowing from the FASTA file example, something like cudf.read_text('fasta.example', delim='*\n') would result in a single string column with 3 records.

;LCBO - Prolactin precursor - Bovine
; a sample sequence in FASTA format
MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS
EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL
VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED
ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*

>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

It's possible to do this parsing on the CPU and construct a cudf.DataFrame from objects in host memory, but that also has performance penalties.
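
For reference, the CPU-side splitting mentioned above can be sketched in plain Python. This is only an illustration of the record semantics being requested, not cuDF code: the helper name `split_records` and its `keep_delim` flag are hypothetical, and retaining the delimiter at the end of each record is an assumption of this sketch.

```python
from typing import List


def split_records(text: str, delim: str, keep_delim: bool = True) -> List[str]:
    """Split text into records on a (possibly multi-character) delimiter.

    Hypothetical helper for illustration; not part of cuDF.
    """
    if not delim:
        raise ValueError("delim must be a non-empty string")
    records: List[str] = []
    start = 0
    while True:
        idx = text.find(delim, start)
        if idx == -1:
            break
        end = idx + len(delim)
        # Keep or drop the delimiter at the end of the record.
        records.append(text[start:end] if keep_delim else text[start:idx])
        start = end
    # Any text after the last delimiter forms a final record.
    if start < len(text):
        records.append(text[start:])
    return records


# Shortened stand-in for the FASTA sample: two records end in "*\n", the last does not.
sample = ";LCBO\nMDSK*\n\n>MCHU\nMADQ*\n\n>gi|5524211\nLCLY\nIENY\n"
records = split_records(sample, "*\n")
```

The resulting list of Python strings would still have to be copied from host memory to the GPU before it becomes a cuDF column, which is the overhead a native reader would avoid.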

randerzander added the feature request, Python, and cuIO labels on Jun 18, 2021
@randerzander
Contributor Author

cc @elstehle, @davidwendt

@cwharris
Contributor

How large are the files we'd expect to see with these file formats?

@randerzander
Contributor Author

> How large are the files we'd expect to see with these file formats?

"It depends" :) Genomes are a few GB raw and can compress pretty well, down to a few MB. But I'd say a great many analyses call for work across multiple files, so you can get large quickly.

This implies the need for:
a. Compression support
b. Reading lists of multiple files into a single DataFrame like we can do with existing readers
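
A rough CPU-only sketch of what (a) and (b) could look like together, using only the Python standard library. The function name `read_text_records`, the choice to detect gzip by the `.gz` extension, and dropping delimiters from the output are all assumptions made for illustration, not existing cuDF behavior.

```python
import gzip
from pathlib import Path
from typing import Iterable, List


def read_text_records(paths: Iterable[str], delim: str) -> List[str]:
    """Read plain or gzip-compressed text files and split them into records.

    Hypothetical helper for illustration; not part of cuDF.
    """
    if not delim:
        raise ValueError("delim must be a non-empty string")
    records: List[str] = []
    for path in paths:
        # Pick the opener from the file extension (an assumption of this sketch).
        opener = gzip.open if Path(path).suffix == ".gz" else open
        # newline="" disables newline translation so "\n" in delim matches the file bytes.
        with opener(path, "rt", encoding="utf-8", newline="") as handle:
            text = handle.read()
        # str.split drops the delimiter; empty pieces (for example after a
        # trailing delimiter) are skipped.
        records.extend(part for part in text.split(delim) if part)
    return records
```

Records from every file end up in one flat list, in the order the paths were given, which mirrors how a list of files maps to a single DataFrame.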

cc @VibhuJawa

@cwharris cwharris mentioned this issue Jul 9, 2021
12 tasks
rapids-bot bot pushed a commit that referenced this issue Aug 24, 2021
Adds `multibyte_split` API, part of #8557. Takes one large text input and splits it into a single strings column.

- Features:
  - [x] split on multi-byte delimiters
  - [x] split on multiple delimiters simultaneously
  - [ ] erase delimiters from output (will implement later)
  - [ ] replace delimiters with alternate text (will implement later)
- Supported input types
  - [x] `cudf::io::text::data_chunk_source`
    - [x] `cudf::string_scalar` via `cudf::device_span`
    - [x] `std::string` via `std::istream`
    - [x] files via `std::istream`
- Supported delimiter type
  - [x] `std::string`
- Performance Goals
  - [x] ~2G/s from file, ~4G/s on-device. There is room for improvement, but perf is good enough for now.
- Additional goals:
  - [x] add reusable block-level pattern-matching utility.
  - [ ] add reusable block-level utility to "peek" at "future" scan states (will implement with delimiter erasure).
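
As a CPU-only illustration of the "split on multiple delimiters simultaneously" feature listed above, the following Python sketch uses the standard `re` module. It says nothing about how the GPU implementation works; the name `split_on_any`, keeping delimiters in the output (consistent with delimiter erasure being unchecked above), and preferring the longest delimiter when several match at the same position are assumptions of this sketch.

```python
import re
from typing import List, Sequence


def split_on_any(text: str, delims: Sequence[str]) -> List[str]:
    """Split text after every occurrence of any delimiter, keeping the delimiter.

    Hypothetical helper for illustration; not the libcudf algorithm.
    """
    if not delims or any(not d for d in delims):
        raise ValueError("delims must contain only non-empty strings")
    # Try longer delimiters first so that, at a given position, the longest
    # matching delimiter wins (an assumption of this sketch).
    ordered = sorted(delims, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(d) for d in ordered))
    records: List[str] = []
    start = 0
    for match in pattern.finditer(text):
        records.append(text[start:match.end()])
        start = match.end()
    # Text after the last delimiter forms a final record.
    if start < len(text):
        records.append(text[start:])
    return records


parts = split_on_any("a::b|c::d", ["::", "|"])
```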

Authors:
  - Christopher Harris (https://github.com/cwharris)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Elias Stehle (https://github.com/elstehle)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Devavret Makkar (https://github.com/devavret)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #8702
@quasiben
Member

Update:

With #8702 merged in, I believe we now need Python bindings

@elstehle
Contributor

> Update:
>
> With #8702 merged in, I believe we now need Python bindings

@jdye64 is working on a PR for those in #8998

@jdye64
Contributor

jdye64 commented Aug 31, 2021

> > Update:
> > With #8702 merged in, I believe we now need Python bindings
>
> @jdye64 is working on a PR for those in #8998

Yes, I'll update those bindings now, if they need it, and then change the PR from WIP to ready for review. I know a few things need to change: @cwharris made some modifications so that multiple inputs are forbidden, so multiple-input support has to be removed from the bindings.

rapids-bot bot pushed a commit that referenced this issue Sep 17, 2021
Provides the Python/Cython bindings for #8702 multibyte_split. This PR depends on #8702 being merged first.

Closes #8557

Authors:
  - Jeremy Dyer (https://github.com/jdye64)
  - Christopher Harris (https://github.com/cwharris)

Approvers:
  - https://github.com/nvdbaranec
  - Vyas Ramasubramani (https://github.com/vyasr)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #8998