
[FEA] Support reading "chunks" of large text files w/ cudf.read_text #9655

Closed
randerzander opened this issue Nov 10, 2021 · 2 comments · Fixed by #10150
Assignees
Labels
cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.

Comments

@randerzander
Contributor

While using cudf's read_text API, I've come across datasets with single files that are much larger (200 GB+) than any single GPU's memory.

I can parse and split them into smaller pieces on the CPU first, but that is slow and expensive.

Ideally, read_text would support parameters like read_csv's nrows and skiprows, so I can read either serially in batches or in parallel with multiple GPUs.

For example, with input
file.txt:

```
0somestring*EOR*
1someotherstring*EOR*
2someotherhugeVERYlongstring*EOR*
```

I could do:

```python
import cudf

delim = '*EOR*\n'
ser1 = cudf.read_text('file.txt', delimiter=delim, nrows=2)
process_data(ser1)

del ser1
ser2 = cudf.read_text('file.txt', delimiter=delim, skiprows=2)
process_data(ser2)
```

I think implementing this correctly without corrupting or dropping records would depend on reading an arbitrary number of bytes in chunks, and detecting that:

  1. At the end of one chunk, a partial record was read (no EOR was seen) and should be dropped
  2. At the beginning of the next chunk, a partial record was read, and I should read backwards an additional number of bytes (back to the previous EOR)
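The two boundary rules above can be sketched on the CPU. Rather than re-reading backwards, the snippet below carries the partial tail of each chunk forward into the next one, which is equivalent (`iter_records` is a hypothetical helper written for illustration, not part of cuDF):

```python
import io

def iter_records(f, delimiter, chunk_size=64):
    """Yield delimiter-terminated records from fixed-size chunks.

    A partial record at the end of a chunk (no EOR seen yet) is not
    emitted; it is carried over and completed by the next chunk, so no
    record is dropped or split at a chunk boundary.
    """
    carry = ""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            if carry:
                yield carry  # file did not end with a delimiter
            return
        parts = (carry + chunk).split(delimiter)
        carry = parts.pop()  # partial trailing record, if any
        for rec in parts:
            yield rec + delimiter

data = "0somestring*EOR*\n1someotherstring*EOR*\n2someotherhugeVERYlongstring*EOR*\n"
records = list(iter_records(io.StringIO(data), "*EOR*\n", chunk_size=8))
```

Note the chunk size (8) is deliberately smaller than both the records and the delimiter span, so every boundary case is exercised.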

cc @quasiben , @cwharris

@randerzander randerzander added feature request New feature or request Needs Triage Need team to review and classify labels Nov 10, 2021
@cwharris cwharris self-assigned this Nov 11, 2021
@cwharris
Contributor

cwharris commented Nov 11, 2021

> I think implementing this correctly without corrupting or dropping records would depend on reading an arbitrary number of bytes in chunks, and detecting that:
> 1. At the end of one chunk, a partial record was read (no EOR was seen) and should be dropped
> 2. At the beginning of the next chunk, a partial record was read, and I should read backwards an additional number of bytes (back to the previous EOR)

I think this is a good understanding of the problem, and a viable solution. It is similar to how we handle CSV parsing today. However, this approach is limited: each GPU must read from the beginning of the file up through its relevant byte range, whereas ideally each GPU reads only its own byte range.

@randerzander and I discussed breaking the work into multiple phases to reduce the overall number of bytes each GPU needs to read from the input file.

  1. each GPU reads the relevant byte range, and uses it to produce a GPU-partial aggregation of the machine state.
  2. inclusive scan the GPU-partial machine states to produce true starting state seeds for each GPU.
  3. each GPU uses its true starting state to begin parsing records, ignoring all characters up to the first record start, and greedily taking the last record, regardless of whether it extends past the byte range end. In this way, if a record begins in a GPU's byte range, it is read entirely by that GPU.
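As a toy, single-process model of these three phases: parser state here is just "how many leading characters of the delimiter are currently matched" (a small KMP-style automaton). The function names and state model are illustrative sketches of the idea, not cuDF internals:

```python
def failure(delim):
    # KMP failure function for the delimiter pattern.
    f = [0] * len(delim)
    k = 0
    for i in range(1, len(delim)):
        while k and delim[i] != delim[k]:
            k = f[k - 1]
        if delim[i] == delim[k]:
            k += 1
        f[i] = k
    return f

def step(state, ch, delim, f):
    # Advance the delimiter-matching state by one character.
    while state and ch != delim[state]:
        state = f[state - 1]
    if ch == delim[state]:
        state += 1
    if state == len(delim):
        return 0, True  # full delimiter seen: record boundary
    return state, False

def chunk_transitions(chunk, delim, f):
    # Phase 1: for every possible entry state, compute the exit state
    # after scanning only this chunk's bytes (a partial aggregation).
    trans = []
    for entry in range(len(delim)):
        s = entry
        for ch in chunk:
            s, _ = step(s, ch, delim, f)
        trans.append(s)
    return trans

def scan_entry_states(all_trans):
    # Phase 2: inclusive scan (function composition) yields the true
    # entry state for each chunk.
    entries, s = [], 0
    for trans in all_trans:
        entries.append(s)
        s = trans[s]
    return entries

def record_starts(chunks, delim):
    # Phase 3: seeded with its true entry state, each chunk finds the
    # record starts in its own byte range without re-reading earlier bytes.
    f = failure(delim)
    entries = scan_entry_states([chunk_transitions(c, delim, f) for c in chunks])
    starts, base = [0], 0
    for chunk, s in zip(chunks, entries):
        for i, ch in enumerate(chunk):
            s, matched = step(s, ch, delim, f)
            if matched:
                starts.append(base + i + 1)
        base += len(chunk)
    return starts

# "abc..def..ghi.." split into four byte ranges; delimiter ".." straddles
# two of the chunk boundaries, which the scanned entry states resolve.
offsets = record_starts(["abc.", ".def", "..gh", "i.."], "..")
```

In a real multi-GPU version each `chunk_transitions` call runs independently on its own device, and only the tiny per-chunk transition tables are exchanged for the scan.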

NOTE:

Both this approach and the current CSV approach read the file multiple times. The primary difference is that by sharing information across GPUs, we eliminate the need for each GPU to read from the beginning of the file.

@beckernick beckernick added cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Nov 12, 2021
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

rapids-bot bot pushed a commit that referenced this issue Mar 1, 2022
Adding byte_range support to multibyte_split/read_text.

Closes #9655

Providing a byte range as `(offset, size)` lets multibyte_split read a whole file but return only the offsets that fall within that range, plus one additional offset (unless it is the end of the file). In terms of "records", where each delimiter marks the end of a record, we effectively return exactly the records which _begin_ within the byte range and ignore all others: a record which ends (but does not begin) within the range is skipped, while a record which begins in the range is returned even if it _ends_ outside of the range.

examples:
```
input: "abc..def..ghi..jkl.."
delimiter: ..
```
```
range offset: 0
range size: 2
output: ["abc.."]
```
```
range offset: 2
range size: 9
output: ["def..", "ghi.."]
```
```
range offset: 11
range size: 2
output: []
```
```
range offset: 13
range size: 7
output: ["jkl..", ""]
```
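The semantics above can be mimicked in plain Python as an executable reference (`read_text_byte_range` is a hypothetical stand-in written for illustration, not the cuDF API):

```python
def read_text_byte_range(text, delimiter, offset, size):
    # Record start offsets: 0, plus the position after each delimiter.
    starts = [0]
    i = text.find(delimiter)
    while i != -1:
        starts.append(i + len(delimiter))
        i = text.find(delimiter, i + len(delimiter))
    # Slice records; each record keeps its terminating delimiter.
    records = [text[s:e] for s, e in zip(starts, starts[1:] + [len(text)])]
    # Keep records that *begin* inside [offset, offset + size); a range
    # reaching the end of the file also claims the final (possibly empty)
    # record, matching the trailing-"" example above.
    end = offset + size
    out = [r for s, r in zip(starts, records) if offset <= s < end]
    if end >= len(text) and starts[-1] >= end:
        out.append(records[-1])
    return out

text = "abc..def..ghi..jkl.."
```

Against the examples: `read_text_byte_range(text, "..", 2, 9)` yields `["def..", "ghi.."]`, and a range reaching the end of the file also yields the trailing empty record.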

Authors:
  - Christopher Harris (https://github.com/cwharris)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Vukasin Milovanovic (https://github.com/vuule)
  - David Wendt (https://github.com/davidwendt)
  - Robert Maynard (https://github.com/robertmaynard)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #10150