
[FEA] Support reading "chunks" of large text files w/ cudf.read_text #9655

Closed
randerzander opened this issue Nov 10, 2021 · 2 comments · Fixed by #10150
Assignees
Labels
cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.

Comments

@randerzander
Contributor

While using cudf's read_text API, I've come across datasets with single files that are much larger (200 GB+) than any single GPU's memory.

I can parse and split them into smaller pieces on the CPU first, but that is slow and expensive.

Ideally, read_text would support parameters like read_csv's nrows and skiprows, so I can read either serially in batches or in parallel with multiple GPUs.

For example, with input
file.txt:

```
0somestring*EOR*
1someotherstring*EOR*
2someotherhugeVERYlongstring*EOR*
```

I could do:

```python
import cudf

delim = '*EOR*\n'
ser1 = cudf.read_text('file.txt', delimiter=delim, nrows=2)
process_data(ser1)

del ser1
ser2 = cudf.read_text('file.txt', delimiter=delim, skiprows=2)
process_data(ser2)
```

I think implementing this correctly without corrupting or dropping records would depend on reading an arbitrary number of bytes in chunks, and detecting that:

  1. At the end of one chunk, a partial record was read (no EOR was seen) and should be dropped
  2. At the beginning of the next chunk, a partial record was read, and I should read backwards an additional number of bytes (back to the previous EOR)
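The two boundary rules above can be sketched on the CPU. Rather than re-reading backwards, the snippet below carries the partial tail of each chunk forward into the next one, which is equivalent (`iter_records` is a hypothetical helper written for illustration, not part of cuDF):

```python
import io

def iter_records(f, delimiter, chunk_size=64):
    """Yield delimiter-terminated records from fixed-size chunks.

    A partial record at the end of a chunk (no EOR seen yet) is not
    emitted; it is carried over and completed by the next chunk, so no
    record is dropped or split at a chunk boundary.
    """
    carry = ""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            if carry:
                yield carry  # file did not end with a delimiter
            return
        parts = (carry + chunk).split(delimiter)
        carry = parts.pop()  # partial trailing record, if any
        for rec in parts:
            yield rec + delimiter

data = "0somestring*EOR*\n1someotherstring*EOR*\n2someotherhugeVERYlongstring*EOR*\n"
records = list(iter_records(io.StringIO(data), "*EOR*\n", chunk_size=8))
```

Note the chunk size (8) is deliberately smaller than both the records and the delimiter span, so every boundary case is exercised.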

cc @quasiben , @cwharris

@randerzander randerzander added feature request New feature or request Needs Triage Need team to review and classify labels Nov 10, 2021
@cwharris cwharris self-assigned this Nov 11, 2021
@cwharris
Contributor

cwharris commented Nov 11, 2021

> I think implementing this correctly without corrupting or dropping records would depend on reading an arbitrary number of bytes in chunks, and detecting that:
> 1. At the end of one chunk, a partial record was read (no EOR was seen) and should be dropped
> 2. At the beginning of the next chunk, a partial record was read, and I should read backwards an additional number of bytes (back to the previous EOR)

I think this is a good understanding of the problem, and a viable solution. It is similar to how we handle CSV parsing today. However, this approach is limited: each GPU must read from the beginning of the file up through its relevant byte range, whereas ideally each GPU reads only its own byte range.

@randerzander and I discussed breaking the work into multiple phases to reduce the overall number of bytes each GPU needs to read from the input file.

  1. each GPU reads the relevant byte range, and uses it to produce a GPU-partial aggregation of the machine state.
  2. inclusive scan the GPU-partial machine states to produce true starting state seeds for each GPU.
  3. each GPU uses its true starting state to begin parsing records, ignoring all characters up to the first record start, and greedily taking the last record, regardless of whether it extends past the byte range end. In this way, if a record begins in a GPU's byte range, it is read entirely by that GPU.
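As a toy, single-process model of these three phases: parser state here is just "how many leading characters of the delimiter are currently matched" (a small KMP-style automaton). The function names and state model are illustrative sketches of the idea, not cuDF internals:

```python
def failure(delim):
    # KMP failure function for the delimiter pattern.
    f = [0] * len(delim)
    k = 0
    for i in range(1, len(delim)):
        while k and delim[i] != delim[k]:
            k = f[k - 1]
        if delim[i] == delim[k]:
            k += 1
        f[i] = k
    return f

def step(state, ch, delim, f):
    # Advance the delimiter-matching state by one character.
    while state and ch != delim[state]:
        state = f[state - 1]
    if ch == delim[state]:
        state += 1
    if state == len(delim):
        return 0, True  # full delimiter seen: record boundary
    return state, False

def chunk_transitions(chunk, delim, f):
    # Phase 1: for every possible entry state, compute the exit state
    # after scanning only this chunk's bytes (a partial aggregation).
    trans = []
    for entry in range(len(delim)):
        s = entry
        for ch in chunk:
            s, _ = step(s, ch, delim, f)
        trans.append(s)
    return trans

def scan_entry_states(all_trans):
    # Phase 2: inclusive scan (function composition) yields the true
    # entry state for each chunk.
    entries, s = [], 0
    for trans in all_trans:
        entries.append(s)
        s = trans[s]
    return entries

def record_starts(chunks, delim):
    # Phase 3: seeded with its true entry state, each chunk finds the
    # record starts in its own byte range without re-reading earlier bytes.
    f = failure(delim)
    entries = scan_entry_states([chunk_transitions(c, delim, f) for c in chunks])
    starts, base = [0], 0
    for chunk, s in zip(chunks, entries):
        for i, ch in enumerate(chunk):
            s, matched = step(s, ch, delim, f)
            if matched:
                starts.append(base + i + 1)
        base += len(chunk)
    return starts

# "abc..def..ghi.." split into four byte ranges; delimiter ".." straddles
# two of the chunk boundaries, which the scanned entry states resolve.
offsets = record_starts(["abc.", ".def", "..gh", "i.."], "..")
```

In a real multi-GPU version each `chunk_transitions` call runs independently on its own device, and only the tiny per-chunk transition tables are exchanged for the scan.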

NOTE:

Both this approach and the current CSV approach read the file multiple times. The primary difference is that by sharing information across GPUs, we eliminate the need for each GPU to read from the beginning of the file.

@beckernick beckernick added cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Nov 12, 2021
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

rapids-bot bot pushed a commit that referenced this issue Mar 1, 2022
Adding byte_range support to multibyte_split/read_text.

Closes #9655

Providing a byte range as `(offset, size)` lets multibyte_split read a whole file but return only the offsets that fall within that range, plus one additional offset (unless it is the end of the file). In terms of "records", where each delimiter marks the end of a record, we effectively return exactly the records which _begin_ within the byte range and ignore all others: a record which ends (but does not begin) within the range is skipped, while a record which begins in the range is returned even if it _ends_ outside of the range.

examples:
```
input: "abc..def..ghi..jkl.."
delimiter: ..
```
```
range offset: 0
range size: 2
output: ["abc.."]
```
```
range offset: 2
range size: 9
output: ["def..", "ghi.."]
```
```
range offset: 11
range size: 2
output: []
```
```
range offset: 13
range size: 7
output: ["jkl..", ""]
```
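The semantics above can be mimicked in plain Python as an executable reference (`read_text_byte_range` is a hypothetical stand-in written for illustration, not the cuDF API):

```python
def read_text_byte_range(text, delimiter, offset, size):
    # Record start offsets: 0, plus the position after each delimiter.
    starts = [0]
    i = text.find(delimiter)
    while i != -1:
        starts.append(i + len(delimiter))
        i = text.find(delimiter, i + len(delimiter))
    # Slice records; each record keeps its terminating delimiter.
    records = [text[s:e] for s, e in zip(starts, starts[1:] + [len(text)])]
    # Keep records that *begin* inside [offset, offset + size); a range
    # reaching the end of the file also claims the final (possibly empty)
    # record, matching the trailing-"" example above.
    end = offset + size
    out = [r for s, r in zip(starts, records) if offset <= s < end]
    if end >= len(text) and starts[-1] >= end:
        out.append(records[-1])
    return out

text = "abc..def..ghi..jkl.."
```

Against the examples: `read_text_byte_range(text, "..", 2, 9)` yields `["def..", "ghi.."]`, and a range reaching the end of the file also yields the trailing empty record.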

Authors:
  - Christopher Harris (https://github.com/cwharris)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Vukasin Milovanovic (https://github.com/vuule)
  - David Wendt (https://github.com/davidwendt)
  - Robert Maynard (https://github.com/robertmaynard)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #10150