[FEA] Support reading "chunks" of large text files w/ cudf.read_text #9655
I think this is a good understanding of the problem, and a viable solution. It is similar to how we handle CSV parsing today. However, this approach is limited in that each GPU must read from the beginning of the file up through the relevant byte range, whereas ideally each GPU would only read the relevant byte range. @randerzander and I discussed breaking the API into two phases to reduce the overall number of bytes each GPU needs to read from the input file.
NOTE: Both this approach and the current CSV approach read the file multiple times. The primary difference is that by sharing information across GPUs, we eliminate the need for each GPU to read from the beginning of the file.
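As a rough, single-process illustration of that two-phase idea (plain Python, not the cudf implementation; the file name, delimiter, and range count are arbitrary), each byte range could first be scanned independently for delimiter offsets, and the merged offsets would then determine where each range's records begin and end, without any range re-reading the file from the start:

```python
# Sketch only: phase 1 scans each byte range independently for delimiter
# offsets; phase 2 merges those offsets into global record boundaries, so
# no range has to re-read from the beginning of the file.
NUM_RANGES = 4
DELIM = b"\n"  # a single-byte delimiter keeps the sketch simple

def scan_range(path, offset, size):
    """Phase 1: absolute offsets just past each delimiter inside [offset, offset + size)."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(size)
    return [offset + i + len(DELIM)
            for i in range(len(data)) if data[i:i + len(DELIM)] == DELIM]

def record_boundaries(path, file_size):
    """Phase 2: combine per-range delimiter offsets into (start, end) record boundaries."""
    chunk = -(-file_size // NUM_RANGES)  # ceiling division
    ranges = [(i * chunk, max(0, min(chunk, file_size - i * chunk)))
              for i in range(NUM_RANGES)]
    # In a multi-GPU setting each range would be scanned by a different GPU;
    # only these small offset lists need to be exchanged, not the bytes themselves.
    ends = sorted(e for off, size in ranges for e in scan_range(path, off, size))
    starts = [0] + ends
    return list(zip(starts, ends + [file_size]))
```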
Adding byte_range support to multibyte_split/read_text. Closes #9655.

Providing a byte range in terms of `(offset, size)` allows multibyte_split to read a whole file, but only return the offsets within that range, plus one additional offset (unless it's the end of the file). If thinking in terms of "records", where each delimiter dictates the end of a record, we effectively return all records which _begin_ within the byte range provided (even if they _end_ outside of it), and ignore all other records, including any record which ends (but does not begin) within the range.

Examples:

```
input: "abc..def..ghi..jkl.."
delimiter: ..
```

```
range offset: 0
range size: 2
output: ["abc.."]
```

```
range offset: 2
range size: 9
output: ["def..", "ghi.."]
```

```
range offset: 11
range size: 2
output: []
```

```
range offset: 13
range size: 7
output: ["jkl..", ""]
```

Authors:
- Christopher Harris (https://github.com/cwharris)

Approvers:
- AJ Schmidt (https://github.com/ajschmidt8)
- Vukasin Milovanovic (https://github.com/vuule)
- David Wendt (https://github.com/davidwendt)
- Robert Maynard (https://github.com/robertmaynard)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #10150
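For reference, a small usage sketch of the `byte_range` parameter described above (the file name is hypothetical; the expected output follows the PR's second example):

```python
import cudf

# "records.txt" is a hypothetical file containing exactly: abc..def..ghi..jkl..
# Only records that *begin* inside bytes [2, 11) are returned; "def.." begins
# at offset 5 and "ghi.." at offset 10, so both fall in the range.
part = cudf.read_text("records.txt", delimiter="..", byte_range=(2, 9))

print(part.to_arrow().to_pylist())  # expected, per the PR examples: ["def..", "ghi.."]
```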
While using cudf's read_text API, I've come across datasets with single files that are much larger (200GB+) than any single GPU's memory.
I can parse and split them into smaller pieces on the CPU first, but that is slow and expensive.
Ideally, `read_text` would support parameters like `read_csv`'s `nrows` & `skiprows`, so I could either read serially in batches or in parallel with multiple GPUs. For example, with an input `file.txt`, I could do:
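A sketch of the kind of batched read being requested, assuming hypothetical `nrows`/`skiprows` parameters modeled on `read_csv` (neither exists on `read_text` today), with a newline delimiter and a placeholder `process` function:

```python
import cudf

batch_size = 1_000_000  # records per batch
skip = 0
while True:
    batch = cudf.read_text(
        "file.txt",
        delimiter="\n",
        skiprows=skip,     # hypothetical parameter, as requested above
        nrows=batch_size,  # hypothetical parameter, as requested above
    )
    if len(batch) == 0:
        break
    process(batch)         # placeholder for whatever per-batch work is needed
    skip += batch_size
```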
I think implementing this correctly, without corrupting or dropping records, would depend on reading an arbitrary number of bytes in chunks and detecting whether a chunk starts or ends in the middle of a record.
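As a rough illustration of that boundary handling in plain Python (not cudf; the delimiter and chunk size are arbitrary), a chunked reader has to notice when a chunk ends mid-record and carry the partial record over into the next chunk:

```python
DELIM = b"\n"

def read_complete_records(path, chunk_size=64 * 1024 * 1024):
    """Yield only complete records, carrying any partial trailing record into the next chunk."""
    leftover = b""
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            data = leftover + data
            # Detect whether the chunk ends mid-record: everything after the
            # last delimiter is incomplete and must be carried over.
            cut = data.rfind(DELIM)
            if cut == -1:
                leftover = data
                continue
            leftover = data[cut + len(DELIM):]
            yield from data[:cut].split(DELIM)
    if leftover:
        yield leftover  # final record with no trailing delimiter
```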
cc @quasiben, @cwharris