[FEA] Expose json_recovery_mode in cudf.read_json Python API, and add a column with filenames index #15559
Comments
Hello @shrshi, would you please add Python bindings for the JSON reader option `json_recovery_mode`? As for the data source indexes, we can leave that for a follow-on PR.
Hello @galipremsagar, it looks like @shrshi is getting pulled deeper into the JSON reader tree algorithms; would you please help us add the Python bindings?
Sure, I'll take care of it.
Thank you @galipremsagar
Fixes: #15559

This PR implements `on_bad_lines` in the JSON reader. When `on_bad_lines="recover"`, bad lines are replaced by `<NA>` values.

Authors:
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Matthew Roeschke (https://github.com/mroeschke)

URL: #15834
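To illustrate the "recover" semantics described above, here is a minimal pure-Python sketch using only the standard library (the `read_jsonl_recover` helper is hypothetical, not part of cudf): unparseable lines are replaced with `None`, analogous to cudf's `<NA>`, instead of aborting the whole read.

```python
import json
from io import StringIO

def read_jsonl_recover(buf):
    """Parse JSON Lines; unparseable lines become None (analogous to <NA>)."""
    records = []
    for line in buf:
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            records.append(None)  # bad line is replaced, not raised
    return records

# The second line has a trailing comma, which is invalid JSON.
data = StringIO('{"a": 1}\n{"a": 2,}\n{"a": 3}\n')
print(read_jsonl_recover(data))  # → [{'a': 1}, None, {'a': 3}]
```

With the default `on_bad_lines="error"` behavior, by contrast, the same malformed line would raise an exception and nothing would be returned.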
Thank you all! Much appreciated! |
Is your feature request related to a problem? Please describe.
Hi!
When passing a list of files to cudf.read_json, the call may fail because one of the JSON objects cannot be parsed for some reason.
With the current Python API we cannot tell which file failed, which makes these failures hard to debug when dealing with thousands of JSONL files (for instance, when curating a corpus for LLM training).
Therefore, it would help if the following functionalities were available:
- Expose `json_recovery_mode` in the `cudf.read_json` Python API.
- Add a column with the filename (data source) index for each row.

With those two functionalities, reading thousands of JSONL files in parallel while handling read exceptions would be much more efficient (reading files as a list is several times faster than reading them one after the other).
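To make the debugging pain point concrete, here is a pure-Python sketch of the workaround one is forced into today: scanning files one by one to locate the failure, which is exactly the slow path the request wants to avoid. The `find_bad_files` helper is hypothetical and uses only the standard library.

```python
import json
import os
import tempfile

def find_bad_files(paths):
    """Return (path, line_number) for each JSON Lines file with a bad line."""
    bad = []
    for path in paths:
        with open(path) as f:
            for lineno, line in enumerate(f, 1):
                line = line.strip()
                if not line:
                    continue
                try:
                    json.loads(line)
                except json.JSONDecodeError:
                    bad.append((path, lineno))
                    break  # report first bad line per file, then move on
    return bad

# Demo: one well-formed file and one with a malformed second line.
tmp = tempfile.mkdtemp()
good = os.path.join(tmp, "good.jsonl")
bad = os.path.join(tmp, "bad.jsonl")
with open(good, "w") as f:
    f.write('{"a": 1}\n{"a": 2}\n')
with open(bad, "w") as f:
    f.write('{"a": 1}\n{"a": 2,}\n')
print(find_bad_files([good, bad]))  # reports bad.jsonl at line 2
```

With a per-row data source index exposed by the reader itself, this O(files) sequential scan would be unnecessary: a failed or recovered row could be traced back to its file directly from the resulting DataFrame.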
Already discussed with @vuule.
Thanks!
P.S.: NeMo Curator may benefit from this FR when performing its Exact Deduplication, Fuzzy Deduplication, and Download and Extract corpus features.