[FEA] Expose json_recovery_mode in cudf.read_json Python API, and add a column with filenames index #15559

Closed
miguelusque opened this issue Apr 17, 2024 · 5 comments · Fixed by #15834

Comments

@miguelusque
Member

Is your feature request related to a problem? Please describe.
Hi!

When passing a list of files to cudf.read_json, the call may fail because a JSON object in any one of the files cannot be parsed.

With the current Python API, we cannot tell which file failed, which makes these failures hard to debug when dealing with thousands of JSONL files (for instance, when curating a corpus for LLM training).

Therefore, it would help if the following features were available:

  • Create a row of nulls when one of the JSON objects in a file cannot be read, and
  • Optionally, add a column containing the index into the list of files passed to cudf.read_json, indicating which file each row of the dataset came from.

With those two features, reading thousands of JSONL files in a single call while still handling per-file parse failures would be much more efficient (reading files passed as a list is several times faster than reading them one after another).
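To make the request concrete, here is a minimal CPU-only sketch of the behavior described above, written with the Python standard library only. It is not cuDF code and says nothing about how cuDF implements its reader; the function name `read_jsonl_with_recovery` and the column name `file_index` are hypothetical, chosen for illustration.

```python
import json
import os
import tempfile


def read_jsonl_with_recovery(paths):
    """Read JSON Lines files; unparseable lines become all-null rows.

    Hypothetical helper: each returned row also carries "file_index",
    the position of its source file in `paths`.
    """
    rows = []
    for file_index, path in enumerate(paths):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue  # ignore blank lines
                try:
                    record = json.loads(line)
                    if not isinstance(record, dict):
                        raise ValueError("expected a JSON object")
                except ValueError:  # json.JSONDecodeError is a ValueError
                    record = None  # remember the bad line as a null row
                rows.append((file_index, record))

    # Union of the keys seen in the good records, in first-seen order.
    columns = []
    for _, record in rows:
        for key in record or {}:
            if key not in columns:
                columns.append(key)

    table = []
    for file_index, record in rows:
        row = {key: (record or {}).get(key) for key in columns}
        row["file_index"] = file_index
        table.append(row)
    return table


# Demo: two small files, the second one containing a truncated object.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.jsonl")
    bad = os.path.join(tmp, "bad.jsonl")
    with open(good, "w", encoding="utf-8") as f:
        f.write('{"id": 1, "text": "a"}\n{"id": 2, "text": "b"}\n')
    with open(bad, "w", encoding="utf-8") as f:
        f.write('{"id": 3, "text": "c"}\n{"id": 4, "text": \n')
    table = read_jsonl_with_recovery([good, bad])

# Rows where every data column is null point at the files that need a closer look.
failed_files = sorted(
    {
        row["file_index"]
        for row in table
        if all(value is None for key, value in row.items() if key != "file_index")
    }
)
print(failed_files)
```

With both pieces in place, one batched read is enough: the null rows mark the parse failures, and the file index says which inputs to inspect or re-read individually.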

Already discussed with @vuule.

Thanks!

P.S.: NeMo Curator may benefit from this FR in its Exact Deduplication, Fuzzy Deduplication, and Download and Extract corpus features.

@miguelusque miguelusque added the feature request New feature or request label Apr 17, 2024
@GregoryKimball
Contributor

GregoryKimball commented May 13, 2024

Hello @shrshi, would you please add Python bindings for the JSON reader option json_recovery_mode?

As for the data source indexes, we can leave that for a follow-on PR.

@GregoryKimball
Contributor

Hello @galipremsagar, it looks like @shrshi is getting pulled deeper into the JSON reader tree algorithms. Would you please help us add the json_recovery_mode Python binding for NeMo Curator?

@galipremsagar
Contributor

Sure, I'll take care of it.

@shrshi
Contributor

shrshi commented May 22, 2024

Thank you @galipremsagar

rapids-bot bot pushed a commit that referenced this issue May 24, 2024
Fixes: #15559 

This PR implements `on_bad_lines` in the JSON reader. When `on_bad_lines="recover"`, bad lines are replaced by `<NA>` values.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)

URL: #15834
@miguelusque
Member Author

Thank you all! Much appreciated!
