[FEA] Expose json_recovery_mode in cudf.read_json Python API, and add a column with filenames index #15559

miguelusque · 2024-04-17T21:29:28Z

Is your feature request related to a problem? Please describe.
Hi!

When passing a list of files to cudf.read_json, the call may fail because one of the JSON objects cannot be parsed for some reason.

With the current Python API, we cannot tell which file failed, which makes these failures hard to debug when dealing with thousands of jsonl files (for instance, when curating a corpus for LLM training).

Therefore, it would help if the following two features were available:

- Create a row of nulls when one of the JSON objects in a file cannot be read, and
- Optionally, add a column containing the index into the list of files passed to cudf.read_json, indicating which file each row in the dataset came from.

With these two features, reading thousands of jsonl files in a single call while handling parse failures would be much more efficient (reading files as a list is several times faster than reading them one after the other).
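To make the requested semantics concrete, here is a minimal CPU-only sketch using just the Python standard library. It is not the cuDF API and not how cuDF implements anything; the function name `read_jsonl_files` and the `(file_index, record)` row layout are hypothetical, chosen only to illustrate "null row on a bad line" plus "file index per row".

```python
import json
from typing import Any, Dict, List, Optional, Sequence, Tuple

# Hypothetical row layout: (index of the source file, parsed object or None).
Row = Tuple[int, Optional[Dict[str, Any]]]


def read_jsonl_files(paths: Sequence[str]) -> List[Row]:
    """Read several JSON Lines files, keeping one row per non-empty line.

    A line that cannot be parsed as a JSON object yields None instead of
    raising, and every row carries the index of the file it came from.
    """
    rows: List[Row] = []
    for file_index, path in enumerate(paths):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue  # skip blank lines
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    record = None  # bad line becomes a null row
                if not isinstance(record, dict):
                    record = None  # only JSON objects count as valid rows
                rows.append((file_index, record))
    return rows
```

With rows shaped like this, the files that contain bad lines can be found by collecting the file indices of all rows whose record is `None`.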

Already discussed with @vuule.

Thanks!

P.S.: NeMo Curator may benefit from this FR in its Exact Deduplication, Fuzzy Deduplication, and Download and Extract corpus features.

GregoryKimball · 2024-05-13T16:19:05Z

Hello @shrshi, would you please add Python bindings for the JSON reader option json_recovery_mode?

As for the data source indexes, we can leave that for a follow-on PR.

GregoryKimball · 2024-05-21T21:47:59Z

Hello @galipremsagar, it looks like @shrshi is getting pulled deeper into the JSON reader tree algorithms. Would you please help us add the json_recovery_mode Python binding for NeMo Curator?

galipremsagar · 2024-05-22T21:10:30Z

Sure, I'll take care of it.

shrshi · 2024-05-22T23:56:49Z

Thank you @galipremsagar

Fixes: #15559

This PR implements `on_bad_lines` in json reader. When `on_bad_lines="recover"`, bad lines are replaced by `<NA>` values.

Authors:
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Matthew Roeschke (https://github.com/mroeschke)

URL: #15834
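A minimal sketch of how the `on_bad_lines="recover"` option from this PR might be used on a list of JSON Lines files, assuming a cuDF build that includes #15834 and a working GPU environment. The helper name `read_jsonl_with_recovery` is hypothetical, and passing `lines=True` with `engine="cudf"` is an assumption made for illustration, not something stated in the PR text.

```python
def read_jsonl_with_recovery(paths):
    """Read JSON Lines files with cuDF, turning unparsable lines into <NA> rows.

    Assumes a cuDF version that accepts `on_bad_lines` in `cudf.read_json`.
    """
    # Imported lazily so this helper can be defined on machines without cuDF.
    import cudf

    return cudf.read_json(
        list(paths),
        lines=True,              # one JSON object per line (jsonl)
        engine="cudf",           # assumption: use the GPU reader explicitly
        on_bad_lines="recover",  # bad lines become <NA> rows instead of raising
    )
```

Note that this covers only the first half of the request. The resulting DataFrame does not say which file a row came from, which is the part tracked separately in #15960.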

miguelusque · 2024-05-26T09:05:56Z

Thank you all! Much appreciated!

miguelusque added the feature request New feature or request label Apr 17, 2024

GregoryKimball assigned shrshi May 13, 2024

GregoryKimball added this to libcudf May 13, 2024

GregoryKimball assigned galipremsagar and unassigned shrshi May 21, 2024

galipremsagar mentioned this issue May 23, 2024

Implement on_bad_lines in json reader #15834

Merged

3 tasks

rapids-bot bot closed this as completed in #15834 May 24, 2024

miguelusque mentioned this issue Jun 9, 2024

[FEA] Add a column with filenames index in cudf.read_json #15960

Open

GregoryKimball removed this from libcudf Jul 22, 2024
