
[FEA] Add a column with filenames index in cudf.read_json #15960

Open
miguelusque opened this issue Jun 9, 2024 · 6 comments
Labels
feature request New feature or request Python Affects Python cuDF API.

Comments

@miguelusque
Member

Hi!

cudf.read_json supports passing multiple files to it, which is much more performant than reading JSON files individually and then merging them.

It would be very useful for certain workloads if cudf.read_json could add a column containing, for each row in the resulting dataset, the index of the input file that row came from.

I would suggest adding a new input parameter, named something like input_file_indexes_series_name, with a default value of None. When set to a string, the indexes of the input files passed to cudf.read_json would be added to a column named as specified by that parameter.

Already discussed with @vuule.

Thanks!

P.S.: NeMo Curator may benefit from this FR when performing its Exact Deduplication, Fuzzy Deduplication, and Download and Extract corpus features.
P.S.: This FR was originally raised here.

@miguelusque miguelusque added the feature request New feature or request label Jun 9, 2024
@brandon-b-miller brandon-b-miller added the Python Affects Python cuDF API. label Jun 10, 2024
@brandon-b-miller
Contributor

Hi @miguelusque ,
Thanks for the feature request. This would be a superset of the pandas read_json API, correct? Other dataframe-like tools like spark have similar functionality such as input_file_name() but I think we want to consider our API carefully any time we are creating string columns in a limited memory environment.

@GregoryKimball
Contributor

Thank you @miguelusque for opening up a follow-on issue about this topic. @shrshi, you've had a lot of success in the multi-source improvements you added to #15930. Would you please share your thoughts about the scope for optional source index tracking as an item for future work?

@miguelusque
Member Author

miguelusque commented Jun 11, 2024

> Hi @miguelusque , Thanks for the feature request. This would be a superset of the pandas read_json API, correct? Other dataframe-like tools like spark have similar functionality such as input_file_name() but I think we want to consider our API carefully any time we are creating string columns in a limited memory environment.

Hi @brandon-b-miller ,

Indeed. That would be a new feature not present in pandas.

Please let me mention that when we discussed this feature internally, I think we agreed that the most efficient option was to add only a column containing the indexes of the files passed to the read_json method, and let the user build the column with the file names from those indexes if needed.

I am happy with a more elaborate API, where you can choose between adding the file names or the file-name indexes. The minimum request from our side is to have at least the file-name indexes, so that we can generate the column with the names ourselves.
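As a workaround today, the same result can be assembled by reading each source separately and tagging rows with the source index. A minimal sketch, using pandas and in-memory JSON Lines sources as stand-ins for cudf and real files (the column names and file names are illustrative, not part of any proposed API):

```python
import io

import pandas as pd

# In-memory JSON Lines "files" standing in for paths passed to read_json;
# pandas is used here as a stand-in for cudf, which mirrors its API.
sources = [
    io.StringIO('{"a": 1}\n{"a": 2}\n'),
    io.StringIO('{"a": 3}\n'),
]
filenames = ["part-0.jsonl", "part-1.jsonl"]  # illustrative names

# Read each source individually, tagging its rows with the source index,
# then concatenate. The requested feature would let the (faster)
# multi-source reader produce this index column directly.
parts = []
for idx, src in enumerate(sources):
    part = pd.read_json(src, lines=True)
    part["source_index"] = idx
    parts.append(part)
df = pd.concat(parts, ignore_index=True)

# The index column can later be mapped back to file names if needed.
df["source_file"] = df["source_index"].map(lambda i: filenames[i])
```

This per-file loop is exactly the slow path the FR wants to avoid; the point of the sketch is only to show what the requested index column would contain.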

@GregoryKimball
Contributor

GregoryKimball commented Jun 29, 2024

@miguelusque could it also work to expose a metadata item that includes the row count per data source?

@miguelusque
Member Author

Hi Gregory, it would depend on how much it would cost, in terms of performance, to reconstruct the dataframe in the desired format. If there is an efficient way to do it from the metadata, that is fine for me.
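For reference, if the reader reported per-source row counts in its metadata, reconstructing the index column would amount to a single repeat over those counts, assuming rows are emitted in source order (which a multi-source read preserves). A NumPy sketch with illustrative counts:

```python
import numpy as np

# Hypothetical per-source row counts, as the reader metadata might
# report them (these values are illustrative).
row_counts = [2, 1, 3]

# Repeat each source index by its row count to obtain a per-row
# source-index column aligned with the concatenated dataframe.
source_index = np.repeat(np.arange(len(row_counts)), row_counts)
```

The cost is O(total rows) and involves no string data, which may address the memory concern raised above about materializing file-name columns.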

@mhaseeb123
Member

mhaseeb123 commented Jul 2, 2024

We have a similar request for the Parquet reader at #15389. We are thinking of adding a vector to table_metadata reporting the number of rows read from each data source, unless AST row-selection filters are being used, in which case an empty vector is returned due to the added computational overhead. @karthikeyann @shrshi
