
[FEA] Add a column with filenames index in cudf.read_json #15960

Open
miguelusque opened this issue Jun 9, 2024 · 6 comments
Labels
feature request New feature or request Python Affects Python cuDF API.

Comments

@miguelusque
Member

Hi!

cudf.read_json supports passing multiple files to it, which is much more performant than reading JSON files individually and then merging them.

It would be very useful for certain workloads if cudf.read_json could add a column containing, for each row in the resulting dataset, the index of the input file that row came from.

I would suggest adding a new input parameter, named something like input_file_indexes_series_name, with a default value of None. When set to a string, the indexes of the input files passed to cudf.read_json would be added to a column named as specified by that parameter.

Already discussed with @vuule.

Thanks!

P.S.: NeMo Curator may benefit from this FR when performing its Exact Deduplication, Fuzzy Deduplication, and Download and Extract corpus features.
P.S.: This FR was originally raised here.

@miguelusque miguelusque added the feature request New feature or request label Jun 9, 2024
@brandon-b-miller brandon-b-miller added the Python Affects Python cuDF API. label Jun 10, 2024
@brandon-b-miller
Contributor

Hi @miguelusque ,
Thanks for the feature request. This would be a superset of the pandas read_json API, correct? Other dataframe-like tools like spark have similar functionality such as input_file_name() but I think we want to consider our API carefully any time we are creating string columns in a limited memory environment.

@GregoryKimball
Contributor

Thank you @miguelusque for opening up a follow-on issue about this topic. @shrshi, you've had a lot of success in the multi-source improvements you added to #15930. Would you please share your thoughts about the scope for optional source index tracking as an item for future work?

@miguelusque
Member Author

miguelusque commented Jun 11, 2024

> Hi @miguelusque , Thanks for the feature request. This would be a superset of the pandas read_json API, correct? Other dataframe-like tools like spark have similar functionality such as input_file_name() but I think we want to consider our API carefully any time we are creating string columns in a limited memory environment.

Hi @brandon-b-miller ,

Indeed. That would be a new feature not present in pandas.

Please let me mention that when we discussed this feature internally, I think we agreed that the most efficient option was to add only a column containing the indexes of the files passed to the read_json method, and let the user build the column with the file names from those indexes if needed.

I am happy with a more elaborate API, where you can choose between adding the file names or the file-name indexes. The minimum request from our side is to have at least the file-name indexes, so that we can generate the column with the names ourselves.
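As a workaround today, the same result can be assembled by reading each source separately and tagging rows with the source index. A minimal sketch, using pandas and in-memory JSON Lines sources as stand-ins for cudf and real files (the column names and file names are illustrative, not part of any proposed API):

```python
import io

import pandas as pd

# In-memory JSON Lines "files" standing in for paths passed to read_json;
# pandas is used here as a stand-in for cudf, which mirrors its API.
sources = [
    io.StringIO('{"a": 1}\n{"a": 2}\n'),
    io.StringIO('{"a": 3}\n'),
]
filenames = ["part-0.jsonl", "part-1.jsonl"]  # illustrative names

# Read each source individually, tagging its rows with the source index,
# then concatenate. The requested feature would let the (faster)
# multi-source reader produce this index column directly.
parts = []
for idx, src in enumerate(sources):
    part = pd.read_json(src, lines=True)
    part["source_index"] = idx
    parts.append(part)
df = pd.concat(parts, ignore_index=True)

# The index column can later be mapped back to file names if needed.
df["source_file"] = df["source_index"].map(lambda i: filenames[i])
```

This per-file loop is exactly the slow path the FR wants to avoid; the point of the sketch is only to show what the requested index column would contain.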

@GregoryKimball
Contributor

GregoryKimball commented Jun 29, 2024

@miguelusque could it also work to expose a metadata item that includes the row count per data source?

@miguelusque
Member Author

Hi Gregory, it would depend on how much it would cost, in terms of performance, to reconstruct the dataframe in the desired format. If there is an efficient way to do it from the metadata, that is fine for me.
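For reference, if the reader reported per-source row counts in its metadata, reconstructing the index column would amount to a single repeat over those counts, assuming rows are emitted in source order (which a multi-source read preserves). A NumPy sketch with illustrative counts:

```python
import numpy as np

# Hypothetical per-source row counts, as the reader metadata might
# report them (these values are illustrative).
row_counts = [2, 1, 3]

# Repeat each source index by its row count to obtain a per-row
# source-index column aligned with the concatenated dataframe.
source_index = np.repeat(np.arange(len(row_counts)), row_counts)
```

The cost is O(total rows) and involves no string data, which may address the memory concern raised above about materializing file-name columns.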

@mhaseeb123
Member

mhaseeb123 commented Jul 2, 2024

We have a similar request for the Parquet reader at #15389. We are thinking of adding a vector to table_metadata reporting the number of rows read from each data source, unless AST row-selection filters are being used, in which case an empty vector is returned due to the added computational overhead. @karthikeyann @shrshi
