[FEA] Add a column with filenames index in cudf.read_json #15960
Comments
Hi @miguelusque ,
Thank you @miguelusque for opening up a follow-on issue about this topic. @shrshi, you've had a lot of success in the multi-source improvements you added to #15930. Would you please share your thoughts about the scope for optional source index tracking as an item for future work?
Hi @brandon-b-miller, indeed, that would be a new feature not present in Pandas. Let me mention that when we discussed this feature internally, I think we agreed that the most efficient approach was to add only a column containing the indexes of the files passed in. I am happy with a more elaborate API, where you can decide whether to add the file names or the file name indexes. The minimum request from our side is to have at least the file name indexes, so that we can generate the column with the names ourselves.
@miguelusque could it also work to expose a metadata item that includes the row count per data source?
Hi Gregory, it would depend on how much it would cost, in terms of performance, to reconstruct the dataframe in the desired format. If there is an efficient way to do it from the metadata, that is fine for me.
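To make the metadata idea concrete: if the reader reported the number of rows read from each source, the per-row file index could be rebuilt by repeating each source position by its row count. Below is a minimal sketch in plain Python. The `rows_per_source` list is a hypothetical stand-in for such metadata, not an existing cudf field, and the helper name is made up for illustration. On real data a vectorized repeat (for example `numpy.repeat`) would be the natural choice instead of a Python loop.

```python
def source_index_from_counts(rows_per_source):
    """Expand per-source row counts into one source index per row."""
    index = []
    for source_idx, n_rows in enumerate(rows_per_source):
        # Each row read from this source gets the source's position.
        index.extend([source_idx] * n_rows)
    return index


# Hypothetical metadata: 3 rows from file 0, none from file 1, 2 from file 2.
rows_per_source = [3, 0, 2]
file_index = source_index_from_counts(rows_per_source)
print(file_index)  # [0, 0, 0, 2, 2]
```

Note that empty sources simply contribute no rows, so their index never appears in the reconstructed column.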
We have a similar request for Parquet reader at #15389. We are thinking of adding a vector to
Hi!
cudf.read_json supports passing multiple files to it, which is much more performant than reading JSON files individually and then merging them.

It would be very useful for certain workloads to add a column indicating, for each row in the dataset, the index of the file it came from among the files passed to cudf.read_json.

I would suggest adding a new input parameter, named something similar to input_file_indexes_series_name, with a default value of None. When populated with a string, it would indicate that the indexes of the input files passed to cudf.read_json should be added to a column with the name given in the input_file_indexes_series_name parameter.

Already discussed with @vuule.
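To illustrate the requested semantics, here is a rough sketch that emulates the behavior with the Python standard library on in-memory JSON Lines sources. It is not cudf code: the function `read_json_lines_with_source_index` is hypothetical, and only the parameter name `input_file_indexes_series_name` is taken from the suggestion above. The point is just to show what the extra column would contain.

```python
import io
import json


def read_json_lines_with_source_index(sources, input_file_indexes_series_name=None):
    """Read JSON Lines records from several sources into one list of rows.

    When input_file_indexes_series_name is a string, each row also gets a
    field with that name holding the position of its source in `sources`.
    """
    rows = []
    for source_idx, source in enumerate(sources):
        for line in source:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if input_file_indexes_series_name is not None:
                record[input_file_indexes_series_name] = source_idx
            rows.append(record)
    return rows


# Two in-memory "files" standing in for the paths passed to the reader.
sources = [
    io.StringIO('{"id": 1}\n{"id": 2}\n'),
    io.StringIO('{"id": 3}\n'),
]
rows = read_json_lines_with_source_index(sources, "file_idx")
for row in rows:
    print(row)
```

With the default of None the output is unchanged, so existing callers would not be affected. Mapping the index back to a file name is then a lookup into the list of inputs.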
Thanks!
P.S.: NeMo Curator may benefit from this FR when performing Exact Deduplication, Fuzzy Deduplication, and Download and Extract corpus features.
P.S.: This FR was originally raised here.