
[FEA] Parquet dataset creation/sanitation utility #447

Closed
rjzamora opened this issue Nov 20, 2020 · 2 comments · Fixed by #484

@rjzamora
Collaborator

NVTabular is designed with a specific type of dataset in mind. Ideally the dataset will have the following characteristics:

  1. Comprises one or more parquet files
  2. Each parquet file consists of row groups that are roughly 128 MB in size
  3. Each parquet file is large enough to map onto an entire dask_cudf.DataFrame partition. This typically means >= 1 GB.
  4. All parquet files are located within a "root" directory, and that directory contains a global "_metadata" file (a quick pyarrow check for characteristics 2-4 is sketched after this list).
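
A minimal sketch, assuming pyarrow is available, of how characteristics 2-4 might be checked for an existing dataset. The `root` path is hypothetical, and `total_byte_size` reports the uncompressed row-group size, so the ~128 MB target is only approximate:

```python
import glob
import os

import pyarrow.parquet as pq

root = "/datasets/my_dataset"  # hypothetical dataset root

# Characteristic 4: a global "_metadata" file in the root directory
print("Global _metadata present:", os.path.exists(os.path.join(root, "_metadata")))

# Characteristics 2-3: per-file size on disk and per-row-group size
for path in sorted(glob.glob(os.path.join(root, "*.parquet"))):
    md = pq.ParquetFile(path).metadata
    file_gb = os.path.getsize(path) / 1e9
    # NOTE: total_byte_size is the *uncompressed* row-group size
    rg_mb = [md.row_group(i).total_byte_size / 1e6 for i in range(md.num_row_groups)]
    print(f"{path}: {file_gb:.2f} GB on disk, row groups (MB): {[round(s) for s in rg_mb]}")
```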

My suggestion is to add a dedicated ensure_optimal_dataset utility to NVTabular (the name is not important). By default, this utility would just return suggestions to the user. For example, "Optimal file format is Parquet, please specify a path to output_directory to generate an optimal version of this dataset", or "Dataset is missing a _metadata file, please set replace_metadata_file=True to add one." However, as hinted in the suggestion examples, the utility should also be able to generate a fresh "optimized" parquet dataset.
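
To make the idea concrete, here is a rough, non-authoritative sketch of what the suggestion-mode behavior could look like. Only the function name and the `output_directory`/`replace_metadata_file` keywords come from the suggestion strings above; everything else is invented for illustration, and the actual rewrite path is omitted:

```python
import glob
import os


def ensure_optimal_dataset(path, output_directory=None, replace_metadata_file=False):
    """Return human-readable suggestions for the dataset at ``path``.

    Illustrative sketch only: the rewrite path (``output_directory``) and the
    row-group/file-size checks are left out.
    """
    suggestions = []

    files = glob.glob(os.path.join(path, "*"))
    if any(not f.endswith((".parquet", ".parq", "_metadata")) for f in files):
        suggestions.append(
            "Optimal file format is Parquet, please specify a path to "
            "output_directory to generate an optimal version of this dataset"
        )

    if not os.path.exists(os.path.join(path, "_metadata")):
        suggestions.append(
            "Dataset is missing a _metadata file, please set "
            "replace_metadata_file=True to add one"
        )

    return suggestions
```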

Notes:

  • The recent additions of dask#6851 and cudf#6796 mean that NVTabular will not need to implement its own logic to generate a "_metadata" file for an existing dataset (a pyarrow-only sketch of that step follows this list). However, since dask.dataframe/dask_cudf.to_parquet does not support an option to specify the desired size of each output file, NVTabular would need to implement its own logic to map input dataset files/chunks onto output parquet files.
  • An input Dataset may be impossible to process in GPU memory (e.g. giant compressed csv files, or giant single-row-group parquet files). Therefore, the utility will likely need a CPU-memory path.
  • It should be possible to support hive-partitioned datasets with a utility like this.
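
For reference, generating a global "_metadata" file for an existing dataset can already be done with pyarrow alone, roughly along these lines. The dataset root is hypothetical and all files are assumed to share the same schema:

```python
import glob
import os

import pyarrow.parquet as pq

root = "/datasets/my_dataset"  # hypothetical dataset root
paths = sorted(glob.glob(os.path.join(root, "*.parquet")))

# Collect the footer metadata of every file, recording file paths relative
# to the dataset root so readers can locate the row groups.
metadata_collector = []
for path in paths:
    md = pq.ParquetFile(path).metadata
    md.set_file_path(os.path.relpath(path, root))
    metadata_collector.append(md)

# Write the aggregated footers as a global _metadata file.
schema = pq.ParquetFile(paths[0]).schema_arrow
pq.write_metadata(
    schema, os.path.join(root, "_metadata"), metadata_collector=metadata_collector
)
```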
@gabrielspmoreira
Member

+1 @rjzamora. In cases where the _metadata file is not available, it would be super helpful to have the output of your schema-checking script, because it quickly identifies which input parquet files and columns might have a different schema.

Regarding the input dataset, is there a way to improve performance in scenarios with many relatively small parquet files as input (e.g. one parquet file for each day, each about 100 MB)?

@rjzamora
Collaborator Author

> Regarding the input dataset, is there a way to improve performance in scenarios with many relatively small parquet files as input (e.g. one parquet file for each day, each about 100 MB)?

Yes and no. It is certainly possible to handle this efficiently in dask (especially with a cudf backend), but the upstream dask.dataframe implementation of read_parquet (used by NVTabular/dask_cudf) does not have an option for this. Handling this case has been at the very top of my "TODO" list for a few months, but has been blocked by dask#6534 (which I'm really hoping to get merged asap).
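
In the meantime, a rough workaround (not the planned upstream fix) is a one-time rewrite of the dataset into larger files. The paths below are hypothetical, and `repartition(partition_size=...)` has to estimate per-partition memory usage, so it is not free:

```python
import dask_cudf

# Many ~100 MB daily files become one partition each when read, so
# repartition into ~1 GB chunks and rewrite them as larger files
# (with a global _metadata file) for subsequent NVTabular runs.
ddf = dask_cudf.read_parquet("/datasets/daily_parquet/*.parquet")
ddf = ddf.repartition(partition_size="1GB")
ddf.to_parquet("/datasets/daily_parquet_1gb", write_metadata_file=True)
```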
