
[FEA] Parquet dataset creation/sanitation utility #447

Closed
rjzamora opened this issue Nov 20, 2020 · 2 comments · Fixed by #484

@rjzamora
Collaborator

NVTabular is designed with a specific type of dataset in mind. Ideally the dataset will have the following characteristics:

  1. Comprises one or more parquet files
  2. Each parquet file consists of row groups that are roughly 128 MB in size
  3. Each parquet file is large enough to map onto an entire dask_cudf.DataFrame partition. This typically means >= 1 GB.
  4. All parquet files are located within a "root" directory, and that directory contains a global "_metadata" file (a quick pyarrow check for characteristics 2-4 is sketched after this list).
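
A minimal sketch, assuming pyarrow is available, of how characteristics 2-4 might be checked for an existing dataset. The `root` path is hypothetical, and `total_byte_size` reports the uncompressed row-group size, so the ~128 MB target is only approximate:

```python
import glob
import os

import pyarrow.parquet as pq

root = "/datasets/my_dataset"  # hypothetical dataset root

# Characteristic 4: a global "_metadata" file in the root directory
print("Global _metadata present:", os.path.exists(os.path.join(root, "_metadata")))

# Characteristics 2-3: per-file size on disk and per-row-group size
for path in sorted(glob.glob(os.path.join(root, "*.parquet"))):
    md = pq.ParquetFile(path).metadata
    file_gb = os.path.getsize(path) / 1e9
    # NOTE: total_byte_size is the *uncompressed* row-group size
    rg_mb = [md.row_group(i).total_byte_size / 1e6 for i in range(md.num_row_groups)]
    print(f"{path}: {file_gb:.2f} GB on disk, row groups (MB): {[round(s) for s in rg_mb]}")
```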

My suggestion is to add a dedicated ensure_optimal_dataset utility to NVTabular (the name is not important). By default, this utility would just return suggestions to the user. For example, "Optimal file format is Parquet, please specify a path to output_directory to generate an optimal version of this dataset", or "Dataset is missing a _metadata file, please set replace_metadata_file=True to add one." However, as hinted in the suggestion examples, the utility should also be able to generate a fresh "optimized" parquet dataset.
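
To make the idea concrete, here is a rough, non-authoritative sketch of what the suggestion-mode behavior could look like. Only the function name and the `output_directory`/`replace_metadata_file` keywords come from the suggestion strings above; everything else is invented for illustration, and the actual rewrite path is omitted:

```python
import glob
import os


def ensure_optimal_dataset(path, output_directory=None, replace_metadata_file=False):
    """Return human-readable suggestions for the dataset at ``path``.

    Illustrative sketch only: the rewrite path (``output_directory``) and the
    row-group/file-size checks are left out.
    """
    suggestions = []

    files = glob.glob(os.path.join(path, "*"))
    if any(not f.endswith((".parquet", ".parq", "_metadata")) for f in files):
        suggestions.append(
            "Optimal file format is Parquet, please specify a path to "
            "output_directory to generate an optimal version of this dataset"
        )

    if not os.path.exists(os.path.join(path, "_metadata")):
        suggestions.append(
            "Dataset is missing a _metadata file, please set "
            "replace_metadata_file=True to add one"
        )

    return suggestions
```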

Notes:

  • The recent additions of dask#6851 and cudf#6796 mean that NVTabular will not need to implement its own logic to generate a "_metadata" file for an existing dataset (a pyarrow-only sketch of that step follows this list). However, since dask.dataframe/dask_cudf.to_parquet does not support an option to specify the desired size of each output file, NVTabular would need to implement its own logic to map input dataset files/chunks onto output parquet files.
  • An input Dataset may be impossible to process in GPU memory (e.g. giant compressed csv files, or giant single-row-group parquet files). Therefore, the utility will likely need a CPU-memory path.
  • It should be possible to support hive-partitioned datasets with a utility like this.
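
For reference, generating a global "_metadata" file for an existing dataset can already be done with pyarrow alone, roughly along these lines. The dataset root is hypothetical and all files are assumed to share the same schema:

```python
import glob
import os

import pyarrow.parquet as pq

root = "/datasets/my_dataset"  # hypothetical dataset root
paths = sorted(glob.glob(os.path.join(root, "*.parquet")))

# Collect the footer metadata of every file, recording file paths relative
# to the dataset root so readers can locate the row groups.
metadata_collector = []
for path in paths:
    md = pq.ParquetFile(path).metadata
    md.set_file_path(os.path.relpath(path, root))
    metadata_collector.append(md)

# Write the aggregated footers as a global _metadata file.
schema = pq.ParquetFile(paths[0]).schema_arrow
pq.write_metadata(
    schema, os.path.join(root, "_metadata"), metadata_collector=metadata_collector
)
```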
@gabrielspmoreira
Member

+1 @rjzamora. In cases where the _metadata file is not available, it would be super helpful to have the output of your schema-checking script, because it quickly identifies which input parquet files and columns might have a different schema.

Regarding the input dataset, is there a way to improve performance in scenarios with many relatively small parquet files as input (e.g. one parquet file for each day, each about 100 MB)?

@rjzamora
Collaborator Author

> Regarding the input dataset, is there a way to improve performance in scenarios with many relatively small parquet files as input (e.g. one parquet file for each day, each about 100 MB)?

Yes and no. It is certainly possible to handle this efficiently in dask (especially with a cudf backend), but the upstream dask.dataframe implementation of read_parquet (used by NVTabular/dask_cudf) does not have an option for this. Handling this case has been at the very top of my "TODO" list for a few months, but has been blocked by dask#6534 (which I'm really hoping to get merged asap).
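
In the meantime, a rough workaround (not the planned upstream fix) is a one-time rewrite of the dataset into larger files. The paths below are hypothetical, and `repartition(partition_size=...)` has to estimate per-partition memory usage, so it is not free:

```python
import dask_cudf

# Many ~100 MB daily files become one partition each when read, so
# repartition into ~1 GB chunks and rewrite them as larger files
# (with a global _metadata file) for subsequent NVTabular runs.
ddf = dask_cudf.read_parquet("/datasets/daily_parquet/*.parquet")
ddf = ddf.repartition(partition_size="1GB")
ddf.to_parquet("/datasets/daily_parquet_1gb", write_metadata_file=True)
```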
