NVTabular is designed with a specific type of dataset in mind. Ideally the dataset will have the following characteristics:
Comprises 1+ parquet files
Each parquet file consists of row-groups around 128MB in size
Each parquet file is large enough to map onto an entire dask_cudf.DataFrame partition. This typically means >=1GB.
All parquet files should be located within a "root" directory, and that directory should contain a global "_metadata" file.
Note: This "_metadata" file allows the dask_cudf client to produce a DataFrame collection much faster, because all metadata can be accessed from a single file. When this file is not present, the client needs to aggregate footer metadata from all files in the dataset. Having this file also avoids the dreaded "inconsistent schema" error described in #429 ([BUG] "Schemas are inconsistent" error for parquet files which have the same dtype, but are different in not null setting), because that error only occurs when metadata is being aggregated manually (i.e. when "_metadata" doesn't exist).
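For context, generating such a global "_metadata" file amounts to aggregating the footer metadata of every file and writing it out once. A minimal sketch with plain pyarrow (assuming a flat directory of parquet files and a recent pyarrow that provides FileMetaData.write_metadata_file):

```python
import os
import pyarrow.parquet as pq

root = "/path/to/dataset"  # assumed flat "root" directory of parquet files
paths = sorted(
    os.path.join(root, f) for f in os.listdir(root) if f.endswith(".parquet")
)

# Collect footer metadata from every file, recording each file's path
# relative to the root so readers can locate its row-groups.
metadata = None
for path in paths:
    md = pq.read_metadata(path)
    md.set_file_path(os.path.relpath(path, root))
    if metadata is None:
        metadata = md
    else:
        metadata.append_row_groups(md)

# Write the aggregated footers as a global "_metadata" file.
metadata.write_metadata_file(os.path.join(root, "_metadata"))
```

Note that append_row_groups refuses to merge files with mismatched schemas, so this surfaces the same inconsistency behind #429 up front rather than at read time.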
My suggestion is to add a dedicated ensure_optimal_dataset utility to NVTabular (the name is not important). By default, this utility would just return suggestions to the user. For example, "Optimal file format is Parquet, please specify a path to output_directory to generate an optimal version of this dataset", or "Dataset is missing a _metadata file, please set replace_metadata_file=True to add one." However, as hinted in the suggestion examples, the utility should also be able to generate a fresh "optimized" parquet dataset.
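To make the proposal a bit more concrete, the interface could look roughly like the sketch below. Everything here is hypothetical (names, defaults, and the checks are placeholders for discussion, not an existing NVTabular API), and only the suggestion-reporting path is filled in:

```python
import os
from typing import List, Optional


def ensure_optimal_dataset(
    path: str,
    output_directory: Optional[str] = None,  # hypothetical: write an optimized copy here
    replace_metadata_file: bool = False,      # hypothetical: (re)generate "_metadata" in place
) -> List[str]:
    """Inspect a dataset and return human-readable suggestions.

    By default nothing is modified; the optional flags would trigger the
    corresponding fixes (rewriting the dataset, adding "_metadata").
    """
    suggestions = []
    files = os.listdir(path) if os.path.isdir(path) else [os.path.basename(path)]

    if not any(f.endswith(".parquet") or f.endswith(".parq") for f in files):
        suggestions.append(
            "Optimal file format is Parquet, please specify a path to "
            "output_directory to generate an optimal version of this dataset"
        )
    elif "_metadata" not in files:
        suggestions.append(
            "Dataset is missing a _metadata file, please set "
            "replace_metadata_file=True to add one."
        )

    # A real implementation would also check row-group and file sizes,
    # and perform the requested conversions when the flags are set.
    return suggestions
```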
Notes:
The recent addition of dask#6851 and cudf#6796 means that NVTabular will not need to implement its own logic to generate a "_metadata" file for an existing dataset. However, since dask.dataframe/dask_cudf.to_parquet does not support an option to specify the desired size of each output file, NVTabular would need to implement its own logic to map input dataset files/chunks onto output parquet files (a rough workaround sketch follows these notes).
It is always possible that an input Dataset cannot be processed in GPU memory (e.g. giant compressed csv files, or giant single-row-group parquet files). Therefore, the utility will likely need a CPU-memory path.
It should be possible to support hive-partitioned datasets with a utility like this.
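As a rough illustration of the first note above, output file size can already be controlled indirectly by repartitioning before the write, since each dask partition maps onto one output file. This is only a workaround sketch, assuming a recent dask/dask_cudf where repartition(partition_size=...) and write_metadata_file are available; it is not the mapping logic NVTabular would actually ship:

```python
import dask_cudf

# Read the existing (suboptimal) parquet dataset.
ddf = dask_cudf.read_parquet("/path/to/input_dataset")

# Repartition so that each partition (and therefore each output file)
# is roughly 1GB. Size-based repartitioning may trigger extra work to
# estimate partition sizes.
ddf = ddf.repartition(partition_size="1GB")

# Write one parquet file per partition, plus a global "_metadata" file.
ddf.to_parquet("/path/to/output_dataset", write_metadata_file=True)
```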
+1 @rjzamora. In cases where the _metadata file is not available, it would be super helpful to have the output of your schema-checking script, because it helps to quickly identify which input parquet files and columns might have a different schema.
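Until something like that exists, a quick-and-dirty check along the following lines can already point at the offending files and columns. This is just a hedged pyarrow sketch, not the script referenced above, and it simply treats the first file's schema as the reference:

```python
import glob
import pyarrow.parquet as pq

paths = sorted(glob.glob("/path/to/dataset/*.parquet"))

# Compare every file's schema (including nullability) against the first one.
reference = pq.read_schema(paths[0])
for path in paths[1:]:
    schema = pq.read_schema(path)
    if not schema.equals(reference):
        mismatched = [
            name
            for name in reference.names
            if name not in schema.names
            or not schema.field(name).equals(reference.field(name))
        ]
        print(f"{path}: schema differs from {paths[0]} (columns: {mismatched})")
```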
Regarding the input dataset, is there a way to improve performance in scenarios with many relatively small parquet files as input (e.g. one parquet file for each day, of about 100 MB)?
> Regarding the input dataset, is there a way to improve performance in scenarios with many relatively small parquet files as input (e.g. one parquet file for each day, of about 100 MB)?
Yes and no. It is certainly possible to handle this efficiently in dask (especially with a cudf backend), but the upstream dask.dataframe implementation of read_parquet (used by NVTabular/dask_cudf) does not have an option for this. Handling this case has been at the very top of my "TODO" list for a few months, but has been blocked by dask#6534 (which I'm really hoping to get merged asap).