Is your feature request related to a problem? Please describe.
There are scenarios where we might want the output parquet dataset to be partitioned by a column after preprocessing with NVTabular.
For example, it is common for Recommender Systems to have user interactions split by day, so that incremental model training and evaluation can be run day by day. Today, however, the parquet files generated by NVTabular are shuffled, so a temporal train/test split is not possible after NVT preprocessing.
Describe the solution you'd like
It would be nice if we could set a partition column in the workflow, which would be used to partition the output parquet files. For example, if I select the column Day as the partition column, the output parquet files would be split into folders like (Day=2020-02-01, Day=2020-02-02, ...). The selected column could be a raw column in the input parquet, or a column generated in the workflow (e.g. the Month, which can be extracted with an NVT LambdaOp).
Additional context
According to @rjzamora, "NVTabular doesn't really support this right now, but it seems very doable. The easiest case to support is unshuffled output, where we can just use dask_cudf.to_parquet. If we want to support nvt-style shuffling, the partition_on code path will require a bit more work."