Is your feature request related to a problem? Please describe.
There are scenarios where we might want the output parquet dataset to be partitioned by a column after preprocessing with NVTabular.
For example, it is common for Recommender Systems to have user interactions split by day, so that incremental model training and evaluation can be run day by day. Today, however, the parquet files generated by NVTabular are shuffled, so a temporal train/test split is not possible after NVT preprocessing.
Describe the solution you'd like
It would be nice if we could set a partition column in the workflow, which would be used to partition the output parquet files. For example, if I select the column Day as the partition column, the output parquet files would be split into folders like (Day=2020-02-01, Day=2020-02-02, ...). The selected column could be a raw column in the input parquet, or a column generated in the workflow (e.g. the Month, which can be extracted with an NVT LambdaOp).
Additional context
According to @rjzamora, "NVTabular doesn't really support this right now, but it seems very doable. The easiest case to support is unshuffled output, where we can just use dask_cudf.to_parquet. If we want to support nvt-style shuffling, the partition_on code path will require a bit more work."