[FEA] Partition the output parquet files by a column #431

Closed
gabrielspmoreira opened this issue Nov 13, 2020 · 1 comment

@gabrielspmoreira (Member)

Is your feature request related to a problem? Please describe.
There are scenarios where we might want the output parquet dataset to be partitioned by a column, after preprocessing with NVTabular.
For example, it is common for recommender systems to have user interactions split by day, so that incremental model training and evaluation can be run day by day. But today the parquet files generated by NVTabular are shuffled, and it is not possible to perform a temporal train/test split after NVT preprocessing.

Describe the solution you'd like

It would be nice if we could set a partition column in the workflow, which would be used to partition the output parquet files. For example, if I select the column Day as the partition column, the output parquet files would be split into folders such as Day=2020-02-01, Day=2020-02-02, and so on. The selected column could be a raw column in the input parquet, or a column generated in the workflow (e.g. the Month, which can be extracted with an NVT LambdaOp), as in the sketch below.
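As a rough sketch of what deriving such a partition column could look like (using the newer operator-style API; the column names, paths, and exact LambdaOp signature are illustrative and vary across NVTabular releases):

```python
import nvtabular as nvt

# Hypothetical raw "timestamp" column in the input parquet files;
# derive a month feature to partition on, as described above.
month = ["timestamp"] >> nvt.ops.LambdaOp(lambda col: col.dt.month)

workflow = nvt.Workflow(month)
dataset = nvt.Dataset("input/*.parquet")

# Today the transformed output is written shuffled/unpartitioned;
# the request is for a partition-column option at this step.
workflow.fit_transform(dataset).to_parquet("output/")
```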

Additional context
According to @rjzamora, "NVTabular doesn't really support this right now, but it seems very doable. The easiest case to support is unshuffled output, where we can just use dask_cudf.to_parquet. If we want to support nvt-style shuffling, the partition_on code path will require a bit more work."
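A minimal sketch of the unshuffled path mentioned in the quote, writing Hive-style partition directories directly with dask_cudf (the paths and the "Day" column are illustrative):

```python
import dask_cudf

# Read the already-preprocessed parquet data.
ddf = dask_cudf.read_parquet("preprocessed/*.parquet")

# partition_on writes one directory per unique value of the column,
# e.g. output/Day=2020-02-01/..., output/Day=2020-02-02/...
ddf.to_parquet("output/", partition_on=["Day"])
```

Exposing this behind NVTabular's own output/shuffle options is what this issue is asking for.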
