
[FEA] Partition output parquet files by a column #642

Closed
gabrielspmoreira opened this issue Mar 8, 2021 · 0 comments · Fixed by #677


gabrielspmoreira commented Mar 8, 2021

Is your feature request related to a problem? Please describe.
For a realistic temporal evaluation protocol, it is common to split user interactions/sessions by time windows (e.g., hours, days, weeks, months), so that models can be trained on one time window and evaluated on a future one.

So, NVTabular should be able to export parquet files partitioned by a column (which can be a feature extracted from a timestamp column by a LambdaOp, such as hour, day, week, or month).

Describe the solution you'd like
When the workflow exports the parquet files, they should be organized in folders named after the partition column values (e.g., "interaction_date=2021-03-05").
That is, each folder should contain parquet files whose rows all share the partition-column value given in the folder name.

workflow.transform(dataset).to_parquet('base/path', partition_by='interaction_date')

Additional context
With PySpark, this can be accomplished as follows:

interactions_df.write.partitionBy('session_start_date') \
                     .parquet('base/path')

P.S. This issue was extracted from #355, which is broader in scope, so that it can be implemented independently.
