
[FEA] Partition output parquet files by a column #642

Closed
gabrielspmoreira opened this issue Mar 8, 2021 · 0 comments · Fixed by #677


gabrielspmoreira commented Mar 8, 2021

Is your feature request related to a problem? Please describe.
For a realistic temporal evaluation protocol, it is common to split user interactions/sessions by time windows (e.g., hours, days, weeks, months), so that models can be trained on one time window and evaluated on a future one.

So, NVTabular should be able to export parquet files partitioned by a column (which can be a feature extracted from a timestamp column by a LambdaOp, such as hour, day, week, or month).

Describe the solution you'd like
When the workflow exports the parquet files, they should be organized in folders named after the partition column values (e.g., "interaction_date=2021-03-05").
That is, each folder should contain parquet files whose rows all share the partition-column value given in the folder name.

workflow.transform(dataset).to_parquet('base/path', partition_by='interaction_date')

Additional context
With PySpark, this can be accomplished as follows:

interactions_df.write.partitionBy('session_start_date') \
                     .parquet('base/path')

P.S. This issue was extracted from #355, which is broader in scope, so that it can be implemented independently.
