[FEA] Truncate List columns (sparse tensors) - related to the GroupBy op #734

gabrielspmoreira · 2021-04-13T22:37:45Z

Is your feature request related to a problem? Please describe.
This feature is related to #641, "Sequential / Session-based recommendation and time series support - Group by sorting values by timestamp".
After grouping, some sequences (e.g. user sessions or time series) might be very long, while some ML models require sequences with a fixed maximum length, so list truncation is necessary.
Because list columns are currently represented internally as sparse vectors, it is not possible to use a LambdaOp to truncate the list values to a maximum length.

Describe the solution you'd like
I would like either an option on the Groupby op to truncate all aggregated list columns to the same maximum length, or an independent TruncateList op that truncates selected list columns.

Describe alternatives you've considered
As a workaround, I am extending the NVT PyTorch dataloader: converting the internal representation of list columns to PyTorch sparse tensors (as shown in #500), converting those to dense tensors (padded with zeros on the right), and then slicing the second dimension of the tensor to the maximum length.
But storing longer sequences than needed in the parquet files wastes space and requires workarounds like this on the model side.
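
For illustration, the pad-then-slice workaround can be sketched in plain Python, without PyTorch. This is only a rough sketch: the function name `pad_and_slice`, the offsets/values input layout, and the `pad_value` parameter are assumptions made for the example, not the dataloader's actual API.

```python
from typing import List, Sequence


def pad_and_slice(
    offsets: Sequence[int],
    values: Sequence[int],
    max_len: int,
    pad_value: int = 0,
) -> List[List[int]]:
    """Densify a ragged list column, then keep the first max_len columns.

    Hypothetical helper: row i is values[offsets[i]:offsets[i + 1]].
    """
    # Recover the ragged rows from the flat values buffer.
    rows = [list(values[s:e]) for s, e in zip(offsets[:-1], offsets[1:])]
    # Pad every row on the right up to the longest row (the "dense tensor").
    width = max((len(r) for r in rows), default=0)
    dense = [r + [pad_value] * (width - len(r)) for r in rows]
    # Slice the second dimension down to the maximum length.
    return [r[:max_len] for r in dense]


demo_offsets = [0, 3, 4, 4, 9]
demo_values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
dense_truncated = pad_and_slice(demo_offsets, demo_values, max_len=2)
```

Note that the full-length sequences still have to be stored and loaded before they are cut, which is the wasted space described above.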

benfred · 2021-05-04T17:35:11Z

This is similar to string slicing with string columns - we really need to have list slicing for list columns.

We should prototype this with the offset/values representation and follow up with the cudf team.
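
A minimal sketch of what such a prototype might look like, assuming a list column is stored as a flat `values` buffer plus an `offsets` array in which row i spans `values[offsets[i]:offsets[i + 1]]`. The function name `truncate_offsets_values` and the sign convention (positive keeps the first N items, negative keeps the last N, as in Python slicing) are assumptions for illustration only. It uses plain Python lists rather than GPU arrays.

```python
from typing import List, Sequence, Tuple


def truncate_offsets_values(
    offsets: Sequence[int],
    values: Sequence[int],
    max_len: int,
) -> Tuple[List[int], List[int]]:
    """Truncate every row of an offsets/values list column.

    Hypothetical helper: max_len >= 0 keeps the first max_len items of
    each row, max_len < 0 keeps the last -max_len items.
    """
    new_offsets = [0]
    new_values: List[int] = []
    for start, end in zip(offsets[:-1], offsets[1:]):
        row = list(values[start:end])
        kept = row[:max_len] if max_len >= 0 else row[max_len:]
        new_values.extend(kept)
        # Each new offset is the running length of the output buffer.
        new_offsets.append(len(new_values))
    return new_offsets, new_values


# Rows: [0, 1, 2], [3], [], [4, 5, 6, 7, 8]
col_offsets = [0, 3, 4, 4, 9]
col_values = list(range(9))
first_two = truncate_offsets_values(col_offsets, col_values, 2)
last_two = truncate_offsets_values(col_offsets, col_values, -2)
```

Unlike the dense workaround, this never pads, so short and empty rows stay short and the output is still a valid offsets/values pair.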

gabrielspmoreira · 2021-05-04T17:40:29Z

It should be possible to truncate either the start (positive number) or the end of the sessions (negative number).

rjzamora · 2021-05-04T17:45:19Z

A first-order solution is probably to use the list.take method for cudf list columns. That is, you can just make a column of list indices, and call take to get what you want.

It should be possible to get better performance with either a cudf primitive or with other custom solutions, but a quick/simple solution may be a good start.
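
To make the "column of list indices" idea concrete, here is a rough CPU-only emulation in plain Python. The helper names `build_take_indices` and `take_rows` are hypothetical, and the sign convention (positive keeps the first N, negative keeps the last N) is an assumption. With cudf, the per-row index lists would presumably become a list column passed to `list.take` instead of the `take_rows` stand-in.

```python
from typing import List, Sequence


def build_take_indices(row_lengths: Sequence[int], max_len: int) -> List[List[int]]:
    """Build, for each row, the positions to keep.

    Hypothetical helper: max_len >= 0 selects the first max_len
    positions, max_len < 0 selects the last -max_len positions.
    """
    indices = []
    for length in row_lengths:
        if max_len >= 0:
            indices.append(list(range(min(length, max_len))))
        else:
            start = max(length + max_len, 0)
            indices.append(list(range(start, length)))
    return indices


def take_rows(rows: Sequence[Sequence[int]], indices: Sequence[Sequence[int]]) -> List[List[int]]:
    """Stand-in for a per-row gather: pick rows[i][j] for each j in indices[i]."""
    return [[row[j] for j in idx] for row, idx in zip(rows, indices)]


sessions = [[10, 20, 30], [40], [], [50, 60, 70, 80]]
lengths = [len(s) for s in sessions]
head = take_rows(sessions, build_take_indices(lengths, 2))
tail = take_rows(sessions, build_take_indices(lengths, -2))
```

The cost of this approach is materializing an index list per row, which is one reason a dedicated primitive could perform better.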

gabrielspmoreira mentioned this issue Apr 13, 2021

[FEA] Sequential / Session-based recommendation and time series support - Group by sorting values by timestamp #641

Closed

benfred added the session-based label Apr 15, 2021

viswa-nvidia added this to the NVTabular v0.6 milestone Apr 26, 2021

benfred added the P0 label May 4, 2021

rjzamora mentioned this issue May 4, 2021

[FEA] Add window functions #740

Open

benfred self-assigned this May 5, 2021

benfred added a commit to benfred/NVTabular that referenced this issue May 11, 2021

Add a list slicing op

fc68729

This adds an operator to slice rows of list columns. This will let us truncate list column rows to only take the first N or last N items for instance. Closes NVIDIA-Merlin#734

benfred mentioned this issue May 11, 2021

Add a list slicing op #803

Merged

benfred closed this as completed in #803 May 15, 2021

benfred added a commit that referenced this issue May 15, 2021

Add a list slicing op (#803)

f298079

This adds an operator to slice rows of list columns. This will let us truncate list column rows to only take the first N or last N items for instance. Closes #734

mikemckiernan pushed a commit that referenced this issue Nov 24, 2022

Add a list slicing op (#803)

1d54457

This adds an operator to slice rows of list columns. This will let us truncate list column rows to only take the first N or last N items for instance. Closes #734
