[FEA] Truncate List columns (sparse tensors) - related to the GroupBy op #734

Closed
gabrielspmoreira opened this issue Apr 13, 2021 · 3 comments · Fixed by #803

@gabrielspmoreira
Member

Is your feature request related to a problem? Please describe.
This feature is related to #641 "Sequential / Session-based recommendation and time series support - Group by sorting values by timestamp".
After grouping, some sequences (e.g., user sessions or time series) might be very long, and some ML models require sequences with a fixed maximum length, so list truncation is necessary.
Because list columns are currently represented internally as sparse vectors, it is not possible to use a LambdaOp to truncate the list values to a maximum length.

Describe the solution you'd like
I would like either an option on the Groupby op to truncate all aggregated list columns to the same maximum length, or an independent TruncateList op that would truncate selected list columns.

Describe alternatives you've considered
As a workaround, I am extending the NVT PyTorch dataloader: I convert the internal representation of list columns to PyTorch sparse tensors (as shown in #500), convert those to dense tensors (padded with zeros on the right), and then slice the second dimension of the tensor to the maximum length.
But storing longer sequences than needed in the parquet files wastes space and requires workarounds like this on the model side.
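
For illustration, the pad-then-slice workaround can be sketched roughly as follows. This is not the actual dataloader code: it uses NumPy instead of PyTorch, and it assumes the list column is available as a flat `values` array plus an `offsets` array that starts at 0. The helper name `pad_and_truncate` is hypothetical.

```python
import numpy as np


def pad_and_truncate(values, offsets, max_len, pad_value=0):
    """Densify a (values, offsets) list column with right padding, then cut to max_len.

    Assumes offsets[0] == 0 and offsets[-1] == len(values).
    """
    values = np.asarray(values)
    offsets = np.asarray(offsets)
    lengths = np.diff(offsets)
    n_rows = len(lengths)
    width = int(lengths.max()) if n_rows else 0

    # Dense matrix as wide as the longest row, filled with the padding value
    dense = np.full((n_rows, width), pad_value, dtype=values.dtype)

    # Row index and within-row position of every stored value
    rows = np.repeat(np.arange(n_rows), lengths)
    cols = np.arange(len(values)) - np.repeat(offsets[:-1], lengths)
    dense[rows, cols] = values

    # Slice the second dimension down to the maximum length
    return dense[:, :max_len]


# Rows: [1, 2, 3], [4], [5, 6, 7, 8, 9]
dense = pad_and_truncate([1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 3, 4, 9], max_len=4)
```

Note that the full-width dense matrix is built before slicing, which is the extra memory cost this request is trying to avoid.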

@benfred
Member

benfred commented May 4, 2021

This is similar to string slicing with string columns - we really need to have list slicing for list columns.

We should prototype this with the offset/values representation and follow up with the cudf team.
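
A minimal sketch of what such an offsets/values prototype could look like, written in NumPy so it runs without a GPU. The function name `slice_list_column` is hypothetical, and the start/end handling follows Python slice conventions, which is an assumption rather than a decision made in this thread.

```python
import numpy as np


def slice_list_column(values, offsets, start, end=None):
    """Slice every row of a (values, offsets) list column like row[start:end].

    Assumes offsets[0] == 0 and offsets[-1] == len(values).
    Returns the new (values, offsets) pair.
    """
    values = np.asarray(values)
    offsets = np.asarray(offsets)
    row_starts = offsets[:-1]
    lengths = np.diff(offsets)

    def resolve(bound, default):
        # Turn a scalar slice bound into a per-row position clipped to [0, length]
        if bound is None:
            return default
        pos = lengths + bound if bound < 0 else np.full_like(lengths, bound)
        return np.clip(pos, 0, lengths)

    lo = resolve(start, np.zeros_like(lengths))
    hi = resolve(end, lengths)
    new_lengths = np.maximum(hi - lo, 0)
    new_offsets = np.concatenate([[0], np.cumsum(new_lengths)])

    # Source position in the old values array for every kept element
    within_row = np.arange(new_offsets[-1]) - np.repeat(new_offsets[:-1], new_lengths)
    src = np.repeat(row_starts + lo, new_lengths) + within_row
    return values[src], new_offsets


# Rows: [1, 2, 3], [4], [5, 6, 7, 8, 9]
vals = [1, 2, 3, 4, 5, 6, 7, 8, 9]
offs = [0, 3, 4, 9]
first_two = slice_list_column(vals, offs, 0, 2)   # keep the first 2 items per row
last_two = slice_list_column(vals, offs, -2)      # keep the last 2 items per row
```

Working directly on offsets and values avoids materializing a padded dense matrix, and rows shorter than the requested length are simply left as they are.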

@gabrielspmoreira
Member Author

It should be possible to truncate either the start (positive number) or the end (negative number) of the sessions.

@rjzamora
Collaborator

rjzamora commented May 4, 2021

A first-order solution is probably to use the list.take method for cudf list columns. That is, you can just make a column of list indices, and call take to get what you want.

It should be possible to get better performance with either a cudf primitive or with other custom solutions, but a quick/simple solution may be a good start.
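
To make the index-column idea concrete, here is a rough pure-Python sketch. `take_indices` and `take` are hypothetical helpers: `take` only stands in for a per-row gather such as the cudf `list.take` method mentioned above, so the example runs without a GPU.

```python
def take_indices(lengths, n):
    """Per-row index lists selecting the first n items (n >= 0) or the last |n| items (n < 0)."""
    out = []
    for length in lengths:
        if n >= 0:
            out.append(list(range(min(n, length))))
        else:
            out.append(list(range(max(length + n, 0), length)))
    return out


def take(rows, indices):
    """Pure-Python stand-in for a per-row gather on a list column."""
    return [[row[i] for i in idx] for row, idx in zip(rows, indices)]


rows = [[1, 2, 3], [4], [5, 6, 7, 8, 9]]
lengths = [len(row) for row in rows]

head = take(rows, take_indices(lengths, 2))    # first 2 items of each row
tail = take(rows, take_indices(lengths, -2))   # last 2 items of each row
```

With cudf, the same index lists would be built as a list column (one index list per row) and passed to `list.take`. The cost of materializing that index column is one reason a dedicated primitive could perform better.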

@benfred benfred added the P0 label May 4, 2021
@benfred benfred self-assigned this May 5, 2021
benfred added a commit to benfred/NVTabular that referenced this issue May 11, 2021
This adds an operator to slice rows of list columns. This will let us truncate list
column rows to only take the first N or last N items for instance.

Closes NVIDIA-Merlin#734
benfred added a commit that referenced this issue May 15, 2021
This adds an operator to slice rows of list columns. This will let us truncate list
column rows to only take the first N or last N items for instance.

Closes #734
mikemckiernan pushed a commit that referenced this issue Nov 24, 2022
This adds an operator to slice rows of list columns. This will let us truncate list
column rows to only take the first N or last N items for instance.

Closes #734
4 participants