[FEA] Truncate List columns (sparse tensors) - related to the GroupBy op #734
Comments
This is similar to string slicing with string columns: we really need list slicing for list columns. We should prototype this with the offsets/values representation and follow up with the cudf team.
It should be possible to truncate either the start (positive number) or the end (negative number) of the sessions.
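As a rough illustration of prototyping on offsets/values, here is a minimal NumPy sketch. It is not the cudf or NVTabular implementation: the function name `truncate_lists` is hypothetical, and the sign convention (positive `n` keeps the first `n` items of each row, negative `n` keeps the last `|n|`) is an assumption chosen for this example.

```python
import numpy as np

def truncate_lists(offsets, values, n):
    """Hypothetical helper: truncate each row of an offsets/values list column.

    Assumed convention: n > 0 keeps the first n items, n < 0 keeps the last |n|.
    """
    if n == 0:
        raise ValueError("n must be non-zero")
    offsets = np.asarray(offsets)
    values = np.asarray(values)
    starts, ends = offsets[:-1], offsets[1:]
    # Each row keeps at most |n| items.
    new_lengths = np.minimum(ends - starts, abs(n))
    # Keep the head of each row for n > 0, the tail for n < 0.
    new_starts = starts if n > 0 else ends - new_lengths
    new_offsets = np.concatenate(([0], np.cumsum(new_lengths)))
    # Position of every kept item inside its own row: 0, 1, ..., length - 1.
    within_row = np.arange(new_offsets[-1]) - np.repeat(new_offsets[:-1], new_lengths)
    keep = np.repeat(new_starts, new_lengths) + within_row
    return new_offsets, values[keep]

# Three rows: [10, 11, 12], [20], [30, 31, 32, 33, 34]
offsets = [0, 3, 4, 9]
values = [10, 11, 12, 20, 30, 31, 32, 33, 34]
head_offsets, head_values = truncate_lists(offsets, values, 2)
tail_offsets, tail_values = truncate_lists(offsets, values, -2)
```

Because only the offsets and a gather index are computed, the same idea should carry over to GPU arrays, though that is untested here.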
A first-order solution is probably to use the `list.take` method for cudf list columns. That is, you can just make a column of list indices and call `list.take` with it. It should be possible to get better performance with either a cudf primitive or another custom solution, but a quick, simple solution may be a good start.
This adds an operator to slice rows of list columns. This will let us truncate list column rows to only take the first N or last N items for instance. Closes NVIDIA-Merlin#734
Is your feature request related to a problem? Please describe.
This feature is related to #641 "Sequential / Session-based recommendation and time series support - Group by sorting values by timestamp".
After grouping, some sequences (e.g. user sessions or time series) might be very long, and some ML models require sequences with a fixed maximum length, so list truncation is necessary.
Because list columns are internally represented as sparse vectors, it is currently not possible to use a LambdaOp to truncate the list values to a maximum length.
Describe the solution you'd like
I would like either an option on the Groupby op to truncate all aggregated list columns to the same maximum length, or an independent TruncateList op that would truncate selected list columns.
Describe alternatives you've considered
As a workaround for this problem, I am extending the NVT PyTorch dataloader: converting the internal representation of list columns to PyTorch sparse tensors (as shown in #500), converting them to dense tensors (zero-padded on the right), and then slicing the second dimension of the tensor to the maximum length.
But storing longer sequences than needed in the parquet files wastes space and requires workarounds like this on the model side.
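The pad-then-slice workaround can be sketched roughly as follows. This is not the actual dataloader extension: NumPy stands in for the PyTorch sparse and dense tensors, and `pad_and_truncate` is a hypothetical name used only for illustration.

```python
import numpy as np

def pad_and_truncate(offsets, values, max_len, pad_value=0):
    """Hypothetical helper: densify a list column, pad on the right, cut to max_len."""
    offsets = np.asarray(offsets)
    values = np.asarray(values)
    lengths = np.diff(offsets)
    n_rows = len(lengths)
    longest = int(lengths.max()) if n_rows else 0
    # Pad to at least max_len so the result always has exactly max_len columns.
    width = max(longest, max_len)
    dense = np.full((n_rows, width), pad_value, dtype=values.dtype)
    # mask[i, j] is True where row i holds a real item at position j.
    mask = np.arange(width) < lengths[:, None]
    dense[mask] = values
    # Slice the second dimension down to the maximum length.
    return dense[:, :max_len]

# Three rows: [10, 11, 12], [20], [30, 31, 32, 33, 34]
dense = pad_and_truncate([0, 3, 4, 9], [10, 11, 12, 20, 30, 31, 32, 33, 34], max_len=4)
```

Note that the full-width dense array is materialised before slicing, which is the extra cost that truncating before writing to parquet would avoid.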