[FEA] Sequential / Session-based recommendation and time series support - Group by sorting values by timestamp #641
Comments
This feature is essential in our time series workflows as well. In our core workflow, the core pseudocode looks like this:

```python
df = load_some_df()
df['primary_id'] = flatten_ids(['ids', 'to', 'combine'])
df = df.sort_values('time_id')  # sort_values returns a copy; reassign to keep the order
df.groupby('primary_id')
```

The changes proposed here would cover this case.
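A runnable pandas sketch of this pattern (`load_some_df` and `flatten_ids` are hypothetical helpers from the pseudocode above; toy data stands in for them here):

```python
import pandas as pd

# Toy stand-in for the real data: 'primary_id' plays the role of the
# flattened id and 'time_id' the timestamp.
df = pd.DataFrame({
    "primary_id": [1, 1, 2, 2, 1],
    "time_id":    [3, 1, 2, 1, 2],
    "item":       ["c", "a", "e", "d", "b"],
})

# sort_values returns a new frame; reassign so the ordering sticks
df = df.sort_values("time_id")

# pandas groupby preserves the within-group row order, so each
# collected list comes out sorted by timestamp
sessions = df.groupby("primary_id")["item"].agg(list)
print(sessions.to_dict())  # {1: ['a', 'b', 'c'], 2: ['d', 'e']}
```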
On another note, on the data loading side, it will likely also be pertinent to allow for sliding-window access-time tensor building on the collected lists. As an example (ignoring NVT conventions to illustrate the point):

```python
seq = [10, 20, 30]
session_loader = nvt.session_loader(seq, window=2)
session_loader[0]  # Result is [10, 20]
session_loader[1]  # Result is [20, 30]
```
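`nvt.session_loader` above is a hypothetical API, but the windowing itself can be sketched in plain Python:

```python
def sliding_windows(seq, window):
    """Return all contiguous windows of the given size over seq."""
    return [seq[i:i + window] for i in range(len(seq) - window + 1)]

windows = sliding_windows([10, 20, 30], window=2)
print(windows[0])  # [10, 20]
print(windows[1])  # [20, 30]
```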
Thanks for writing up this issue @gabrielspmoreira. With the above in mind, I want to clarify: is this issue requesting a transform operator that does not follow the "partition-wise" convention? That is, is the goal to introduce a new type of transform that cannot be performed with a linear pass over the underlying ddf partitions, or would this be a new "statistics" operation (only to be performed at "fit" time)? In order to avoid a departure from the current/simple transform convention, it may actually make sense to separate this feature from the usual transform path.
cc @benfred (since I'm certainly interested in your thoughts here as well)
Thanks for your analysis @rjzamora. I understand that a GroupBy op requires a full shuffle, because other rows from the same group (e.g. user id or session id) might not be in the same partition, right? In this sense, I understand why it would be easier to implement this separately. Another option would be having the GroupBy op as the first step in our pipeline, and then performing the subsequent ops (e.g. Categorify, LogOp, ...) on the list columns.
During inference, the task of sequential recommendation and session-based recommendation is to provide a ranked list of items for a user or session, represented by a sequence of their past interactions. So, Triton should receive a batch with the last user interactions and return the recommendation list. The GroupBy op (ordered by the timestamp column) should be performed at inference the same way it was during preprocessing. So I understand that the GroupBy op should be within our workflow.
Exactly right - I was using the term "partition-wise" to describe a transform that does not require a row/record to move between partitions, but I'm not sure of the best language here.
The only reason I was suggesting a
Note that my current thinking is a bit different (see below), but I am still thinking that
Ah - This is a great point. Then I agree that we ultimately do need something like
I have created a separate issue #734 for the support of list-column truncation to a maximum sequence length.
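As a plain-pandas sketch of what such list-column truncation (together with the related list-length computation mentioned earlier) could look like; `MAX_LEN` and the column names are illustrative, not NVTabular API:

```python
import pandas as pd

MAX_LEN = 2  # hypothetical maximum sequence length

df = pd.DataFrame({"item_id_list": [[1, 2, 3], [4], [5, 6]]})

# Length of each session's interaction list
df["session_len"] = df["item_id_list"].apply(len)

# Keep only the last MAX_LEN interactions of each session
df["item_id_list"] = df["item_id_list"].apply(lambda xs: xs[-MAX_LEN:])
print(df["item_id_list"].tolist())  # [[2, 3], [4], [5, 6]]
```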
Is your feature request related to a problem? Please describe.
In order to support session-based / sequential recommendation and also time series use cases, we need NVTabular to be able to group data by some columns (e.g. user id, session id), sorted by another column (usually a timestamp), and aggregate other columns with aggregation functions that take order into account: 'list', 'first', 'last'.
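A minimal pandas illustration of these order-aware aggregations (toy column names; the NVTabular op requested here would do the equivalent at scale):

```python
import pandas as pd

df = pd.DataFrame({
    "session_id": [1, 1, 1, 2, 2],
    "timestamp":  [30, 10, 20, 2, 1],
    "item_id":    [103, 101, 102, 202, 201],
})

# Sort first so the order-sensitive aggregations see events in time order
df = df.sort_values("timestamp")

agg = df.groupby("session_id").agg(
    items=("item_id", list),       # 'list': full interaction sequence
    first_item=("item_id", "first"),
    last_item=("item_id", "last"),
)
print(agg.loc[1, "items"])  # [101, 102, 103]
```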
Describe the solution you'd like
Describe alternatives you've considered
It might be OK if some NVT ops, like Categorify and Standardize, do not support the resulting list columns, because those ops can be done before the grouping. But the LambdaOp should support list columns to allow, for example, extracting the length of the lists or truncating the lists.
Additional context
This feature can be accomplished by different data frame frameworks:
Pandas
In this example, the preliminary sorting of the rows is respected by the group by, so that the aggregated columns ('item_id') will be sorted by timestamp.
P.S.: cudf / dask_cudf also support the 'list' aggregation function, but they do not guarantee that the data frame ordering will be respected, as pandas does.
PySpark
This issue was extracted from #355, which is broader in scope, so that it can be implemented independently.
Other related issues: #92 #325