[FEA] Add window functions #740
Comments
We have first and last support already in v0.5.
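For reference, a minimal sketch of what first/last-per-group could look like with the Groupby op; the column names and the exact parameter names (`groupby_cols`, `sort_cols`, `aggs`) are assumptions based on the v0.5 docs, not code from this thread:

```python
# Hypothetical sketch: first/last item per session with NVTabular's Groupby op.
# Column names and parameters are illustrative assumptions.
import nvtabular as nvt

groupby_features = ["session_id", "item_id", "timestamp"] >> nvt.ops.Groupby(
    groupby_cols=["session_id"],          # partition key
    sort_cols=["timestamp"],              # order rows within each partition
    aggs={"item_id": ["first", "last"]},  # keep the first and last item per session
)
workflow = nvt.Workflow(groupby_features)
```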
As in #734, the list.take method could possibly be used for a first-order solution if something beyond first/last is needed.
Indeed, we got the aggregation functions we needed for session-based recommendation within the closed issue #641, which introduced the …
For session-based recommendation, when the session id is not provided in the dataset, we use the idle time between user interactions to split the sessions (usually a maximum of 30 min between two consecutive interactions within a session). I understand that I could use ops.DifferenceLag() partitioned by user id to get the elapsed time between user interaction timestamps. But I am not sure how I could use this new "delta time" feature to generate the same session id for consecutive interactions whose delta time is below the threshold, or to split the sessions into lists as we do with nvt.ops.Groupby(). I don't know if this use case fits an aggregation or a window function; if not, I can open a separate issue for it.
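One common way to turn that delta-time feature into session ids, outside of NVTabular with plain cuDF/pandas, is a cumulative sum over a "new session" flag. This is only a sketch under assumed column names (`user_id`, `timestamp` in seconds), not an existing NVTabular op:

```python
# Sketch: derive a session id from idle time. Written against the pandas API;
# on older cuDF the groupby shift may need the manual masking trick from this thread.
SESSION_GAP = 30 * 60  # 30 minutes of idle time starts a new session

df = df.sort_values(["user_id", "timestamp"])
prev_ts = df.groupby("user_id")["timestamp"].shift(1)    # previous event per user
delta = df["timestamp"] - prev_ts                        # idle time ("delta time")
new_session = prev_ts.isna() | (delta > SESSION_GAP)     # first event or long gap
df["session_id"] = new_session.cumsum()                  # running count = session id
```

The resulting `session_id` column could then feed the Groupby/list aggregations discussed above.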
Regarding window functions (not specific to session-based recommendation), it is a common feature engineering practice to use lead and lag features for time series and recommender systems in general. As an example, KGMON used this technique in the Booking.com challenge to have, for each training row, the last 5 cities in the sequence (e.g. shift(5), shift(4), shift(3), shift(2), shift(1), partitioned by trip), like in this example with cuDF:

```python
def shift_feature(df, groupby_col, col, offset, nan=-1, colname=''):
    # Shift the column, then reset values that crossed a group boundary
    df[colname] = df[col].shift(offset)
    df.loc[df[groupby_col] != df[groupby_col].shift(offset), colname] = nan

shift_feature(raw, 'utrip_id_', 'city_id_', 1, NUM_CITIES, f'city_id_lag{1}')
shift_feature(raw, 'utrip_id_', 'city_id_', 2, NUM_CITIES, f'city_id_lag{2}')
...
```

I have used this approach with cuDF to remove consecutive repeated user interactions on the same item, as in the following example:

```python
# Sort the dataframe by session and timestamp, to detect consecutive repetitions
interactions_df = interactions_df.sort_values(['session_id', 'timestamp'])
interactions_df['item_id_past'] = interactions_df['item_id'].shift(1)
interactions_df['session_id_past'] = interactions_df['session_id'].shift(1)

# Keep only interactions that are not consecutive repetitions within the same session
interactions_df = interactions_df[~((interactions_df['session_id'] == interactions_df['session_id_past']) & \
                                    (interactions_df['item_id'] == interactions_df['item_id_past']))]
```

In both cases, we did a hack on cuDF compared to the shift() available in Pandas (via groupby().shift()), which supports partitioning by column as in the example of this FEA.
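For comparison, this is roughly what the partitioned lag looks like when a groupby-aware shift is available (pandas supports it; recent cuDF releases do as well). A sketch of the pandas idiom, reusing the `raw`, `utrip_id_`, `city_id_` and `NUM_CITIES` names from the example above, not an NVTabular op:

```python
# Groupby-aware lag features: one column per offset, partitioned by trip.
for offset in range(1, 6):
    raw[f'city_id_lag{offset}'] = (
        raw.groupby('utrip_id_')['city_id_']
           .shift(offset)
           .fillna(NUM_CITIES)  # same fill value as in the masking hack above
    )
```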