[FEA] Add window functions #740

Open
karlhigley opened this issue Apr 15, 2021 · 5 comments
Labels
enhancement New feature or request P1

Comments

@karlhigley
Contributor

No description provided.

@viswa-nvidia viswa-nvidia added this to the NVTabular v0.6 milestone Apr 26, 2021
@benfred benfred added the P0 label May 4, 2021
@benfred
Member

benfred commented May 4, 2021

We have first and last support already in v0.5 -

@rjzamora
Collaborator

rjzamora commented May 4, 2021

As in #734, the list.take method could possibly be used for a first-order solution if something beyond first/last is needed.

@gabrielspmoreira
Member

Indeed, we already have the aggregation functions we needed for session-based recommendation from the closed issue #641, which introduced the nvt.ops.Groupby() op. It aggregates interactions by a column (e.g. session or user id), sorts the interactions by another column (e.g. timestamp), and then provides either a "list" or the "first" or "last" element of the list.
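To make the group, sort, and aggregate semantics described above concrete, here is a minimal sketch in plain pandas. It is not the nvt.ops.Groupby() implementation; the column names and the toy data are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical toy interactions, deliberately out of timestamp order.
interactions = pd.DataFrame({
    "session_id": [1, 1, 1, 2, 2],
    "timestamp": [30, 10, 20, 5, 15],
    "item_id": [103, 101, 102, 201, 202],
})

# Sort by session and time, then aggregate each session's items into
# a list plus its first and last element.
sessions = (
    interactions.sort_values(["session_id", "timestamp"])
    .groupby("session_id")["item_id"]
    .agg(item_list=list, item_first="first", item_last="last")
    .reset_index()
)
print(sessions)
```

Sorting before grouping matters here: the "first" and "last" aggregations simply follow row order inside each group.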

@gabrielspmoreira
Member

gabrielspmoreira commented May 4, 2021

For session-based recommendation, when the dataset does not provide a session id, we use the idle time between user interactions to split the sessions (usually a maximum of 30 minutes between two consecutive interactions within a session). I understand that I could use ops.DifferenceLag() partitioned by user id to get the elapsed time between the timestamps of consecutive user interactions. But I am not sure how I could use this new "delta time" feature to assign the same session id to interactions with a low delta time, or to split the sessions into lists as we do with nvt.ops.Groupby(). I don't know whether this use case fits an aggregation or a window function; if not, I can open a separate issue for it.
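One possible way to turn the "delta time" into session ids, sketched in plain pandas rather than NVTabular: flag every interaction whose idle time exceeds the threshold (or that is the user's first) and take a running sum of the flags. The function name `assign_session_ids`, its parameters, and the sample data are hypothetical, and an NVTabular op might look quite different.

```python
import pandas as pd

def assign_session_ids(df, user_col, ts_col, max_idle=pd.Timedelta(minutes=30)):
    """Hypothetical helper: split each user's interactions into sessions by idle time."""
    df = df.sort_values([user_col, ts_col]).reset_index(drop=True)
    # Elapsed time since the same user's previous interaction (NaT for the first one).
    idle = df.groupby(user_col)[ts_col].diff()
    # A session starts at a user's first interaction or after a long idle gap.
    new_session = idle.isna() | (idle > max_idle)
    # The running count of session starts gives a dataset-wide session id.
    df["session_id"] = new_session.cumsum()
    return df

events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime([
        "2021-05-04 10:00", "2021-05-04 10:10", "2021-05-04 11:00",
        "2021-05-04 10:05", "2021-05-04 10:20",
    ]),
})
sessionized = assign_session_ids(events, "user_id", "timestamp")
print(sessionized)
```

The resulting session_id column could then serve as the grouping column for a list aggregation.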

@gabrielspmoreira
Member

gabrielspmoreira commented May 4, 2021

Regarding window functions in general (not specific to session-based recommendation), it is common feature engineering practice to use lead and lag features for time series and recommender systems.
We have the ops.DifferenceLag() op to compute the difference between the current value and the previous value of a feature for a user. But it would be very useful to have Lag() and Lead() ops that return the actual "past" and "future" values of a given feature, partitioned by a column (e.g. user). In Pandas this is possible with shift(1) or shift(-1) partitioned by a column (e.g. user). This sliding window feature is a FEA on the cuDF repo, which is being addressed by this PR, so hopefully that will make it easier to integrate it into NVTabular.
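As a minimal illustration of the partitioned shift described above, the following pandas sketch computes lag and lead columns per user. The column names and data are assumptions for illustration only; this is not a proposed NVTabular API.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "timestamp": [1, 2, 3, 1, 2],
    "item_id": [10, 11, 12, 20, 21],
})

# Rows must be ordered in time within each user before shifting.
df = df.sort_values(["user_id", "timestamp"])
grouped = df.groupby("user_id")["item_id"]

# shift(1) yields the previous ("past") value within the same user,
# shift(-1) the next ("future") value; group boundaries become NaN.
df["item_id_lag"] = grouped.shift(1)
df["item_id_lead"] = grouped.shift(-1)
print(df)
```

Because the shift happens inside each group, no value leaks from one user into the next, so no boundary fix-up is needed afterwards.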

As an example, KGMON has used this feature in the Booking.com challenge to have, for each training row, the last 5 cities in a sequence (e.g. shift(5), shift(4), shift(3), shift(2), shift(1), partitioned by trip), as in this example on cuDF:

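def shift_feature(df, groupby_col, col, offset, nan=-1, colname=''):
    # Shift the whole column by `offset` rows (assumes df is already sorted by group and time)
    df[colname] = df[col].shift(offset)
    # Rows whose shifted value came from a different group get the `nan` fill value instead
    df.loc[df[groupby_col]!=df[groupby_col].shift(offset), colname] = nan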

shift_feature(raw, 'utrip_id_', 'city_id_', 1, NUM_CITIES, f'city_id_lag{1}')
shift_feature(raw, 'utrip_id_', 'city_id_', 2, NUM_CITIES, f'city_id_lag{2}')
...

I have also used shift() in cuDF to remove consecutive repeated user interactions with the same item, as in the following example:

# Sort the dataframe by session and timestamp so that repeated interactions are adjacent
interactions_df = interactions_df.sort_values(['session_id', 'timestamp'])
interactions_df['item_id_past'] = interactions_df['item_id'].shift(1)
interactions_df['session_id_past'] = interactions_df['session_id'].shift(1)
# Keep only the interactions that do not repeat the previous item within the same session
interactions_df = interactions_df[~((interactions_df['session_id'] == interactions_df['session_id_past']) &
                 (interactions_df['item_id'] == interactions_df['item_id_past']))]

In both cases, we had to work around the missing feature in cuDF by shifting the whole column and comparing group ids by hand, whereas the shift() available in Pandas supports partitioning by column, as in the example of this FEA.
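For comparison, here is a sketch of the same consecutive-repetition filter written with a partitioned shift in pandas, which removes the need for the helper session_id_past column. The toy data is made up, and whether a given cuDF version supports the same groupby shift call is an assumption that would need checking.

```python
import pandas as pd

interactions_df = pd.DataFrame({
    "session_id": [1, 1, 1, 1, 2, 2],
    "timestamp": [1, 2, 3, 4, 1, 2],
    "item_id": [7, 7, 8, 7, 7, 7],
})

interactions_df = interactions_df.sort_values(["session_id", "timestamp"])
# Previous item within the same session; NaN at the start of each session.
prev_item = interactions_df.groupby("session_id")["item_id"].shift(1)
# Keep a row unless it repeats the item seen immediately before it in its session.
deduped = interactions_df[interactions_df["item_id"] != prev_item]
print(deduped)
```

The first row of each session is always kept, because comparing an item id with NaN evaluates as "not equal".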

@benfred benfred changed the title Add window functions for session-based recs [FEA] Add window functions for session-based recs May 4, 2021
@benfred benfred added P1 and removed P0 labels Jun 7, 2021
@karlhigley karlhigley changed the title [FEA] Add window functions for session-based recs [FEA] Add window functions Jun 7, 2021
@karlhigley karlhigley added enhancement New feature or request and removed session-based labels Jun 7, 2021
@karlhigley karlhigley removed this from the NVTabular v0.6 milestone Jun 22, 2021
5 participants