Should event likelihood be computed using current or last hidden state? #10
Comments
I have the same question as @mistycheney. In this piece of code, the likelihood is calculated using h_i, which has already encoded the i-th event. This would lead the model to maximize the likelihood of the i-th event's type and to minimize the likelihood of all other event types at that point. Does this explain the dramatic decrease in negative log-likelihood reported in the paper (Table 4)? I think this part of the code may not be written correctly.
Exactly, I think this is an error, and there are many different details in the code. This is the function that calculates the log-likelihood: Transformer-Hawkes-Process/Utils.py, line 58 at commit e1fd7ac.
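For reference, the quantity this function should evaluate is the usual log-likelihood of a marked temporal point process (notation is mine: $k_i$ and $t_i$ are the type and time of the $i$-th event, $\lambda_k$ the type-$k$ intensity, $n$ the sequence length):

```latex
\log L
  = \underbrace{\sum_{i=1}^{n} \log \lambda_{k_i}(t_i)}_{\text{event log-likelihood}}
  \;-\; \underbrace{\int_{0}^{t_n} \sum_{k=1}^{K} \lambda_k(t)\,\mathrm{d}t}_{\text{non-event log-likelihood}}
```

The code computes these two terms separately.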
There are several inputs.

Preliminary: two masks. Please refer to lines 61~65. There are two masks: a padding mask that zeroes out the padded positions of each sequence, and a type mask that selects, at every position, the event type that actually occurred.
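As an illustration (this is my own sketch, not the repo's exact code; the names and the 0-for-padding convention are assumptions), the two masks look roughly like this:

```python
import torch

def build_masks(types: torch.Tensor, num_types: int):
    """types: LongTensor of shape (batch, seq_len); 0 = padding, 1..K = event types (assumed)."""
    # Padding mask: 1.0 at real events, 0.0 at padded positions.
    non_pad_mask = types.ne(0).float()                     # (batch, seq_len)

    # Type mask: one-hot over event types at each position, used later to pick
    # out the intensity of the type that actually occurred.
    type_mask = torch.zeros(*types.size(), num_types)      # (batch, seq_len, K)
    for k in range(num_types):
        type_mask[:, :, k] = (types == k + 1).float()
    return non_pad_mask, type_mask
```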
Event likelihood. Get the hidden states, calculate the intensity of every event type at every position, and extract only the intensities of the types that truly occurred. Please refer to lines 67~69. Then apply the log function and sum everything up. Please refer to lines 72~73.
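A minimal sketch of that event term, under the same assumed shapes and names (illustrative only, not the repo's code):

```python
import torch
import torch.nn.functional as F

def event_log_likelihood(hidden, weight, type_mask, non_pad_mask):
    """
    hidden:       (batch, seq_len, d_model)  transformer hidden states
    weight:       (d_model, K)               projection from hidden state to per-type scores
    type_mask:    (batch, seq_len, K)        one-hot mask of the observed event types
    non_pad_mask: (batch, seq_len)           1 at real events, 0 at padding
    """
    # Intensity of every event type at every position (softplus keeps it positive).
    all_lambda = F.softplus(hidden @ weight)                 # (batch, seq_len, K)
    # Keep only the intensity of the type that actually occurred at each position.
    event_lambda = (all_lambda * type_mask).sum(dim=-1)      # (batch, seq_len)
    # Log, mask out padding, and sum over the sequence.
    return (torch.log(event_lambda + 1e-9) * non_pad_mask).sum(dim=-1)
```

Note that all_lambda at position i is computed from hidden[:, i], i.e. from h_i, which is exactly the problem discussed next.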
HERE COMES THE FIRST ERROR: note that the i-th event's intensity is essentially $\lambda_{k_i}(t_i) = \mathrm{softplus}\big(\mathbf{w}_{k_i}^{\top}\mathbf{h}_i + b_{k_i}\big)$, where $\mathbf{h}_i$ is the hidden state at position i, i.e. the state that has already encoded the i-th event itself rather than the state of the previous event.

Non-event likelihood. The code sets the Monte-Carlo method as the default to calculate the integral of the intensity function, $\int_{t_i}^{t_{i+1}} \sum_k \lambda_k(t)\,\mathrm{d}t$. The essential idea is that, within every inter-event interval, you uniformly sample N points, calculate their intensities, and use their mean (times the interval length) as a representative value of the integral over that interval.
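A sketch of that Monte-Carlo estimate of the non-event term, again under assumed names and shapes (the repo's version differs in details; weight and alpha here are hypothetical parameters standing in for the model's projection and its time-interpolation slope):

```python
import torch
import torch.nn.functional as F

def non_event_mc(hidden, weight, alpha, times, non_pad_mask, num_samples: int = 100):
    """
    Monte-Carlo estimate of  sum_i  integral over [t_i, t_{i+1}] of  sum_k lambda_k(t) dt.
    hidden: (B, L, d_model), weight: (d_model, K), alpha: scalar or (K,),
    times: (B, L), non_pad_mask: (B, L).
    """
    # Interval lengths Δ_i = t_{i+1} - t_i (padded intervals are zeroed out).
    dt = (times[:, 1:] - times[:, :-1]) * non_pad_mask[:, 1:]            # (B, L-1)

    # N uniform samples inside each interval, as a fraction of Δ_i.
    frac = torch.rand(*dt.size(), num_samples, device=hidden.device)     # (B, L-1, N)

    # Paper-style intensity between t_i and t_{i+1}:
    #   lambda_k(t) = softplus( w_k^T h(t_i) + alpha_k * (t - t_i) / t_i )
    base = (hidden[:, :-1] @ weight).unsqueeze(2)                         # (B, L-1, 1, K)
    rel_t = (frac * dt.unsqueeze(-1) /
             (times[:, :-1].unsqueeze(-1) + 1e-9)).unsqueeze(-1)          # (B, L-1, N, 1)
    lam = F.softplus(base + alpha * rel_t).sum(dim=-1)                    # (B, L-1, N) total intensity

    # Mean over the N samples × interval length ≈ the integral over that interval.
    return (lam.mean(dim=-1) * dt).sum(dim=-1)                            # (B,)
```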
Suppose the transformer hidden state at event i is h_i. Should the likelihood of this event be computed using h_i or h_{i-1}?
Using h_{i-1} makes more sense to me, because it encourages the model to assign high intensity to the true next event and therefore to learn to forecast.
But the implementation and the paper seem to be using h_i. The problem is that, since the transformer is given the true event i as part of its input, it can simply learn to output an arbitrarily high intensity for the correct event type in order to maximize the likelihood. Yet the learned model would have no predictive power.
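To make the concern concrete, here is a sketch (my own, simplified) of the computation I would expect: event i's intensity is taken from h_{i-1} by shifting the hidden states one step, so the model has to predict each event before seeing it.

```python
import torch
import torch.nn.functional as F

def event_log_likelihood_shifted(hidden, weight, type_mask, non_pad_mask):
    """
    Event term where event i's intensity is computed from h_{i-1}, not h_i.
    hidden: (B, L, d_model), weight: (d_model, K), type_mask: (B, L, K), non_pad_mask: (B, L).
    """
    # Shift the hidden states one step to the right, so position i sees h_{i-1}.
    # The first event has no history and is simply dropped from the sum here.
    prev_hidden = hidden[:, :-1]                                   # h_1 ... h_{L-1}
    all_lambda = F.softplus(prev_hidden @ weight)                  # intensities predicted for events 2..L
    event_lambda = (all_lambda * type_mask[:, 1:]).sum(dim=-1)     # intensity of the type that occurred
    return (torch.log(event_lambda + 1e-9) * non_pad_mask[:, 1:]).sum(dim=-1)
```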
I feel I must have missed something. Any clarification is appreciated. Thanks.