
Should event likelihood be computed using current or last hidden state? #10

mistycheney opened this issue Nov 11, 2021 · 2 comments


mistycheney commented Nov 11, 2021

Suppose the transformer hidden state at event i is h_i. Should the likelihood of this event be computed using h_i or h_{i-1}?

Using h_{i-1} makes more sense to me, because it encourages the model to assign high intensity to the true next event and therefore to learn to forecast.

But the implementation and the paper seem to use h_i. The problem is that, since the transformer is given the true event i as part of its input, it can simply learn to output an arbitrarily high intensity for the correct event type in order to maximize the likelihood, even though the learned model will have no predictive power.
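Concretely, here is a toy sketch of the two alternatives (not this repository's code; the linear head and all sizes below are made up for illustration):

```python
import torch
import torch.nn.functional as F

# Toy setup (all sizes hypothetical): BATCH=2, SEQ_LEN=5, D_MODEL=8, NUM_TYPES=3.
torch.manual_seed(0)
h = torch.randn(2, 5, 8)                    # transformer hidden states h_1..h_L
types = torch.randint(0, 3, (2, 5))         # ground-truth event types k_1..k_L
head = torch.nn.Linear(8, 3)                # per-type intensity head (assumed)

# Option 1 (what the code appears to do): score event i with h_i.
# Since h_i already encodes event i, the head can push lambda_{k_i}(t_i) arbitrarily high.
lam_cur = F.softplus(head(h))               # [2, 5, 3]
ll_cur = torch.log(lam_cur.gather(-1, types.unsqueeze(-1)).squeeze(-1) + 1e-9)

# Option 2 (what I am suggesting): score event i with h_{i-1},
# i.e. shift the hidden states right by one so the model has to forecast event i.
h_prev = torch.cat([torch.zeros_like(h[:, :1]), h[:, :-1]], dim=1)
lam_prev = F.softplus(head(h_prev))
ll_prev = torch.log(lam_prev.gather(-1, types.unsqueeze(-1)).squeeze(-1) + 1e-9)
```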

I feel I must have missed something. Any clarification is appreciated. Thanks.

@AnthonyChouGit

I have the same question as @mistycheney. In this piece of code, the likelihood is calculated using h_i, which has already encoded the i-th event. This leads the model to maximize the likelihood of the i-th event's type, and to minimize the likelihood of all other event types, at that position. Does this explain the dramatic decrease in negative log-likelihood presented in the paper (Table 4)? I suspect this part of the code is not written correctly.


waystogetthere commented Nov 21, 2022

Exactly, I think this is an error. And there are many other details in the code that differ from the paper.

This is the function to calculate the log-likelihood:

def log_likelihood(model, data, time, types):

There are several inputs:

model: the Transformer
data: the raw output of the model, which needs to go through a linear layer to get the hidden state
time: the occurrence time of each event, shape: [BATCH, SEQ_LEN]
types: the type of each event, shape: [BATCH, SEQ_LEN]

Preliminary: Two Masks

Please refer to lines 61~65, where two masks are built.

# non_pad_mask.shape = [BATCH, SEQ_LEN]
This mask marks the padded positions in the batch. Training is done in batches, and sequences of different lengths within one batch are quite common.

# type_mask.shape = [BATCH, SEQ_LEN, NUM_TYPES]
This mask is a one-hot encoding indicating which event type occurs at each position.
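Roughly, the two masks could be built like this (a sketch in my own words, not the repository's code; the PAD index and function name are assumptions):

```python
import torch

PAD = 0  # assumed padding index for event types

def build_masks(types: torch.Tensor, num_types: int):
    """types: [BATCH, SEQ_LEN], with values 1..num_types for real events and PAD at padded positions."""
    # 1.0 at real-event positions, 0.0 at padded positions.
    non_pad_mask = types.ne(PAD).float()                  # [BATCH, SEQ_LEN]

    # One-hot encoding of the event type at each position (all zeros at padded positions).
    type_mask = torch.zeros(*types.shape, num_types)
    for k in range(num_types):
        type_mask[:, :, k] = (types == k + 1).float()     # [BATCH, SEQ_LEN, NUM_TYPES]
    return non_pad_mask, type_mask
```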

Event-likelihood

Get the hidden state, calculate the intensity of every event type at every position, and extract only the type that actually occurred. Please refer to lines 67~69:

all_lambda.shape = [BATCH, SEQ_LEN, NUM_TYPES]  # each type has its own intensity
type_lambda.shape = [BATCH, SEQ_LEN]  # only the ground-truth type is kept

Then apply the log function and sum over the sequence. Please refer to lines 72~73:

event_ll.shape=[BATCH]
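Condensed, the event term as I read it looks roughly like this (a paraphrase, not a copy of the repository's code; model.linear and the masks are assumed to behave as described above):

```python
import torch
import torch.nn.functional as F

def event_log_likelihood(model, data, non_pad_mask, type_mask):
    """Sketch of the event term: log-intensity of the ground-truth type at each real event, summed."""
    all_hid = model.linear(data)                        # [BATCH, SEQ_LEN, NUM_TYPES]
    all_lambda = F.softplus(all_hid)                    # intensity of every type at every position
    type_lambda = (all_lambda * type_mask).sum(dim=-1)  # keep only the ground-truth type -> [BATCH, SEQ_LEN]

    # Log-intensity at real events only; padded positions are set to 1 so they contribute log(1) = 0.
    type_lambda = type_lambda + 1e-9
    type_lambda = type_lambda.masked_fill(~non_pad_mask.bool(), 1.0)
    return torch.log(type_lambda).sum(dim=-1)           # event_ll: [BATCH]
```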

HERE COMES THE FIRST ERROR. Note that in the code the $i$-th event's intensity is $f_k(\mathbf{h}_i)$, i.e. the softplus function $f_k$ applied to a linear projection of the current hidden state. This is different from the paper, which defines the conditional intensity as

$\lambda_k(t \mid \mathcal{H}_t) = f_k\left(\alpha_k \frac{t - t_j}{t_j} + \mathbf{w}_k^\top \mathbf{h}_j + b_k\right), \quad t \in [t_j, t_{j+1}),$

so for the event at $t_i$ its intensity should be

$\lambda(t_i) = f_k\left(\alpha \frac{t_i - t_{i-1}}{t_{i-1}} + \mathbf{w}^\top \mathbf{h}_{i-1} + b\right).$

The code does not include the 'current' (temporal) term, and it uses the current hidden state $\mathbf{h}_i$ instead of the last hidden state $\mathbf{h}_{i-1}$.
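If I read the paper's formula correctly, the fix would look roughly like this (only a sketch, assuming model.linear and a scalar model.alpha exist as used above; handling of the very first event is a separate choice):

```python
import torch
import torch.nn.functional as F

def event_intensity_from_previous(model, data, time, type_mask):
    """Sketch: lambda(t_i) = softplus(alpha * (t_i - t_{i-1}) / t_{i-1} + w^T h_{i-1} + b)."""
    # Shift hidden states right by one position so event i is scored with h_{i-1}
    # (h_0 is taken to be zeros here; the first event needs special handling anyway).
    h_prev = torch.cat([torch.zeros_like(data[:, :1]), data[:, :-1]], dim=1)

    # Temporal ("current") term alpha * (t_i - t_{i-1}) / t_{i-1}; t_0 := t_1 so the first gap is 0,
    # and the +1 in the denominator guards against division by zero.
    t_prev = torch.cat([time[:, :1], time[:, :-1]], dim=1)
    temporal = (time - t_prev) / (t_prev + 1)                         # [BATCH, SEQ_LEN]

    all_lambda = F.softplus(model.linear(h_prev) + model.alpha * temporal.unsqueeze(-1))
    return (all_lambda * type_mask).sum(dim=-1)                       # lambda of the true type: [BATCH, SEQ_LEN]
```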

Non-Event Likelihood

The code uses the Monte-Carlo method by default to calculate the integral of the intensity function (the non-event term of the log-likelihood), i.e. an estimate of the form

$\hat{\Lambda} = \sum_{j} (t_j - t_{j-1}) \cdot \frac{1}{N}\sum_{n=1}^{N} \lambda(u_n), \quad u_n \sim \mathrm{Uniform}(t_{j-1}, t_j).$

The essential idea is that, within every inter-event interval $[t_{j-1}, t_j]$, $N$ points are sampled uniformly and their intensities are computed; their mean then serves as the representative intensity over $[t_{j-1}, t_j]$.
However, when calculating these intensities, the code still uses the current hidden state $\mathbf{h}_j$ instead of the last hidden state $\mathbf{h}_{j-1}$.
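For completeness, here is a sketch of the Monte-Carlo non-event term using the last hidden state $\mathbf{h}_{j-1}$ (again only my reading: model.linear and model.alpha are assumed as above, and the intensity is summed over all types as the paper's integral requires, so this is not a drop-in replacement for the repository's function):

```python
import torch
import torch.nn.functional as F

def non_event_mc(model, data, time, non_pad_mask, num_samples=100):
    """Sketch: sum_j (t_j - t_{j-1}) * mean_n sum_k lambda_k(u_n), with u_n ~ Uniform(t_{j-1}, t_j)."""
    diff_time = (time[:, 1:] - time[:, :-1]) * non_pad_mask[:, 1:]            # [BATCH, SEQ_LEN-1]

    # num_samples uniform offsets inside each inter-event interval, normalised by t_{j-1} (+1 for safety).
    temp_time = diff_time.unsqueeze(2) * torch.rand(*diff_time.shape, num_samples, device=data.device)
    temp_time = temp_time / (time[:, :-1] + 1).unsqueeze(2)                   # [BATCH, SEQ_LEN-1, N]

    # Use the *last* hidden state h_{j-1} for the interval (t_{j-1}, t_j], i.e. data[:, :-1, :].
    hid = model.linear(data[:, :-1, :])                                       # [BATCH, SEQ_LEN-1, NUM_TYPES]

    # Intensity of every type at every sampled point, summed over types, averaged over samples.
    all_lambda = F.softplus(hid.unsqueeze(2) + model.alpha * temp_time.unsqueeze(-1))
    all_lambda = all_lambda.sum(dim=-1).mean(dim=2)                           # [BATCH, SEQ_LEN-1]
    return (all_lambda * diff_time).sum(dim=-1)                               # non_event_ll: [BATCH]
```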
