Nonreproducible predictInterval results mixing observed and unobserved levels. #124
Comments
I think this is the expected behavior of I admit that both our documentation and vignette are a bit out of date on this one, for example the vignette mistakenly states:
It looks like you'd prefer to return an NA for these observations? Or would you prefer it behave as the vignette describes?
I changed the title to reflect the main issue, which is the prediction of different intervals with multiple calls for the same query. In my case I want to use the random effects to represent nesting, for example Regarding NAs: I expected the lines above to produce NAs because this leads to the random effect being dropped later in
On the technical side I need to look more into what Do you happen to have a reproducible example I can execute that illustrates this?
Conceptually here you are right in thinking about the fitted value for the prediction. The mean/median pupil effect will be 0, so we could just rely on Your approach would exclude two types of variation in the outcome that we know about and can define from the model: the variation from the mean of the effect of a given student (which can be estimated by looking at the distribution the random pupil effects are drawn from), and you would also be excluding the variance associated with the precision with which we can estimate the effect of any given pupil. This would make the prediction interval overly confident in terms of what our model can know about the student's likely outcome. By sampling across the distribution of plausible values the model has already seen, we're saying we believe the pupil was probably drawn from the same distribution as the pupils in the model, but capturing that we don't know more than that about them and that our prediction should reflect that uncertainty.

This is why I prefer the default to be for each simulation to sample across the distribution for the unobserved pupil and draw an interval encompassing the potential effect of that pupil and the variance in our precision of estimating the pupil effect. Whether that default is a) documented correctly, or b) functioning in all edge cases correctly is something you've helped raise as something I should look into.

However, in some cases, I can see where this default is undesirable. In my own work I have dealt with this by fixing unobserved groups to the group closest to the mean when I define
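To make the point about overconfident intervals concrete, here is a minimal base-R sketch. It is not how predictInterval is implemented; it only assumes a random-intercept model in which an unobserved pupil's effect can be drawn from a normal distribution with the estimated between-pupil standard deviation. All names and numbers (beta0, sigma_b, sigma_e) are hypothetical.

```r
set.seed(42)
n_sims  <- 5000
beta0   <- 70   # hypothetical fixed intercept (mean grade)
sigma_b <- 5    # hypothetical SD of the pupil random intercepts
sigma_e <- 3    # hypothetical residual SD

# Option 1: fix the unobserved pupil's effect at its mean (0).
fixed_zero <- beta0 + rnorm(n_sims, mean = 0, sd = sigma_e)

# Option 2: draw the unobserved pupil's effect from N(0, sigma_b^2)
# in every simulation (an assumption, not the merTools algorithm).
sampled <- beta0 + rnorm(n_sims, mean = 0, sd = sigma_b) +
  rnorm(n_sims, mean = 0, sd = sigma_e)

# Width of the central 95% interval of the simulated outcomes.
width <- function(x) unname(diff(quantile(x, c(0.025, 0.975))))
c(fixed_zero = width(fixed_zero), sampled = width(sampled))
```

Under these assumptions the interval from the second option is noticeably wider, because it also carries the between-pupil variation that fixing the effect at zero discards.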
Here is an example. For clarity: grades come from different tests that each student took during the year, and for whatever reason some grades/pupils have been "lost":
In the output I see that
Maybe I am starting to understand the issue. Now, regarding the two sources of variation that you mention:
The mean effect should be drawn from the distribution of the class. If
I don't think that selecting a level at random is a good way of capturing this. What about introducing an option to specify assumptions? For example the user could assume that the variance is the same for all pupils, so that it can be estimated from the observed pupils. This may not make much sense in this specific example, but would be great if one wants to model experimental measurements taken with the same device/technique (which is actually my case). Alternatively, one could conservatively assume that this variance is at most as large as the largest observed variance. Complicating things even more, one could model the distribution of the pupil variances, and sample from there... just brainstorming a bit here. I understand this is a difficult problem and I really appreciate your help!
I caused quite a bit of confusion with my comments about the sources of variation. Using the same example and assuming that tests are noisy observations of the skills of a student and we are interested in predicting the "true grade" of a student (not what the grade of a future test could be). One would then call Trying to iterate on your current implementation: currently when a new pupil appears, the function would pick an observed one at random and sample
Thanks for bringing this up. I will write some test conditions and get Thanks for helping me clearly see an important defect in how we are communicating what
That sounds very helpful, thanks a lot!
Hi,
I have trained an LMM of the form
y ~ 1 + (1|a) + (1|a:b)
and now want to predict y for different inputs. All the inputs are stored in a single dataframe, which I pass to predictInterval. In my query dataframe all a are present in the training dataset, while a:b are only sometimes present.

I noticed that multiple calls to predictInterval return significantly different (more than simple sampling noise) intervals in some cases. Surprisingly, the intervals predicted by calling predictInterval on the entire dataframe and on each row individually are consistently different.

- If all a:b in the query are present in the training dataset, I don't see issues.
- If none of the a:b in the query are present in the training dataset, I don't see issues.
- If some a:b in the query are present in the training dataset and some are not, I see the issue.

I tried to debug the code of the function, and I believe that the problem lies here: https://github.com/jknowles/merTools/blob/master/R/merPredict.R#L246
With my input, tmp[keep] looks like this:

Queries 2 and 4 have corresponding levels in the training set, while query 3 doesn't. I would now expect tmp$var to take the values a1:b1, NA, a2:b1, but instead it takes either a1:b1, a1:b1, a2:b1 or a1:b1, a2:b1, a2:b1 at random. There are two problems here:

- max.col will return a value (and thus select a column/level) even if the entire row is zero (instead of returning NA).
- max.col by default breaks ties at random, so for an all-zero row it selects a column at random.

I'm relatively new to R, and even more so to LMMs, but my understanding is that when a level is missing, predictInterval will compute the random effect for another level selected at random from the (observed) levels of other queries. A solution that seems to work for me is to change the guilty line to:

Quite dirty, but the idea is to fill the all-zeros rows with NAs, so that max.col will return NA for them.

Unfortunately I cannot share my actual model/dataset at the moment, but if the issue is unclear I'm happy to try to build a small example to reproduce the problem.
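The two max.col behaviours described above can be reproduced without any model, using only base R. The sketch below is an illustration on a toy indicator matrix: the matrix m, its level names, and the NA-filling step are assumptions made for this example, and they are not the actual merTools code or the exact change proposed above.

```r
# Toy indicator matrix: one row per query, one column per observed level.
# Row 2 is all zeros, i.e. its level was never seen in training.
m <- matrix(c(1, 0,
              0, 0,
              0, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("a1:b1", "a2:b1")))
lv <- colnames(m)

# Default ties.method = "random": the all-zero row is a tie across all
# columns, so the selected level changes from call to call.
set.seed(1)
picks <- replicate(200, max.col(m)[2])
table(lv[picks])

# ties.method = "first" is deterministic, but still returns a column
# for the all-zero row instead of NA.
max.col(m, ties.method = "first")

# Filling all-zero rows with NA makes max.col return NA for them.
m_na <- m
m_na[rowSums(m_na) == 0, ] <- NA
idx <- max.col(m_na)
lv[idx]  # "a1:b1" NA "a2:b1"
```

Switching ties.method alone would make repeated calls agree, but it would still silently assign an observed level to the unobserved query, which is why the NA marking matters here.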