What happens with missing values during prediction? #2921
Of course, prediction handles all missing values. LightGBM/include/LightGBM/tree.h Lines 527 to 539 in 9654f16
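For context, here is a minimal Python paraphrase of how I read that routing logic (the `missing_type` and `default_left` names mirror the tree.h fields, but this sketch is my assumption, not the actual C++):

```python
import math

def numerical_decision(fval, threshold, missing_type, default_left):
    # Hedged sketch of the prediction-time routing for a numerical split;
    # missing_type is one of "none", "zero", "nan" (names are illustrative).
    if missing_type != "nan" and math.isnan(fval):
        fval = 0.0  # no NaN handling was learned: treat NaN as zero
    if (missing_type == "zero" and fval == 0.0) or \
       (missing_type == "nan" and math.isnan(fval)):
        # missing values follow the default direction learned in training
        return "left" if default_left else "right"
    return "left" if fval <= threshold else "right"
```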
Hello @guolinke, I already read a bit and still have some doubts that I can't get past. During train:
During prediction, when a missing value is found for a numerical field:
During prediction, when a missing value is found for a categorical field:
Thank you!
(1) yes, see LightGBM/include/LightGBM/tree.h Lines 260 to 264 in a8c1e0a
(5) For categorical features, the split is unordered (both {(1, 3), (2, 4, nan)} and {(2, 4, nan), (1, 3)} are possible for a categorical feature, but not for a numerical one). Therefore, forcing the missing values to the right side is okay.
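To make the unordered point concrete, a tiny sketch (the numeric labels are hypothetical categories and the string "nan" stands in for the missing bucket):

```python
# The two writings of the same categorical split from the comment above:
split_a = frozenset({frozenset({1, 3}), frozenset({2, 4, "nan"})})
split_b = frozenset({frozenset({2, 4, "nan"}), frozenset({1, 3})})

# They are the same unordered bipartition, so which side "holds" the
# missing bucket is a convention; LightGBM can force it to the right.
assert split_a == split_b
```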
Sorry but I got confused reading (3) & (4).
Just to be sure, you're saying that during scoring, missing values for a numerical split:
Correct? :) (a)
LightGBM/include/LightGBM/tree.h Lines 260 to 262 in cc6a2f5
(updated to latest code above) That is only if we choose not to handle missings. I'm not sure I'm following your notation here:
@AlbertoEAF For (5), yes.
Thank you @guolinke, I already understood those. However, I don't understand the choice of nan allocation in the case where there were no missing values in train. For numerical splits, as you explained, values below the split threshold must go to the left side and values above it to the right. Why then, by default, do we allocate missing values to the left?
As for categorical splits, since they have no order, what dictates whether a new value seen in train goes to the left or the right at all?
@AlbertoEAF "by default" doesn't mean they are always that way.
Sorry @guolinke, let me try to articulate a bit better! Looks huge, I know, but I hope it's simpler and more explicit :)

Assumptions and problem statement

My questions concern only the particular case where we didn't have missing values in train but do have missing values in scoring. This means the model has no prior information on the distribution of missing values. Everything I say in the rest of the post assumes the paragraph above! Seeing a missing value in scoring, the model will place the missing value to:

Questions

Missing values with numerical features

For numericals, the L and R sides are ordered, where non-missing values respect:
and during scoring we allocate the missing value to the L side by default because we have no prior information about missing values.

Question #1:

Missing values with categorical features

For categoricals, missing values will be placed on the R side at all times in this scenario. I don't understand, however, what happens for categoricals in train, and that is what determines the meaning of placing a missing value on the R side in scoring.

How is the L vs R side controlled/chosen in train?

Let's assume that:
To find the optimal split, LightGBM sorts by objective function and finds that the optimal split is {A} vs {B or C}.

Question #2:
@guolinke can you clarify? I think I have a proposal to improve scores on missing values, but first I need to know if I understood the current algorithm :)
@AlbertoEAF sorry for the late response, very busy recently... For numerical features, if no missing value is seen in training, the missing value will be converted to zero and then checked against the threshold. So it is not always the left side.
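A quick way to verify that zero-conversion behavior (a sketch with made-up data; any small dataset without NaNs in training would do):

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))          # training data contains no NaNs
y = (X[:, 0] > 0).astype(float)

model = lgb.train({"objective": "binary", "verbose": -1},
                  lgb.Dataset(X, label=y), num_boost_round=20)

row_nan = np.array([[np.nan, 0.5, -0.2]])
row_zero = np.array([[0.0, 0.5, -0.2]])

# If NaN is converted to zero when no missings were seen in training,
# these two predictions should be identical.
print(model.predict(row_nan), model.predict(row_zero))
```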
Thank you @guolinke, no problem! Ok, finally got it, thanks! :D Basically, in categoricals you are always considering it as belonging to the "other" non-split categories. Regarding the numericals, that seems like imputation to the mean, but assuming only that large values are less likely than smaller ones. Would it be feasible to apply mean/median imputation based on the train data for that feature? Or even base it on already-computed train statistics like the mode of the histogram? Thanks :)
Yeah, mean/median is a better solution than zero-fill.
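Until something like that exists in the library, mean/median imputation can be approximated outside the model; a sketch, assuming `X_train`, `X_score`, and a trained `model` as in the discussion above:

```python
import numpy as np

def fit_medians(X_train):
    # Per-feature medians computed once on the training data.
    return np.nanmedian(X_train, axis=0)

def impute(X, medians):
    # Replace NaNs with the train median of their column before predicting.
    X = np.array(X, dtype=float, copy=True)
    rows, cols = np.nonzero(np.isnan(X))
    X[rows, cols] = medians[cols]
    return X

# Usage (X_train / X_score / model are placeholders from the discussion):
# preds = model.predict(impute(X_score, fit_medians(X_train)))
```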
I believe you are right, thank you so much for all the clarifications @guolinke :)
Hello,
Suppose I stick to `zero_as_missing=false`, `use_missing=true`. Can you explain what happens during prediction if there are missing values? I read a bit of the code, but those parameters are only used in training, not scoring.
The only reference I saw in the documentation regarding missing values was:
According to those sources, nulls are allocated to the bins that reduce the loss during training.
Is that true? And if so, what if there are no missing values in training?
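For contrast, when missing values are present in training, a sketch with made-up data suggests how the learned default direction shows up at prediction time (`use_missing` and `zero_as_missing` are the real parameter names; the data and effect size are arbitrary):

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] > 0).astype(float)
miss = rng.random(1000) < 0.3
X[miss, 0] = np.nan        # inject missing values into feature 0
y[miss] = 1.0              # make "missing" informative on purpose

model = lgb.train({"objective": "binary", "use_missing": True,
                   "zero_as_missing": False, "verbose": -1},
                  lgb.Dataset(X, label=y), num_boost_round=20)

# NaN at prediction time follows the default direction learned in
# training, so this should score close to 1:
print(model.predict(np.array([[np.nan, 0.0]])))
```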