What happens with missing values during prediction? #2921
Of course, prediction handles all missing values. LightGBM/include/LightGBM/tree.h Lines 527 to 539 in 9654f16
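For context, here is a minimal Python paraphrase of how I read that routing logic (the `missing_type` and `default_left` names mirror the tree.h fields, but this sketch is my assumption, not the actual C++):

```python
import math

def numerical_decision(fval, threshold, missing_type, default_left):
    # Hedged sketch of the prediction-time routing for a numerical split;
    # missing_type is one of "none", "zero", "nan" (names are illustrative).
    if missing_type != "nan" and math.isnan(fval):
        fval = 0.0  # no NaN handling was learned: treat NaN as zero
    if (missing_type == "zero" and fval == 0.0) or \
       (missing_type == "nan" and math.isnan(fval)):
        # missing values follow the default direction learned in training
        return "left" if default_left else "right"
    return "left" if fval <= threshold else "right"
```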
Hello @guolinke, I already read a bit and still have some doubts that I can't get past. During train:
During prediction, when a missing value is found for a numerical field:
During prediction, when a missing value is found for a categorical field:
Thank you!
(1) yes, see LightGBM/include/LightGBM/tree.h Lines 260 to 264 in a8c1e0a
(5) For categorical features, the split is unordered (both {(1, 3), (2, 4, nan)} and {(2, 4, nan), (1, 3)} are possible for a categorical feature, but not for a numerical one). Therefore, forcing the missing values to the right side is okay.
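To make the unordered point concrete, a tiny sketch (the numeric labels are hypothetical categories and the string "nan" stands in for the missing bucket):

```python
# The two writings of the same categorical split from the comment above:
split_a = frozenset({frozenset({1, 3}), frozenset({2, 4, "nan"})})
split_b = frozenset({frozenset({2, 4, "nan"}), frozenset({1, 3})})

# They are the same unordered bipartition, so which side "holds" the
# missing bucket is a convention; LightGBM can force it to the right.
assert split_a == split_b
```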
Sorry but I got confused reading (3) & (4).
Just to be sure, you're saying that during scoring, missing values for a numerical split:
Correct? :) (a)
LightGBM/include/LightGBM/tree.h Lines 260 to 262 in cc6a2f5
(updated to latest code above) That is only if we choose not to handle missings. I'm not sure I'm following your notation here:
@AlbertoEAF For (5), yes.
Thank you @guolinke, I already understood those. However, I don't understand the choice of nan allocation in the case where there were no missing values in train. For numerical splits, as you explained, values below the split threshold must go to the left side and values above it to the right. Why then, by default, do we allocate missing values to the left?
As for categorical splits, since they have no order, what dictates whether a new value seen in train goes to the left or the right at all?
@AlbertoEAF "by default" doesn't mean they are always that way.
Sorry @guolinke, let me try to articulate a bit better! Looks huge, I know, but I hope it's simpler and more explicit :)

Assumptions and problem statement

My questions concern only the particular case where we didn't have missing values in train but do have missing values in scoring. This means the model has no prior information on the distribution of missing values. Everything I say in the rest of the post assumes the paragraph above! Seeing a missing value in scoring, the model will place the missing value to:

Questions

Missing values with numerical features

For numericals, the L and R sides are ordered, where non-missing values respect:
and during scoring we allocate the missing value to the L side by default because we have no prior information about missing values.

Question #1:

Missing values with categorical features

For categoricals, missing values will be placed on the R side at all times in this scenario. I don't understand, however, what happens for categoricals in train, and that is what determines the meaning of placing a missing value on the R side in scoring.

How is the L vs R side controlled/chosen in train?

Let's assume that:
To find the optimal split, LightGBM sorts by objective function and finds that the optimal split is {A} vs {B or C}.

Question #2:
@guolinke can you clarify? I think I have a proposal to improve scores on missing values, but first I need to know if I understood the current algorithm :)
@AlbertoEAF sorry for the late response, very busy recently... For numerical features, if no missing value is seen in training, the missing value will be converted to zero and then checked against the threshold. So it is not always the left side.
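A quick way to verify that zero-conversion behavior (a sketch with made-up data; any small dataset without NaNs in training would do):

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))          # training data contains no NaNs
y = (X[:, 0] > 0).astype(float)

model = lgb.train({"objective": "binary", "verbose": -1},
                  lgb.Dataset(X, label=y), num_boost_round=20)

row_nan = np.array([[np.nan, 0.5, -0.2]])
row_zero = np.array([[0.0, 0.5, -0.2]])

# If NaN is converted to zero when no missings were seen in training,
# these two predictions should be identical.
print(model.predict(row_nan), model.predict(row_zero))
```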
Thank you @guolinke, no problem! Ok, finally got it, thanks! :D Basically, in categoricals you are always considering it as belonging to the "other" non-split categories. Regarding the numericals, that seems like imputation to the mean, but assuming only that large values are less likely than smaller ones. Would it be feasible to apply mean/median imputation based on the train data for that feature? Or even base it on already-computed train statistics like the mode of the histogram? Thanks :)
Yeah, mean/median is a better solution than zero-fill.
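Until something like that exists in the library, mean/median imputation can be approximated outside the model; a sketch, assuming `X_train`, `X_score`, and a trained `model` as in the discussion above:

```python
import numpy as np

def fit_medians(X_train):
    # Per-feature medians computed once on the training data.
    return np.nanmedian(X_train, axis=0)

def impute(X, medians):
    # Replace NaNs with the train median of their column before predicting.
    X = np.array(X, dtype=float, copy=True)
    rows, cols = np.nonzero(np.isnan(X))
    X[rows, cols] = medians[cols]
    return X

# Usage (X_train / X_score / model are placeholders from the discussion):
# preds = model.predict(impute(X_score, fit_medians(X_train)))
```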
I believe you are right, thank you so much for all the clarifications @guolinke :)
Hello,
Suppose I stick to `zero_as_missing=false`, `use_missing=true`. Can you explain what happens during prediction if there are missing values? I read a bit of the code, but those parameters are only used in training, not scoring.
The only reference I saw in the documentation regarding missing values was:
According to those sources, nulls are allocated to the bins that reduce the loss during training.
Is that true? And if so, what if there are no missing values in training?
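For contrast, when missing values are present in training, a sketch with made-up data suggests how the learned default direction shows up at prediction time (`use_missing` and `zero_as_missing` are the real parameter names; the data and effect size are arbitrary):

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] > 0).astype(float)
miss = rng.random(1000) < 0.3
X[miss, 0] = np.nan        # inject missing values into feature 0
y[miss] = 1.0              # make "missing" informative on purpose

model = lgb.train({"objective": "binary", "use_missing": True,
                   "zero_as_missing": False, "verbose": -1},
                  lgb.Dataset(X, label=y), num_boost_round=20)

# NaN at prediction time follows the default direction learned in
# training, so this should score close to 1:
print(model.predict(np.array([[np.nan, 0.0]])))
```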