Categorical features with large int cause segmentation fault #1359

qmick · 2018-05-05T07:06:05Z

Environment info

Operating System: Ubuntu server 16.04 64bit
CPU: Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz * 2
C++/Python/R version: Python 3.5.2

Error Message:

Python output:

/home/zhang/.local/lib/python3.5/site-packages/lightgbm/basic.py:1038: UserWarning: categorical_feature in Dataset is overridden. New categorical_feature is ['item_id', 'user_id']
warnings.warn('categorical_feature in Dataset is overridden. New categorical_feature is {}'.format(sorted(list(categorical_feature))))
[LightGBM] [Warning] Met negative value in categorical features, will convert it to NaN
[LightGBM] [Warning] Met negative value in categorical features, will convert it to NaN
[LightGBM] [Warning] Met negative value in categorical features, will convert it to NaN
[1] 69368 segmentation fault python3 train.py

GDB output:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `python3 train.py'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 LightGBM::BinMapper::FindBin (this=, values=, num_sample_values=, total_sample_cnt=3, max_bin=255, min_data_in_bin=3, min_split_data=20,
bin_type=LightGBM::CategoricalBin, use_missing=true, zero_as_missing=false) at /home/zhang/lightgbm/LightGBM/src/io/bin.cpp:322
322 if (distinct_values_int[0] == 0) {
[Current thread is 1 (Thread 0x7f22a3957700 (LWP 44144))]

Reproducible examples

import lightgbm as lgb
import pandas as pd

data = {'user_id':[4505772604969228686, 2692638157208937547, 5247924392014515924],
       'item_id': [3412720377098676069, 3412720377098676069, 3412720377098676069]}
df = pd.DataFrame(data=data)

lgb_train = lgb.Dataset(df, label=[0, 1, 1])
params = {
    'objective': 'binary',
    'metric': 'binary_logloss'
}

gbm = lgb.train(params, lgb_train, categorical_feature=['user_id', 'item_id'])

Steps to reproduce

Run the example above.

Possible reason

This looks like a conversion error from Python int to C++ int: large Python ints become negative on the C++ side. If every value in a DataFrame column is too large, which is common for ID features, all of them are treated as missing values. The vector distinct_values_int is then empty, and accessing distinct_values_int[0] causes an access violation.

Using sklearn.preprocessing.LabelEncoder works around the problem. But I think this should be fixed, or should at least raise a Python error instead of a segmentation fault, since the crash kills the Python notebook kernel.
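For readers who want to see what the re-encoding workaround amounts to, here is a minimal pure-Python sketch. It maps each distinct ID to a small integer code in sorted order, which is the same idea LabelEncoder applies. The helper name `encode_ids` is hypothetical and is not part of LightGBM or scikit-learn.

```python
def encode_ids(values):
    """Map each distinct ID to a small integer code (0, 1, 2, ...).

    Hypothetical helper: codes are assigned in sorted order of the IDs.
    Returns the encoded list and the ID -> code mapping.
    """
    mapping = {v: code for code, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


# The same IDs as in the reproducible example above.
user_ids = [4505772604969228686, 2692638157208937547, 5247924392014515924]
item_ids = [3412720377098676069, 3412720377098676069, 3412720377098676069]

user_codes, user_map = encode_ids(user_ids)
item_codes, item_map = encode_ids(item_ids)

print(user_codes)  # [1, 0, 2]
print(item_codes)  # [0, 0, 0]
```

With a DataFrame, the resulting mapping could be applied with something like df['user_id'].map(user_map) before building the Dataset. Keep the mapping so that the same codes can be reused at prediction time.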

guolinke · 2018-05-05T23:23:59Z

@StrikerRUS I think we can check this on the Python side.

@qmick For categorical features, using consecutive integers starting from zero is the most efficient way for LightGBM. Also, we only support 32-bit ints on the C++ side. When the range exceeds 32 bits, using a categorical feature is very slow (as are other solutions).
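A Python-side check of the kind suggested here could look roughly like the sketch below. It is only an illustration under stated assumptions: `check_categorical_range` is a hypothetical helper, not LightGBM's actual validation code, and it assumes the limit is the signed 32-bit maximum. Negative values are left alone, because the log above shows LightGBM already converts them to NaN.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit int can hold


def check_categorical_range(columns):
    """Raise ValueError if any categorical value exceeds the int32 range.

    `columns` maps a feature name to a list of integer category values.
    Hypothetical helper, not LightGBM's actual validation code.
    """
    for name, values in columns.items():
        too_large = [v for v in values if v > INT32_MAX]
        if too_large:
            raise ValueError(
                "categorical feature {!r} has {} value(s) above {}; "
                "re-encode it as small integers first".format(
                    name, len(too_large), INT32_MAX))


data = {'user_id': [4505772604969228686, 2692638157208937547, 5247924392014515924],
        'item_id': [3412720377098676069, 3412720377098676069, 3412720377098676069]}

try:
    check_categorical_range(data)
except ValueError as err:
    # The caller gets a readable Python error instead of a crash.
    print(err)
```

Running such a check before the data reaches the C++ side would turn the segmentation fault into an ordinary exception that a notebook can display.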

StrikerRUS · 2018-05-06T00:34:26Z

@guolinke I'll try, but I can't promise to do it quickly.

StrikerRUS self-assigned this May 6, 2018

StrikerRUS mentioned this issue May 18, 2018

[docs] documented crash in case categorical values is bigger max int32 #1376

Merged

StrikerRUS closed this as completed in #1376 May 21, 2018

lock bot locked as resolved and limited conversation to collaborators Mar 12, 2020
