Operating System: Ubuntu server 16.04 64bit
CPU: Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz * 2
C++/Python/R version: Python 3.5.2
Error Message:
Python output:
/home/zhang/.local/lib/python3.5/site-packages/lightgbm/basic.py:1038: UserWarning: categorical_feature in Dataset is overridden. New categorical_feature is ['item_id', 'user_id']
warnings.warn('categorical_feature in Dataset is overridden. New categorical_feature is {}'.format(sorted(list(categorical_feature))))
[LightGBM] [Warning] Met negative value in categorical features, will convert it to NaN
[LightGBM] [Warning] Met negative value in categorical features, will convert it to NaN
[LightGBM] [Warning] Met negative value in categorical features, will convert it to NaN
[1] 69368 segmentation fault python3 train.py
GDB output:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `python3 train.py'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 LightGBM::BinMapper::FindBin (this=, values=, num_sample_values=, total_sample_cnt=3, max_bin=255, min_data_in_bin=3, min_split_data=20,
bin_type=LightGBM::CategoricalBin, use_missing=true, zero_as_missing=false) at /home/zhang/lightgbm/LightGBM/src/io/bin.cpp:322
322 if (distinct_values_int[0] == 0) {
[Current thread is 1 (Thread 0x7f22a3957700 (LWP 44144))]
This seems to be caused by an integer conversion error between Python and C++: large Python ints become negative on the C++ side. If every value in a DataFrame column is too large, which is common for ID features, all of them are treated as missing values. The vector distinct_values_int is then empty, and reading distinct_values_int[0] causes an access violation.
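To make the suspected chain of events concrete, here is a small pure-Python illustration. It is only a sketch: the exact conversion path inside LightGBM is not shown in this report, so `ctypes.c_int32` is used here as a stand-in for "force the value into a 32-bit signed int", and the ID values are made up.

```python
import ctypes


def to_int32(value):
    # Stand-in for a 32-bit signed conversion; not LightGBM's actual code path.
    return ctypes.c_int32(value).value


# Hypothetical ID column where every value is past the 32-bit signed range.
ids = [3_000_000_000, 3_500_000_000, 4_000_000_000]

wrapped = [to_int32(v) for v in ids]

# Negative categorical values are converted to NaN (see the warnings above),
# so none of them survive as a distinct category.
distinct_values_int = sorted({v for v in wrapped if v >= 0})

print(wrapped)
print(distinct_values_int)
```

In Python, indexing the empty list would raise an IndexError. In C++, reading element 0 of an empty vector is undefined behavior, which would be consistent with the SIGSEGV at bin.cpp:322.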
Encoding the column with sklearn.preprocessing.LabelEncoder works around the problem. But I think this should be fixed, or should at least raise a Python error instead of a segmentation fault, since the segfault kills the Python notebook kernel.
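For anyone hitting the same crash, the workaround boils down to remapping the raw IDs to small integer codes before training. Below is a dependency-free sketch of that idea; `encode_ids` is a hypothetical helper written for this example, and the ID values are invented. LabelEncoder does the equivalent on a real column, assigning codes in sorted order of the distinct values.

```python
def encode_ids(values):
    """Map arbitrary IDs to contiguous integer codes 0..n-1 (hypothetical helper)."""
    classes = sorted(set(values))                   # distinct IDs, sorted
    index = {v: i for i, v in enumerate(classes)}   # ID -> small code
    return [index[v] for v in values], classes


# Made-up IDs that are all too large for a 32-bit signed int.
user_ids = [9_000_000_001, 3_000_000_000, 9_000_000_001, 7_500_000_000]

codes, classes = encode_ids(user_ids)
print(codes)    # codes to pass to LightGBM as the categorical column
print(classes)  # lookup table to map codes back to the original IDs
```

The encoded column only contains values from 0 to the number of distinct IDs minus one, so it stays well inside the 32-bit range as long as there are fewer than about 2 billion distinct IDs.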
@StrikerRUS I think we can check this on the Python side.
@qmick For categorical features, contiguous integers starting from zero are the most efficient encoding for LightGBM. We only support 32-bit ints on the C++ side. When the range exceeds 32 bits, using the column as a categorical feature is very slow (as are other solutions).
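A Python-side check along those lines could look roughly like the sketch below. This is not LightGBM's actual code: `check_categorical_values` and its error message are hypothetical, and it assumes the rule is simply "non-negative and within the 32-bit signed range".

```python
INT32_MAX = 2**31 - 1


def check_categorical_values(name, values):
    """Raise ValueError for values that cannot be 32-bit category codes.

    Hypothetical helper, not part of the LightGBM API.
    """
    for v in values:
        if v < 0 or v > INT32_MAX:
            raise ValueError(
                "categorical feature {!r} has value {} outside [0, {}]; "
                "encode it as contiguous integers starting from zero".format(
                    name, v, INT32_MAX
                )
            )


check_categorical_values("item_id", [0, 1, 2])  # passes silently

try:
    check_categorical_values("user_id", [3_000_000_000])
except ValueError as err:
    message = str(err)
    print(message)
```

Failing early like this would turn the segmentation fault into an ordinary Python exception, which a notebook kernel survives.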