From a0c69417ab9a9cb3227ba082f1dbf56301665e55 Mon Sep 17 00:00:00 2001 From: Nikita Titov Date: Tue, 22 May 2018 01:46:44 +0300 Subject: [PATCH] [docs] documented crash in case categorical values is bigger max int32 (#1376) * added checks for categorical features > max_int32 * added tests * fixed pylint * removed warnings about overridden categorical features * Revert "removed warnings about overridden categorical features" This reverts commit 289a426c700ce8934a526cc456a1b1cd5c621db9. * a little bit more efficient checks * added notes about max values in categorical features * Revert "a little bit more efficient checks" This reverts commit bed88830243da21a2db454873c0e308126e05732. * Revert "fixed pylint" This reverts commit a229e1563b0abc1b13de6358577abf90bd529015. * Revert "added tests" This reverts commit 299e001b7550111555b80730d673d4f225cf5f74. * Revert "added checks for categorical features > max_int32" This reverts commit 2cc7afacde7c6366644f6988ccedc344752b68c7. --- README.md | 5 ++--- docs/Advanced-Topics.rst | 6 +++--- docs/FAQ.rst | 8 ++++++++ docs/Features.rst | 2 +- docs/Parameters.rst | 2 ++ docs/Quick-Start.rst | 2 +- python-package/lightgbm/basic.py | 1 + python-package/lightgbm/engine.py | 2 ++ python-package/lightgbm/sklearn.py | 1 + 9 files changed, 21 insertions(+), 8 deletions(-) diff --git a/README.md b/README.md index 464795c2f073..a913c110f3ce 100644 --- a/README.md +++ b/README.md @@ -105,12 +105,11 @@ Microsoft Open Source Code of Conduct This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. -Reference Paper ---------------- +Reference Papers +---------------- Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 
"[LightGBM: A Highly Efficient Gradient Boosting Decision Tree](https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree)". In Advances in Neural Information Processing Systems (NIPS), pp. 3149-3157. 2017. Qi Meng, Guolin Ke, Taifeng Wang, Wei Chen, Qiwei Ye, Zhi-Ming Ma, Tieyan Liu. "[A Communication-Efficient Parallel Algorithm for Decision Tree](http://papers.nips.cc/paper/6380-a-communication-efficient-parallel-algorithm-for-decision-tree)". Advances in Neural Information Processing Systems 29 (NIPS 2016). Huan Zhang, Si Si and Cho-Jui Hsieh. "[GPU Acceleration for Large-scale Tree Boosting](https://arxiv.org/abs/1706.08359)". arXiv:1706.08359, 2017. - diff --git a/docs/Advanced-Topics.rst b/docs/Advanced-Topics.rst index 617eda16e760..563c8d5263f9 100644 --- a/docs/Advanced-Topics.rst +++ b/docs/Advanced-Topics.rst @@ -15,13 +15,13 @@ Missing Value Handle Categorical Feature Support --------------------------- -- LightGBM can offer a good accuracy when using native categorical features. Not like simply one-hot coding, LightGBM can find the optimal split of categorical features. - Such an optimal split can provide the much better accuracy than one-hot coding solution. +- LightGBM can offer good accuracy when using native categorical features. Unlike simple one-hot encoding, LightGBM can find the optimal split of categorical features. + Such an optimal split can provide much better accuracy than the one-hot encoding solution. - Use ``categorical_feature`` to specify the categorical features. Refer to the parameter ``categorical_feature`` in `Parameters <./Parameters.rst>`__. -- Converting to ``int`` type is needed first, and there is support for non-negative numbers only. +- Converting to ``int`` type is needed first, and there is support for non-negative numbers only. Also, all values should be less than ``Int32.MaxValue`` (2147483647). It is better to convert them into continuous ranges. 
- Use ``min_data_per_group``, ``cat_smooth`` to deal with over-fitting diff --git a/docs/FAQ.rst b/docs/FAQ.rst index d143ff4ab159..cb3dd2c731ad 100644 --- a/docs/FAQ.rst +++ b/docs/FAQ.rst @@ -107,6 +107,14 @@ LightGBM -------------- +- **Question 9**: When I try to specify some column as categorical by using the ``categorical_feature`` parameter, I get a segmentation fault in LightGBM. + +- **Solution 9**: Probably you're passing via the ``categorical_feature`` parameter a column with very large values, for instance, some kind of IDs. + In LightGBM, categorical features are limited to the int32 range, so you cannot pass values greater than ``Int32.MaxValue`` (2147483647) as categorical features + (see `Microsoft/LightGBM#1359 `__). You should first convert them into integers in the range from zero to the number of categories. + +-------------- + R-package ~~~~~~~~~ diff --git a/docs/Features.rst b/docs/Features.rst index e3faf6d7e8a1..391f0d43244a 100644 --- a/docs/Features.rst +++ b/docs/Features.rst @@ -63,7 +63,7 @@ So, LightGBM can use an additional parameter ``max_depth`` to limit depth of tree Optimal Split for Categorical Features ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -We often convert the categorical features into one-hot coding. +We often convert the categorical features into one-hot encoding. However, it is not a good solution in tree learner. The reason is, for the high cardinality categorical features, it will grow the very unbalance tree, and needs to grow very deep to achieve the good accuracy. diff --git a/docs/Parameters.rst b/docs/Parameters.rst index c03829d3613d..be7603769fa7 100644 --- a/docs/Parameters.rst +++ b/docs/Parameters.rst @@ -441,6 +441,8 @@ IO Parameters - **Note**: only supports categorical with ``int`` type. Index starts from ``0``. 
And it doesn't count the label column + - **Note**: all values should be less than ``Int32.MaxValue`` (2147483647) + - **Note**: negative values will be treated as **missing values** - ``predict_raw_score``, default=\ ``false``, type=bool, alias=\ ``raw_score``, ``is_predict_raw_score``, ``predict_rawscore`` diff --git a/docs/Quick-Start.rst b/docs/Quick-Start.rst index 539a37027bb4..a6f103dbafc7 100644 --- a/docs/Quick-Start.rst +++ b/docs/Quick-Start.rst @@ -29,7 +29,7 @@ Some columns could be ignored. Categorical Feature Support ~~~~~~~~~~~~~~~~~~~~~~~~~~~ -LightGBM can use categorical features directly (without one-hot coding). +LightGBM can use categorical features directly (without one-hot encoding). The experiment on `Expo data`_ shows about 8x speed-up compared with one-hot encoding. For the setting details, please refer to `Parameters <./Parameters.rst>`__. diff --git a/python-package/lightgbm/basic.py b/python-package/lightgbm/basic.py index b0fef0856dac..ae9b821dd9aa 100644 --- a/python-package/lightgbm/basic.py +++ b/python-package/lightgbm/basic.py @@ -603,6 +603,7 @@ def __init__(self, data, label=None, reference=None, If list of int, interpreted as indices. If list of strings, interpreted as feature names (need to specify ``feature_name`` as well). If 'auto' and data is pandas DataFrame, pandas categorical columns are used. + All values should be less than int32 max value (2147483647). params: dict or None, optional (default=None) Other parameters. free_raw_data: bool, optional (default=True)
If 'auto' and data is pandas DataFrame, pandas categorical columns are used. + All values should be less than int32 max value (2147483647). early_stopping_rounds: int or None, optional (default=None) Activates early stopping. The model will train until the validation score stops improving. Requires at least one validation data and one metric. If there's more than one, will check all of them. @@ -354,6 +355,7 @@ def cv(params, train_set, num_boost_round=100, If list of int, interpreted as indices. If list of strings, interpreted as feature names (need to specify ``feature_name`` as well). If 'auto' and data is pandas DataFrame, pandas categorical columns are used. + All values should be less than int32 max value (2147483647). early_stopping_rounds: int or None, optional (default=None) Activates early stopping. CV error needs to decrease at least every ``early_stopping_rounds`` round(s) to continue. diff --git a/python-package/lightgbm/sklearn.py b/python-package/lightgbm/sklearn.py index ee637256bc45..69a2a677f48f 100644 --- a/python-package/lightgbm/sklearn.py +++ b/python-package/lightgbm/sklearn.py @@ -341,6 +341,7 @@ def fit(self, X, y, If list of int, interpreted as indices. If list of strings, interpreted as feature names (need to specify ``feature_name`` as well). If 'auto' and data is pandas DataFrame, pandas categorical columns are used. + All values should be less than int32 max value (2147483647). callbacks : list of callback functions or None, optional (default=None) List of callback functions that are applied at each iteration. See Callbacks in Python API for more information.
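The docs added by this patch recommend remapping raw categorical values into a contiguous integer range from zero to the number of categories before passing them as ``categorical_feature``. A minimal sketch of that remapping using ``pandas.factorize`` (the ``user_id`` column and DataFrame here are hypothetical, not part of this patch):

```python
import pandas as pd

# Raw ID values may exceed Int32.MaxValue (2147483647), which would
# crash LightGBM if passed directly as a categorical feature.
df = pd.DataFrame({
    "user_id": [9_999_999_999, 123, 9_999_999_999, 456],
    "target": [0, 1, 0, 1],
})

# pandas.factorize assigns integer codes 0..n-1 in order of first
# appearance, giving the contiguous non-negative range LightGBM expects.
codes, uniques = pd.factorize(df["user_id"])
df["user_id_cat"] = codes

print(df["user_id_cat"].tolist())  # [0, 1, 0, 2]
print(len(uniques))                # 3
```

The remapped ``user_id_cat`` column can then be listed in ``categorical_feature`` when constructing the ``Dataset`` or calling ``fit``; keep ``uniques`` around if you need to map codes back to the original IDs at prediction time.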