Commit

[docs] documented crash in case categorical values is bigger max int32 (#1376)
* added checks for categorical features > max_int32

* added tests

* fixed pylint

* removed warnings about overridden categorical features

* Revert "removed warnings about overridden categorical features"

This reverts commit 289a426.

* a little bit more efficient checks

* added notes about max values in categorical features

* Revert "a little bit more efficient checks"

This reverts commit bed8883.

* Revert "fixed pylint"

This reverts commit a229e15.

* Revert "added tests"

This reverts commit 299e001.

* Revert "added checks for categorical features > max_int32"

This reverts commit 2cc7afa.
StrikerRUS authored May 21, 2018
1 parent 3f54429 commit a0c6941
Showing 9 changed files with 21 additions and 8 deletions.
5 changes: 2 additions & 3 deletions README.md
@@ -105,12 +105,11 @@ Microsoft Open Source Code of Conduct

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [[email protected]](mailto:[email protected]) with any additional questions or comments.

-Reference Paper
----------------
+Reference Papers
+----------------

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. "[LightGBM: A Highly Efficient Gradient Boosting Decision Tree](https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree)". In Advances in Neural Information Processing Systems (NIPS), pp. 3149-3157. 2017.

Qi Meng, Guolin Ke, Taifeng Wang, Wei Chen, Qiwei Ye, Zhi-Ming Ma, Tieyan Liu. "[A Communication-Efficient Parallel Algorithm for Decision Tree](http://papers.nips.cc/paper/6380-a-communication-efficient-parallel-algorithm-for-decision-tree)". Advances in Neural Information Processing Systems 29 (NIPS 2016).

Huan Zhang, Si Si and Cho-Jui Hsieh. "[GPU Acceleration for Large-scale Tree Boosting](https://arxiv.org/abs/1706.08359)". arXiv:1706.08359, 2017.

6 changes: 3 additions & 3 deletions docs/Advanced-Topics.rst
@@ -15,13 +15,13 @@ Missing Value Handle
Categorical Feature Support
---------------------------

-- LightGBM can offer a good accuracy when using native categorical features. Not like simply one-hot coding, LightGBM can find the optimal split of categorical features.
-  Such an optimal split can provide the much better accuracy than one-hot coding solution.
+- LightGBM can offer a good accuracy when using native categorical features. Not like simply one-hot encoding, LightGBM can find the optimal split of categorical features.
+  Such an optimal split can provide the much better accuracy than one-hot encoding solution.

- Use ``categorical_feature`` to specify the categorical features.
Refer to the parameter ``categorical_feature`` in `Parameters <./Parameters.rst>`__.

-- Converting to ``int`` type is needed first, and there is support for non-negative numbers only.
+- Converting to ``int`` type is needed first, and there is support for non-negative numbers only. Also, all values should be less than ``Int32.MaxValue`` (2147483647).
It is better to convert into continues ranges.

- Use ``min_data_per_group``, ``cat_smooth`` to deal with over-fitting
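The conversion recommended above (non-negative ``int`` codes in a contiguous range) can be sketched as follows. This is an illustration only, not part of this diff; the column name is made up:

```python
import pandas as pd

# Hypothetical raw data: a string-valued categorical column.
df = pd.DataFrame({"city": ["tokyo", "paris", "tokyo", "london", None]})

# Map the categories onto a contiguous non-negative integer range.
# pandas encodes missing values as -1, which LightGBM in turn
# treats as a missing value.
df["city"] = df["city"].astype("category").cat.codes
print(df["city"].tolist())  # [2, 1, 2, 0, -1]
```

Here ``astype("category")`` sorts the unique string values (london, paris, tokyo), so the codes are small integers well below the int32 limit.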
8 changes: 8 additions & 0 deletions docs/FAQ.rst
@@ -107,6 +107,14 @@ LightGBM

--------------

+- **Question 9**: When I'm trying to specify some column as categorical by using ``categorical_feature`` parameter, I get segmentation fault in LightGBM.
+
+- **Solution 9**: Probably you're trying to pass via ``categorical_feature`` parameter a column with very large values. For instance, it can be some IDs.
+  In LightGBM categorical features are limited by int32 range, so you cannot pass values that are greater than ``Int32.MaxValue`` (2147483647) as categorical features
+  (see `Microsoft/LightGBM#1359 <https://github.com/Microsoft/LightGBM/issues/1359>`__.). You should convert them into integer range from zero to number of categories first.
+
+--------------
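The situation described in this FAQ entry can be detected and fixed up front. A sketch with made-up 64-bit IDs (illustrative only, not part of the diff):

```python
import numpy as np

INT32_MAX = 2147483647  # Int32.MaxValue

# Hypothetical 64-bit database IDs used (incorrectly) as categorical codes.
ids = np.array([10_000_000_000, 42, 7, 42])

# Any value above the int32 range would trigger the crash described above.
print(bool((ids > INT32_MAX).any()))  # True

# Remap the raw IDs into the range [0, number_of_categories) instead.
_, codes = np.unique(ids, return_inverse=True)
print(codes.tolist())  # [2, 1, 0, 1]
```

``np.unique(..., return_inverse=True)`` assigns each distinct ID a dense index in sorted order, which is exactly the "zero to number of categories" range the answer asks for.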

R-package
~~~~~~~~~

2 changes: 1 addition & 1 deletion docs/Features.rst
@@ -63,7 +63,7 @@ So, LightGBM can use an additional parameter ``max_depth`` to limit depth of tree
Optimal Split for Categorical Features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-We often convert the categorical features into one-hot coding.
+We often convert the categorical features into one-hot encoding.
However, it is not a good solution in tree learner.
The reason is, for the high cardinality categorical features, it will grow the very unbalance tree, and needs to grow very deep to achieve the good accuracy.

2 changes: 2 additions & 0 deletions docs/Parameters.rst
@@ -441,6 +441,8 @@ IO Parameters

- **Note**: only supports categorical with ``int`` type. Index starts from ``0``. And it doesn't count the label column

+- **Note**: all values should be less than ``Int32.MaxValue`` (2147483647)

- **Note**: the negative values will be treated as **missing values**

- ``predict_raw_score``, default=\ ``false``, type=bool, alias=\ ``raw_score``, ``is_predict_raw_score``, ``predict_rawscore``
2 changes: 1 addition & 1 deletion docs/Quick-Start.rst
@@ -29,7 +29,7 @@ Some columns could be ignored.
Categorical Feature Support
~~~~~~~~~~~~~~~~~~~~~~~~~~~

-LightGBM can use categorical features directly (without one-hot coding).
+LightGBM can use categorical features directly (without one-hot encoding).
The experiment on `Expo data`_ shows about 8x speed-up compared with one-hot encoding.

For the setting details, please refer to `Parameters <./Parameters.rst>`__.
1 change: 1 addition & 0 deletions python-package/lightgbm/basic.py
@@ -603,6 +603,7 @@ def __init__(self, data, label=None, reference=None,
If list of int, interpreted as indices.
If list of strings, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas categorical columns are used.
+All values should be less than int32 max value (2147483647).
params: dict or None, optional (default=None)
Other parameters.
free_raw_data: bool, optional (default=True)
2 changes: 2 additions & 0 deletions python-package/lightgbm/engine.py
@@ -53,6 +53,7 @@ def train(params, train_set, num_boost_round=100,
If list of int, interpreted as indices.
If list of strings, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas categorical columns are used.
+All values should be less than int32 max value (2147483647).
early_stopping_rounds: int or None, optional (default=None)
Activates early stopping. The model will train until the validation score stops improving.
Requires at least one validation data and one metric. If there's more than one, will check all of them.
@@ -354,6 +355,7 @@ def cv(params, train_set, num_boost_round=100,
If list of int, interpreted as indices.
If list of strings, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas categorical columns are used.
+All values should be less than int32 max value (2147483647).
early_stopping_rounds: int or None, optional (default=None)
Activates early stopping. CV error needs to decrease at least
every ``early_stopping_rounds`` round(s) to continue.
1 change: 1 addition & 0 deletions python-package/lightgbm/sklearn.py
@@ -341,6 +341,7 @@ def fit(self, X, y,
If list of int, interpreted as indices.
If list of strings, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas categorical columns are used.
+All values should be less than int32 max value (2147483647).
callbacks : list of callback functions or None, optional (default=None)
List of callback functions that are applied at each iteration.
See Callbacks in Python API for more information.
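The ``'auto'`` behaviour these docstrings mention (pandas ``category`` columns picked up automatically) can be sketched without LightGBM itself; a pandas-only illustration, not part of the diff:

```python
import pandas as pd

# With categorical_feature='auto' and a pandas DataFrame, LightGBM
# uses the 'category' dtype columns automatically.
X = pd.DataFrame({
    "color": pd.Categorical(["red", "blue", "red", "green"]),
    "size": [1.0, 2.5, 0.5, 3.0],
})

# The underlying codes are small non-negative integers, comfortably
# below the int32 max value (2147483647) this commit documents.
print(X["color"].cat.codes.tolist())  # [2, 0, 2, 1]
```

Because pandas assigns dense codes starting at zero, ``'auto'``-detected categorical columns cannot hit the int32 overflow that raw ID columns can.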