[docs] documented crash in case categorical values are bigger than max int32 #1376
Conversation
This reverts commit 289a426.
@StrikerRUS Is the current check time-consuming, since it will iterate over all int values?
@guolinke In my opinion, yes. On my ordinary home computer the results are:
I'll try to find a more efficient solution. Also, please share your thoughts about checking test data and data loaded from a file.
@StrikerRUS In which case would we even be able to have a categorical value larger than int32? That's a huge cardinality (it should start from 0 anyway). I could see a use case when targeting users for ads, but having 2+ billion distinct users is a lot.
@StrikerRUS This seems more like a preprocessing issue than a LightGBM issue. In R, to avoid such issues, we use a rule generator that can be applied to new datasets: https://github.com/Microsoft/LightGBM/blob/master/R-package/R/lgb.prepare_rules.R It transforms categorical features into numeric features starting from 0 (0 is NA, 1..cardinality are the categorical values) so they can be used afterwards as categorical features in LightGBM and other libraries. This issue exists for many other machine learning libraries.
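For illustration only, a rough Python analogue of that remapping idea (the column name, the pandas-based approach, and the 0-for-NA convention below are assumptions for the sketch, not the actual lgb.prepare_rules implementation):

import pandas as pd

# Hypothetical raw data with a string-valued categorical column.
df = pd.DataFrame({'user_id': ['a', 'b', None, 'a', 'c']})

# Map each distinct category to a small consecutive integer code:
# pandas gives -1 for NA and 0..cardinality-1 for the levels, so shifting
# by 1 reproduces the "0 is NA, 1..cardinality are categories" convention.
codes = df['user_id'].astype('category').cat.codes
df['user_id'] = (codes + 1).astype('int32')

# Storing this category -> code mapping (the "rule") and re-applying it to new
# data keeps train and test encodings consistent and far below the int32 limit.
print(df)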
@Laurae2 The same is done only for pandas in the Python-package at present. In my opinion it's not a LightGBM issue either, but at least we should document it.
@StrikerRUS @Laurae2
@guolinke I've just found a way to speed it up, but the difference is practically unnoticeable. So what's the conclusion? Drop all the checks and enhance the documentation, right?
@StrikerRUS
@guolinke OK. Then I'll commit a slightly more efficient check and ask you to measure the time cost on a real dataset of significant size (unfortunately, I can't do it right now because my SSD is completely full).
Done! @guolinke
Whatever the time cost measurements show, notes about the maximum allowed values in categorical features have been added to the docs and docstrings in the last commit.
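As a user-side workaround (a minimal sketch only, not part of the library or of this PR; the helper name and the assumption of a dense 2-D numpy array are illustrative), the categorical columns can be checked before constructing the Dataset:

import numpy as np

MAX_INT32 = (1 << 31) - 1

def check_categoricals(data, categorical_indices):
    # Fail early if any categorical value would overflow int32,
    # mirroring the check discussed in this PR.
    if categorical_indices is not None and len(categorical_indices) != 0:
        cat_max = data[:, list(categorical_indices)].max()
        if cat_max > MAX_INT32:
            raise ValueError('Categorical value {} exceeds the int32 range; '
                             'remap categories to small consecutive integers.'.format(cat_max))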
python-package/lightgbm/basic.py (outdated)
""" | ||
Initialize data from a CSR matrix. | ||
""" | ||
if len(csr.indices) != len(csr.data): | ||
raise ValueError('Length mismatch: {} vs {}'.format(len(csr.indices), len(csr.data))) | ||
self.handle = ctypes.c_void_p() | ||
if categorical_indices is not None and len(categorical_indices) != 0 and csr[:, list(categorical_indices)].max() > MAX_INT32: |
Will this be faster than csr.data.max()?
@StrikerRUS case 1, without col indices:

import numpy as np
import scipy.sparse as sp
import time

MAX_INT32 = (1 << 31) - 1
loop = 200

data = np.random.rand(1000000, 100) * 100
start = time.time()
for _ in range(loop):
    t = data.max()
end = time.time()
print((end - start) / loop)

data = sp.rand(100000, 10000, 0.01).tocsr()
start = time.time()
for _ in range(loop):
    t = data.max()
end = time.time()
print((end - start) / loop)

data = sp.rand(100000, 10000, 0.01).tocsc()
start = time.time()
for _ in range(loop):
    t = data.max()
end = time.time()
print((end - start) / loop)

All of them are fast; the time cost is
case 2, with col indices:

import numpy as np
import scipy.sparse as sp
import time

MAX_INT32 = (1 << 31) - 1
loop = 20

data = np.random.rand(1000000, 100) * 100
catcols = list(range(0, 10)) + list(range(50, 60)) + list(range(80, 100))
start = time.time()
for _ in range(loop):
    t = data[:, catcols].max()
end = time.time()
print((end - start) / loop)

data = sp.rand(100000, 10000, 0.01).tocsr()
start = time.time()
for _ in range(loop):
    t = data[:, catcols].max()
end = time.time()
print((end - start) / loop)

data = sp.rand(100000, 10000, 0.01).tocsc()
start = time.time()
for _ in range(loop):
    t = data[:, catcols].max()
end = time.time()
print((end - start) / loop)

It is about 10x slower for dense and CSR:
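One possible way to avoid that 10x slowdown in the common case, shown here only as a sketch (it assumes values above MAX_INT32 are rare, and the helper name is made up for illustration): run the cheap global max first, and pay for the column slice only when the global maximum already exceeds the limit.

import scipy.sparse as sp

MAX_INT32 = (1 << 31) - 1

def categorical_overflow(data, categorical_indices):
    # Cheap global check first: if no stored value exceeds MAX_INT32,
    # the categorical columns cannot exceed it either.
    if data.max() <= MAX_INT32:
        return False
    # Only in the rare overflow case pay for the column slicing.
    return data[:, list(categorical_indices)].max() > MAX_INT32

mat = sp.rand(100000, 10000, 0.01).tocsr()
print(categorical_overflow(mat, list(range(0, 10))))  # False: all values are in [0, 1)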
@guolinke Many thanks! So what's the conclusion? Remove all the checks and leave only the notes in the docs, right?
Yeah, sure.
Done!
Closed #1359.
This PR doesn't fix the case when a large categorical value comes from a file (i.e. when data is a string with the path to the file). @guolinke is it possible to check this on the Python side? Also, I don't know whether we need to check the data for the prediction task: if there are no such large values in the train dataset, what are the chances of meeting them in the test dataset?
In addition, I have doubts about the necessity of these time-consuming checks...