[RFC] Automatically re-order columns of dataframe-like inputs #6682
Comments
I think the most complete discussion about reordering features was in the following thread: #4909 (comment). Have you checked it?
Thanks very much for the write-up! +1 to @StrikerRUS's comment. I still feel as I did then (#4909 (comment)) ... I'm not convinced that LightGBM should contain code like this. LightGBM model objects (and model files) store feature names in order, so it's always possible to recover the feature names from a model. That means that applications that want this behavior can do something like the following (Python example), which feels pretty lightweight to me:

    import lightgbm as lgb

    # Booster: index the prediction frame by the feature names stored in the model
    bst = lgb.train(params={}, train_set=lgb.Dataset(X_train, label=y))
    bst.predict(X_pred[bst.feature_name()])

    # scikit-learn: the fitted estimator exposes the same names as feature_name_
    reg = lgb.LGBMRegressor().fit(X_train, y)
    reg.predict(X_pred[reg.feature_name_])

That's simple because that prediction code can be hard-coded to expect feature names in the model and input dataframe. But it wouldn't be that simple as an official part of LightGBM. For something officially part of the package, we have to account for a wider range of possible situations and raise helpful errors / warnings for some cases. Think about the complexity that would be needed for situations like these:
Things that would change my mind
If many other similarly popular projects already do this for users (I don't know), I think it'd strengthen the case for adding such behavior to LightGBM.
You may be interested in a similar discussion in scikit-learn, see scikit-learn/scikit-learn#14251. There, the discussion about an entry gate for dataframes, the [...]. So the use case here could also be solved by putting together a pipeline with a `ColumnTransformer` and a LightGBM model.
Thanks for your extensive summary @jameslamb! Unfortunately, I cannot comment on any of (1), (2), or (3) since I haven't used any of these libraries recently.
In general, I think this is a very valid argument to not include this in the package. Unless there's more information on the three points listed by @jameslamb, I'd also be fine with closing this issue.
Summary
As far as I can tell, LightGBM currently does not ensure that, when passing a data frame (e.g. `pandas`) to the `predict` method, the feature ordering is the same as it was in the call to `train`. It would be desirable if LightGBM would remember the ordering of features and automatically re-order data frame columns passed to `predict` if necessary.

Motivation
On a conceptual level, the ordering of data frame columns is typically irrelevant -- columns are referred to by name rather than by index. As a result, any system that accepts data frames as input should, ideally, be agnostic to column ordering.

Description
For any dataframe-like input (i.e. `pandas.DataFrame`, `pyarrow.Table`, in the future possibly `polars.DataFrame`), LightGBM should remember the ordering of columns in its call to `train` and later use the stored ordering to re-arrange input data frames in the `predict` method. The time complexity of doing this is expected to be linear in the number of columns while no data should be copied in the process.

References
I think the aspect of column ordering has been discussed in a number of issues already. I did not find a concrete proposal to automatically re-order input columns.
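The re-ordering step proposed in the Description could be sketched roughly as follows. This is only an illustration, not LightGBM code: `align_feature_order` is a hypothetical helper name, and the policy shown (reject missing or duplicated columns, silently ignore extra ones) is an assumption. The work is a single pass over the column names, so it is linear in the number of columns.

```python
from typing import Dict, List, Sequence


def align_feature_order(trained: Sequence[str], incoming: Sequence[str]) -> List[int]:
    """Return positions into `incoming` that arrange its columns in `trained` order.

    Hypothetical helper: extra columns in `incoming` are ignored, while
    missing or duplicated columns raise a ValueError.
    """
    # Map each incoming column name to its position, rejecting duplicates.
    position: Dict[str, int] = {}
    for idx, name in enumerate(incoming):
        if name in position:
            raise ValueError(f"duplicate column in input: {name!r}")
        position[name] = idx

    # Every feature seen during training must be present in the input.
    missing = [name for name in trained if name not in position]
    if missing:
        raise ValueError(f"input is missing trained features: {missing}")

    # One lookup per trained feature gives the re-ordering.
    return [position[name] for name in trained]


trained_names = ["age", "income", "city"]
incoming_names = ["city", "age", "income"]
order = align_feature_order(trained_names, incoming_names)
```

With pandas, the resulting positions could then be applied as `X_pred.iloc[:, order]`. Whether that selection avoids copying data depends on the dataframe library.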