[RFC] Automatically re-order columns of dataframe-like inputs #6682

borchero · 2024-10-15T10:41:07Z

Summary

As far as I can tell, LightGBM currently does not ensure that, when passing a data frame (e.g. pandas) to the predict method, the feature ordering is the same as it was in the call to train. It would be desirable if LightGBM would remember the ordering of features and automatically re-order data frame columns passed to predict if necessary.

Motivation

On a conceptual level, the ordering of data frame columns is typically irrelevant -- columns are referred to by name rather than by index. As a result, any system that accepts data frames as input should, ideally, be agnostic to column ordering.

Description

For any dataframe-like input (i.e. pandas.DataFrame, pyarrow.Table, in the future possibly polars.DataFrame), LightGBM should remember the ordering of columns in its call to train and later use the stored ordering to re-arrange input data frames in the predict method. The time complexity of doing this is expected to be linear in the number of columns while no data should be copied in the process.
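The re-ordering described above could be sketched roughly as follows. This is an illustration only, not LightGBM's implementation: `column_permutation` is a hypothetical helper that maps the stored training order onto the positions of the incoming columns in one pass over the column names.

```python
from typing import List, Sequence


def column_permutation(trained: Sequence[str], incoming: Sequence[str]) -> List[int]:
    """Return positions in `incoming` that reproduce the order of `trained`.

    Hypothetical helper for illustration; not part of the LightGBM API.
    Runs in time linear in the number of columns (one dict build, one pass).
    Columns in `incoming` that the model never saw are simply not selected.
    """
    position = {name: i for i, name in enumerate(incoming)}
    if len(position) != len(incoming):
        raise ValueError("incoming column names are not unique")
    missing = [name for name in trained if name not in position]
    if missing:
        raise ValueError(f"columns missing from input: {missing}")
    return [position[name] for name in trained]


trained_order = ["age", "income", "city"]
incoming_order = ["city", "age", "income"]
perm = column_permutation(trained_order, incoming_order)
print(perm)  # [1, 2, 0]
```

With pandas, such a permutation could be applied as `X_pred.iloc[:, perm]`. Whether that avoids copying the underlying data depends on the dataframe library and its memory layout, so this sketch does not by itself guarantee the zero-copy goal.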

References

I think the aspect of column ordering has been discussed in a number of issues already. I did not find a concrete proposal to automatically re-order input columns.

StrikerRUS · 2024-10-15T11:35:49Z

@borchero

> I think the aspect of column ordering has been discussed in a number of issues already. I did not find a concrete proposal to automatically re-order input columns.

I think the most complete discussion about reordering features was in the following thread: #4909 (comment).

Have you checked it?

jameslamb · 2024-10-16T04:04:48Z

Thanks very much for the write-up!

+1 to @StrikerRUS's comment. I still feel as I did then (#4909 (comment)): I'm not convinced that LightGBM should contain code like this.

LightGBM model objects (and model files) store feature names in order, so it's always possible to recover the feature names from a model. That means that applications that want this behavior can do something like the following (Python example), which feels pretty lightweight to me:

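import lightgbm as lgb

# Booster: index the prediction frame by the stored feature names
bst = lgb.train(params={}, train_set=lgb.Dataset(X_train, label=y))
bst.predict(X_pred[bst.feature_name()])

# scikit-learn: feature_name_ holds the training column order
reg = lgb.LGBMRegressor().fit(X_train, y)
reg.predict(X_pred[reg.feature_name_])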

That's simple because that prediction code can be hard-coded to expect feature names in the model and input dataframe.

But it wouldn't be that simple as an official part of LightGBM. For something officially part of the package, we have to account for a wider range of possible situations and raise helpful errors / warnings for some cases. Think about the complexity that would be needed for situations like these:

- model has feature names but input does not
- input has feature names but model does not
  - (or, more precisely, has the LightGBM-default ones like Column_1, Column_2, etc.)
- slightly different subsetting or methods of listing column names across pandas / polars / pyarrow (and maybe versions of those libraries)
- concerns related to #4909 ([python-package] check feature names in predict with dataframe (fixes #812))
  - would we put this behavior behind a feature flag so that latency-sensitive applications can opt out of it?
  - if so, would it be exactly validate_features or would it be a new, dedicated argument like align_columns=True?
  - if separate, does align_columns=True necessarily imply validate_features=True?
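To make the cases above more concrete, here is a rough sketch of the branching an official implementation might need. Everything in it is hypothetical: `plan_alignment`, the returned action strings, and the assumption that auto-generated names match `Column_<number>` are illustrative only and do not describe LightGBM's actual behavior. The `align_columns` parameter just mirrors the flag name floated above.

```python
import re
from typing import Optional, Sequence

# Assumption: auto-generated feature names look like "Column_0", "Column_1", ...
_DEFAULT_NAME = re.compile(r"^Column_\d+$")


def plan_alignment(
    model_names: Sequence[str],
    input_names: Optional[Sequence[str]],
    align_columns: bool = True,
) -> str:
    """Decide what to do with an input before prediction.

    Hypothetical helper, not part of LightGBM.
    Returns "passthrough" or "reorder", or raises ValueError.
    """
    if not align_columns:
        return "passthrough"  # latency-sensitive callers opt out
    if input_names is None:
        return "passthrough"  # e.g. a NumPy array: nothing to align by
    if all(_DEFAULT_NAME.match(name) for name in model_names):
        return "passthrough"  # model was trained without real names
    missing = set(model_names) - set(input_names)
    if missing:
        raise ValueError(f"input is missing features: {sorted(missing)}")
    if list(input_names) == list(model_names):
        return "passthrough"  # already in training order
    return "reorder"


print(plan_alignment(["a", "b"], ["b", "a"]))  # reorder
print(plan_alignment(["Column_0", "Column_1"], ["b", "a"]))  # passthrough
```

Even this small sketch has to pick a policy for each ambiguous case (for example, silently passing through when the model only has default names), which is the kind of decision that would need to be settled and documented before shipping such a feature.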

Things that would change my mind

If many other similarly-popular projects already do this for users (I don't know), I think it'd strengthen the case for adding such behavior to LightGBM.

- do scikit-learn estimators work this way? Is doing this in predict() a part of what scikit-learn expects compatible estimators to do?
- do xgboost or catboost work this way?
- what about R packages, do they behave this way? (especially {bonsai}, {xgboost}, or {catboost})

lorentzenchr · 2024-11-14T07:29:04Z

You may be interested in a similar discussion in scikit-learn, see scikit-learn/scikit-learn#14251. There, the discussion about the entry point for dataframes, the ColumnTransformer, was resolved by handling column order inside the transformer. Note, however, that ColumnTransformer is the only one with this logic; all other transformers/estimators expect the same order of columns in fit and predict.

So the use case here could also be solved by putting together a pipeline with a ColumnTransformer and a LightGBM model.

borchero · 2024-11-26T21:18:08Z

Thanks for your extensive summary @jameslamb! Unfortunately, I cannot comment on any of (1), (2), and (3) since I haven't used any of these libraries recently.

> LightGBM model objects (and model files) store feature names in order, so it's always possible to recover the feature names from a model

In general, I think this is a very valid argument to not include this in the package.

Unless there's more information on the three points listed by @jameslamb, I'd also be fine with closing this issue.

jameslamb added the feature request label Oct 16, 2024

jameslamb changed the title from "RFC: Automatically re-order columns of dataframe-like inputs" to "[RFC] Automatically re-order columns of dataframe-like inputs" Oct 16, 2024
