[RFC] Automatically re-order columns of dataframe-like inputs #6682

Open
borchero opened this issue Oct 15, 2024 · 4 comments

Comments

@borchero
Collaborator

Summary

As far as I can tell, LightGBM currently does not ensure that, when a data frame (e.g. pandas) is passed to the predict method, the feature ordering is the same as it was in the call to train. It would be desirable for LightGBM to remember the ordering of features and automatically re-order data frame columns passed to predict if necessary.

Motivation

On a conceptual level, the ordering of data frame columns is typically irrelevant -- columns are referred to by name rather than by index. As a result, any system that accepts data frames as input should, ideally, be agnostic to column ordering.

Description

For any dataframe-like input (i.e. pandas.DataFrame, pyarrow.Table, in the future possibly polars.DataFrame), LightGBM should remember the column ordering from the call to train and later use the stored ordering to re-arrange input data frames in the predict method. This is expected to take time linear in the number of columns, and no data should be copied in the process.
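As a minimal sketch of the bookkeeping this proposal implies (not LightGBM's implementation), the stored training order can be turned into an index permutation with one dictionary lookup per column, which keeps the cost linear in the number of columns. The helper name `column_permutation` and the example column names are hypothetical; whether applying the permutation copies data depends on the dataframe library.

```python
from typing import List, Sequence


def column_permutation(trained: Sequence[str], received: Sequence[str]) -> List[int]:
    """Return positions in `received`, ordered to match `trained`.

    Hypothetical helper for illustration; not part of LightGBM.
    """
    # One pass to index the incoming columns by name: O(len(received)).
    position = {name: i for i, name in enumerate(received)}
    missing = [name for name in trained if name not in position]
    if missing:
        raise ValueError(f"input is missing trained features: {missing}")
    # One dictionary lookup per trained feature: O(len(trained)).
    return [position[name] for name in trained]


trained_order = ["age", "income", "city"]
received_order = ["city", "age", "income"]

perm = column_permutation(trained_order, received_order)
# Applying the permutation to the received names recovers the trained order.
reordered = [received_order[i] for i in perm]
```

The permutation itself only touches column names. With pandas, the equivalent selection is usually written `df[trained_order]`, and pyarrow offers `table.select(trained_order)`.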

References

I think the aspect of column ordering has been discussed in a number of issues already. I did not find a concrete proposal to automatically re-order input columns.

@StrikerRUS
Collaborator

@borchero

I think the aspect of column ordering has been discussed in a number of issues already. I did not find a concrete proposal to automatically re-order input columns.

I think the most complete discussion about reordering features was in the following thread: #4909 (comment).

Have you checked it?

@jameslamb
Collaborator

jameslamb commented Oct 16, 2024

Thanks very much for the write-up!

+1 to @StrikerRUS's comment. I still feel as I did then (#4909 (comment)) ... I'm not convinced that LightGBM should contain code like this.

LightGBM model objects (and model files) store feature names in order, so it's always possible to recover the feature names from a model. That means that applications that want this behavior can do something like the following (Python example), which feels pretty lightweight to me:

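import lightgbm as lgb

# X_train and X_pred are pandas DataFrames; y holds the training labels

# Booster (lgb.train() requires params; an empty dict uses the defaults)
bst = lgb.train(params={}, train_set=lgb.Dataset(X_train, label=y))
# select columns in the order the model was trained on
bst.predict(X_pred[bst.feature_name()])

# scikit-learn
reg = lgb.LGBMRegressor().fit(X_train, y)
reg.predict(X_pred[reg.feature_name_])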

That's simple because that prediction code can be hard-coded to expect feature names in the model and input dataframe.

But it wouldn't be that simple as an official part of LightGBM. For something officially part of the package, we have to account for a wider range of possible situations and raise helpful errors / warnings for some cases. Think about the complexity that would be needed for situations like these:

  • model has feature names but input does not
  • input has feature names but model does not
    • (or, more precisely, has the LightGBM-default ones like Column_0, Column_1, etc.)
  • slightly different subsetting or methods of listing column names across pandas / polars / pyarrow (and maybe versions of those libraries)
  • concerns related to [python-package] check feature names in predict with dataframe (fixes #812) #4909
    • would we put this behavior behind a feature flag so that latency-sensitive applications can opt out of it?
    • if so, would it be exactly validate_features or would it be a new, dedicated argument like align_columns=True?
    • if separate, does align_columns=True necessarily imply validate_features=True?
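To make the first few situations above concrete, here is a rough sketch of one possible policy. The function `plan_alignment` is hypothetical and not part of LightGBM. It assumes that a model trained without names reports defaults of the form `Column_0`, `Column_1`, ..., and it arbitrarily chooses to fall back to positional matching whenever names are unavailable on either side. A real implementation would also have to decide whether to warn in those cases.

```python
from typing import List, Optional, Sequence


def plan_alignment(
    model_names: Sequence[str],
    input_names: Optional[Sequence[str]],
) -> Optional[List[str]]:
    """Decide how to align input columns with a model (hypothetical policy).

    Returns the column names to select, in model order, or None when the
    input should be used positionally as-is.
    """
    # Assumption: unnamed models report Column_0, Column_1, ... as names.
    default_names = [f"Column_{i}" for i in range(len(model_names))]
    model_is_unnamed = list(model_names) == default_names

    # Names are unavailable on one side: nothing to align by.
    if input_names is None or model_is_unnamed:
        return None

    # Both sides have names: every model feature must be present.
    missing = sorted(set(model_names) - set(input_names))
    if missing:
        raise ValueError(f"input is missing features: {missing}")

    # Extra input columns are silently dropped by selecting model names only.
    return list(model_names)


named_plan = plan_alignment(["a", "b"], ["b", "extra", "a"])
unnamed_input_plan = plan_alignment(["a", "b"], None)
unnamed_model_plan = plan_alignment(["Column_0", "Column_1"], ["x", "y"])
```

When the returned plan is not None, a caller holding a pandas data frame could pass `X_pred[named_plan]` to predict. Even this small sketch has to take a position on missing columns, extra columns, and unnamed inputs, which illustrates the added surface area described above.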

Things that would change my mind

If many other similarly-popular projects already do this for users (I don't know), I think it'd strengthen the case for adding such behavior to LightGBM.

  • do scikit-learn estimators work this way? Is doing this in predict() a part of what scikit-learn expects compatible estimators to do?
  • do xgboost or catboost work this way?
  • what about R packages, do they behave this way? (especially {bonsai}, {xgboost}, or {catboost})

@jameslamb changed the title from "RFC: Automatically re-order columns of dataframe-like inputs" to "[RFC] Automatically re-order columns of dataframe-like inputs" on Oct 16, 2024

@lorentzenchr
Contributor

You may be interested in a similar discussion in scikit-learn, see scikit-learn/scikit-learn#14251. There, the discussion about the entry gate for dataframes, the ColumnTransformer, was resolved by handling column order inside the transformer. But note that ColumnTransformer is the only one with this logic; all other transformers/estimators expect the same order of columns in fit and predict.

So the use case here could also be solved by putting together a pipeline with a ColumnTransformer and a LightGBM model.

@borchero
Collaborator Author

Thanks for your extensive summary @jameslamb! Unfortunately, I cannot comment on any of (1), (2), and (3) since I haven't used any of these libraries recently.

LightGBM model objects (and model files) store feature names in order, so it's always possible to recover the feature names from a model

In general, I think this is a very valid argument to not include this in the package.


Unless there's more information on the three points listed by @jameslamb, I'd also be fine with closing this issue.
