
Luwen-Zhang/tabular_ensemble


tabular_ensemble


A framework to evaluate various models for tabular regression and classification tasks. The package integrates 25 machine learning (including deep learning) models for tabular prediction tasks from the following well-established model bases:

  • autogluon
    • "LightGBM", "CatBoost", "XGBoost", "Random Forest", "Extremely Randomized Trees", "K-Nearest Neighbors", "Linear Regression", "Neural Network with MXNet", "Neural Network with PyTorch", "Neural Network with FastAI".
  • pytorch_widedeep
    • "TabMlp", "TabResnet", "TabTransformer", "TabNet", "SAINT", "ContextAttentionMLP", "SelfAttentionMLP", "FTTransformer", "TabPerceiver", "TabFastFormer".
  • pytorch_tabular
    • "Category Embedding", "NODE", "TabNet", "TabTransformer", "AutoInt", "FTTransformer".

You can implement your own models, data processing pipelines, and datasets within this flexible, well-tested framework and compare them consistently against baseline models. This is especially easy when your model is based on PyTorch.


Supported features for all model bases:

  • Data processing
    • Data splitting (training/validation/testing sets)
    • Data imputation
    • Data filtering
    • Data scaling
    • Data augmentation
    • Feature augmentation
    • Feature selection
    • etc.
  • Multi-modal data
  • Loading UCI datasets
  • Data/result analysis
    • Leaderboard
    • Box plot
    • Pair plot
    • Pearson correlation
    • Partial dependency plot (with bootstrapping)
    • Feature importance (Permutation and SHAP)
    • etc.
  • Building models upon other trained models
  • pytorch_lightning-based training for pytorch models
  • Gaussian-process-based Bayesian hyperparameter optimization
  • Cross-validation (including continuing from a cross-validation checkpoint)
  • Saving, loading, and migrating models
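
To make two of the listed data-processing steps concrete, here is a minimal standard-library sketch of a training/validation/testing split followed by standard scaling. It does not use the tabensemb API: the function names `split_indices`, `fit_standard_scaler`, and `apply_scaler`, the 60/20/20 ratios, and the seed are all assumptions chosen for illustration. The point it shows is that scaling statistics are estimated on the training set only and then reused for the other sets.

```python
import random
import statistics


def split_indices(n_samples, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle row indices and cut them into train/validation/test parts."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(ratios[0] * n_samples)
    n_val = round(ratios[1] * n_samples)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


def fit_standard_scaler(values):
    """Return (mean, std) estimated from the given (training) values."""
    mean = statistics.fmean(values)
    # Fall back to 1.0 so constant columns do not cause a division by zero.
    std = statistics.pstdev(values) or 1.0
    return mean, std


def apply_scaler(values, mean, std):
    """Scale values with statistics that were fitted elsewhere."""
    return [(v - mean) / std for v in values]


# Usage: fit on the training rows, then apply to the held-out rows.
rng = random.Random(1)
column = [rng.gauss(10.0, 3.0) for _ in range(100)]
train_idx, val_idx, test_idx = split_indices(len(column))
mean, std = fit_standard_scaler([column[i] for i in train_idx])
scaled_test = apply_scaler([column[i] for i in test_idx], mean, std)
```

Fitting the scaler on the training rows alone keeps information from the validation and testing sets out of preprocessing, which is what makes comparisons between models fair.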

The package stands on the shoulders of giants.

Installation/Usage

Full documentation is available here. For a quick start:

  1. tabular_ensemble can be installed from PyPI by running the following command:
pip install tabensemb[torch]

Please use pip install tabensemb instead if you already have torch>=1.12.0 installed. Use pip install tabensemb[test] if you want to run unit tests.

To install from source, run the following from the repository root:

pip install -e .[torch]
  2. (Optional) Run the unit tests after installing tabensemb[test]:
cd test
pytest .
  3. Place your .csv or .xlsx file in a data subfolder (e.g., data/sample.csv), and create a configuration file in a configs subfolder (e.g., configs/sample.py) with the following content:
cfg = {
    "database": "sample",
    "continuous_feature_names": ["cont_0", "cont_1", "cont_2", "cont_3", "cont_4"],
    "categorical_feature_names": ["cat_0", "cat_1", "cat_2"],
    "label_name": ["target"],
}
  4. Run the experiment with this configuration and data:
python main.py --base sample --epoch 10

where --base selects the configuration file (configs/sample.py in this example), and any additional arguments (such as --epoch here) correspond to the entries in config/default.py.
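
If you have no dataset at hand, the following sketch writes a synthetic data/sample.csv and a matching configs/sample.py so the quick-start steps can be tried end to end. It uses only the Python standard library, not tabensemb itself. The column names and config keys come from the example configuration above; the helper name `write_sample_project`, the category values "a"/"b"/"c", the row count, and the way the target is computed are arbitrary assumptions.

```python
import csv
import json
import random
from pathlib import Path

# Column names taken from the example configuration in this README.
CONT = [f"cont_{i}" for i in range(5)]
CAT = [f"cat_{i}" for i in range(3)]
LABEL = "target"


def write_sample_project(root=".", n_rows=200, seed=0):
    """Write data/sample.csv and configs/sample.py under `root` (hypothetical helper)."""
    root = Path(root)
    data_path = root / "data" / "sample.csv"
    config_path = root / "configs" / "sample.py"
    data_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.parent.mkdir(parents=True, exist_ok=True)

    rng = random.Random(seed)
    with data_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(CONT + CAT + [LABEL])  # header row
        for _ in range(n_rows):
            cont = [rng.gauss(0.0, 1.0) for _ in CONT]
            cat = [rng.choice(["a", "b", "c"]) for _ in CAT]
            # Arbitrary synthetic relationship, only so the target is learnable.
            target = sum(cont) + rng.gauss(0.0, 0.1)
            writer.writerow([f"{v:.6f}" for v in cont] + cat + [f"{target:.6f}"])

    # json.dumps produces list literals that are also valid Python.
    config_text = (
        "cfg = {\n"
        '    "database": "sample",\n'
        f'    "continuous_feature_names": {json.dumps(CONT)},\n'
        f'    "categorical_feature_names": {json.dumps(CAT)},\n'
        f'    "label_name": {json.dumps([LABEL])},\n'
        "}\n"
    )
    config_path.write_text(config_text)
    return data_path, config_path
```

Calling `write_sample_project()` from the project root creates both files, after which the `python main.py --base sample --epoch 10` command above can be run against them.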

See the documentation pages for details.
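
As a rough illustration of how command-line arguments such as --base and --epoch can map onto a dictionary of defaults, here is a generic argparse sketch. It is not tabensemb's actual implementation: the `DEFAULTS` dictionary is a hypothetical stand-in for config/default.py (only `epoch` is mentioned in this README; the other keys and all values are invented), and `parse_overrides` is a made-up helper name.

```python
import argparse

# Hypothetical stand-in for config/default.py; the real keys and values may differ.
DEFAULTS = {"epoch": 100, "batch_size": 32, "lr": 0.001}


def parse_overrides(argv, defaults=DEFAULTS):
    """Return (base, merged_config) with CLI values taking priority over defaults."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--base", required=True, help="name of the configuration file")
    for key, value in defaults.items():
        # Reuse the type of each default (int or float here) to parse its override.
        parser.add_argument(f"--{key}", type=type(value), default=None)
    args = vars(parser.parse_args(argv))
    base = args.pop("base")
    merged = dict(defaults)
    # Only arguments that were actually passed replace the defaults.
    merged.update({k: v for k, v in args.items() if v is not None})
    return base, merged


base, config = parse_overrides(["--base", "sample", "--epoch", "10"])
```

In this sketch, `base` names the configuration to load and `config` holds the defaults with `epoch` replaced by the command-line value; entries that were not passed keep their default values.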

Citation

If you use this repository, please cite us as:

(To be updated once the work is released on arXiv or published.)