We selected the benchmark datasets based on papers that apply machine learning and deep learning methods to tabular data. These datasets are hosted on different platforms, and depending on the platform, users might need to download some of them manually. To get a list of datasets, run the following command:
python -m xtime.main dataset list
Each dataset has a name and a version (<dataset_name>[:<version>]). The version is defined in the context of this project and can represent various preprocessed versions of the original dataset. Two common version identifiers are:
- default: preprocessed dataset that might contain categorical features.
- numerical: dataset where features have been converted to numerical values.
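The naming scheme above can be illustrated with a small parser. This is only a sketch: `parse_dataset_id` and `DatasetId` are hypothetical names that are not part of xtime, and falling back to the default version when no version is given is an assumption, not documented behavior.

```python
from typing import NamedTuple


class DatasetId(NamedTuple):
    """Hypothetical container for the two parts of a dataset identifier."""
    name: str
    version: str


def parse_dataset_id(identifier: str, fallback_version: str = "default") -> DatasetId:
    """Split '<dataset_name>[:<version>]' into a name and a version."""
    name, sep, version = identifier.partition(":")
    if not name:
        raise ValueError(f"Dataset name is missing in {identifier!r}")
    if sep and not version:
        raise ValueError(f"Version is empty in {identifier!r}")
    # Assumption: an omitted version means the 'default' version.
    return DatasetId(name, version if sep else fallback_version)


print(parse_dataset_id("churn_modelling:numerical"))
print(parse_dataset_id("eye_movements"))
```

The dataset names used here come from the table in this document; the project may resolve identifiers differently.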
Datasets are downloaded automatically when users call the respective APIs (e.g., build_dataset). Downloaded datasets are most likely cached, so subsequent calls do not trigger the download process again. If the user environment is behind a proxy firewall, users might need to configure proxy servers by exporting two environment variables: HTTP_PROXY and HTTPS_PROXY.
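When datasets are built from a Python session rather than from a shell, the same two variables can be set programmatically before the first download. The sketch below assumes a hypothetical proxy address (`proxy.example.com:8080`) and a hypothetical helper name, `configure_proxy`; neither comes from xtime.

```python
import os
from typing import MutableMapping


def configure_proxy(proxy_url: str, env: MutableMapping[str, str] = os.environ) -> None:
    """Point both proxy variables at the same proxy server."""
    for name in ("HTTP_PROXY", "HTTPS_PROXY"):
        env[name] = proxy_url


# Hypothetical proxy address; replace it with the one used in your network.
configure_proxy("http://proxy.example.com:8080")
print(os.environ["HTTP_PROXY"], os.environ["HTTPS_PROXY"])
```

The variables only affect downloads started after this call in the same process (or in child processes that inherit the environment).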
OpenML datasets are downloaded with the minio library, which does not seem to use proxy servers. xtime patches this library on the fly to make sure that proxy servers are used.
Datasets hosted on the Kaggle platform are not downloaded automatically. Users need to download them manually and copy them to the appropriate locations. The respective dataset builders provide this information.
The following is the list of datasets as of 2023.03.29. The doc strings in the source code provide more information, including references to the publications we used to identify the required preprocessing and hyperparameter search spaces.
Dataset | Task | Input Shape | Output | Versions | Source | Download |
---|---|---|---|---|---|---|
churn_modelling | binary classification | (10000, 10) | num_classes=2 | default, numerical | Kaggle | manual (~/.cache/kaggle/datasets/shrutime) |
eye_movements | multi-class classification | (10936, 26) | num_classes=3 | default, numerical | OpenML | automatic |
forest_cover_type | multi-class classification | (581012, 54) | num_classes=7 | default, numerical | Scikit-Learn | automatic |
gas_concentrations | multi-class classification | (13910, 129) | num_classes=6 | default, numerical | OpenML | automatic |
gesture_phase_segmentation | multi-class classification | (9873, 32) | num_classes=5 | default, numerical | OpenML | automatic |
rossmann_store_sales | regression | (610235, 29) | num_outputs=1 | default, numerical | Kaggle | manual (~/.cache/kaggle/datasets/rossmann_store_sales) |
telco_customer_churn | binary classification | (7032, 19) | num_classes=2 | default, numerical | Kaggle | manual (~/.cache/kaggle/datasets/blastchar) |
year_prediction_msd | regression | (515345, 90) | num_outputs=1 | default | UCI ML Repository | automatic |
wisdm | multi-class classification | varies | num_classes=6 | default | WISDM - WIreless Sensor Data Mining | manual (download the v1.1 dataset, extract it, and export the XTIME_DATASETS_WISDM environment variable pointing to the dataset location) |
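
Before running benchmarks, it can help to check that the manually downloaded datasets are in place. The following sketch uses the locations listed in the table above. The helper `missing_manual_datasets` is hypothetical and not part of xtime. It only tests that the paths exist, not that their contents are complete, and treating XTIME_DATASETS_WISDM as a filesystem path is an assumption.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional

# Subdirectories of ~/.cache/kaggle/datasets, copied from the table above.
KAGGLE_DIRS = {
    "churn_modelling": "shrutime",
    "rossmann_store_sales": "rossmann_store_sales",
    "telco_customer_churn": "blastchar",
}


def missing_manual_datasets(
    home: Optional[Path] = None, env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return names of manual-download datasets whose location does not exist."""
    home = Path.home() if home is None else home
    env = os.environ if env is None else env
    kaggle_root = home / ".cache" / "kaggle" / "datasets"
    missing = [
        name for name, subdir in KAGGLE_DIRS.items()
        if not (kaggle_root / subdir).exists()
    ]
    # Assumption: the variable holds a path to the extracted WISDM dataset.
    wisdm_path = env.get("XTIME_DATASETS_WISDM")
    if not wisdm_path or not Path(wisdm_path).exists():
        missing.append("wisdm")
    return missing


print("Missing manual datasets:", missing_manual_datasets())
```

An empty list means every manual location was found; the dataset builders remain the authoritative source for what each location must contain.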