The `xtime.training` subproject was developed to train gradient boosting decision tree models and optimize their hyperparameters. It can be used via its Python API, but it was primarily designed around its command-line interface. It has been tested with the XGBoost, LightGBM, CatBoost, and Scikit-Learn implementations. The project uses MLflow to track experiment results and Ray Tune to optimize hyperparameters. The tabular datasets supported out of the box are ones we found in papers; new datasets are added from time to time (including datasets beyond tabular data, for instance, time series datasets).
```shell
git clone https://github.com/HewlettPackard/X-TIME.git
cd ./X-TIME/training
```
It is assumed that all example commands in this README file run in the `xtime.training` root directory (the one that contains `pyproject.toml` and this README file).
This project was tested with several versions of Python, including 3.9.18. Some or all of the available features have been tested in Windows and Linux (Ubuntu) environments. Some notebook features (such as Pandas GUI) will not work on headless Linux systems.
The project provides the `environment.yml` file to create a `conda` Python virtual environment with the minimal dependencies needed to install the project's dependencies (this project uses `poetry` to manage its dependencies). Alternatively, users can use other mechanisms to create Python environments (e.g., `virtualenv`). With `conda` and `environment.yml`, the process looks like this:
```shell
conda env create --file ./environment.yml
conda activate xtime-training
```
The `xtime.training` project uses `poetry` to manage its dependencies, which include mandatory and optional ones. The optional dependencies are all machine learning libraries (such as `catboost`) and some packages used by datasets (such as `openml` and `tsfresh`). Optional dependencies are specified as `extra` dependencies.
```shell
# Install all mandatory and optional dependencies (including XGBoost, CatBoost, LightGBM and all other libraries
# needed for preprocessing data)
poetry install --all-extras

# Optionally, install only needed dependencies (select all or only needed dependency groups)
poetry install --extras "catboost xgboost"
```
This project defines the following extra dependencies (this list may be outdated; please look at `pyproject.toml` for detailed and, possibly, more accurate information):

- `catboost`: CatBoost ML library.
- `xgboost`: XGBoost ML library.
- `lightgbm`: LightGBM library.
- `timeseries`: Packages to create ML datasets for time series data.
- `openml`: Library required by several datasets hosted on the OpenML site.
- `datasets`: Install packages to ensure all datasets can be loaded (includes [`openml`, `timeseries`]).
- `models`: Install packages to ensure all models can be loaded (includes [`catboost`, `xgboost`, `lightgbm`]).
- `rapids-12`: Support for GPU-accelerated ML algorithms (random forest). This is for CUDA 12.x.x and Python 3.9 only.
- `all`: Install all extra packages (except `rapids-12`).
A note from one of my Jupyter notebooks (I do not recall the reason for it): if on Windows, run the following cell. The second command can fail; if it does, go to your Python directory and run it from there. Once done, restart the kernel.

```shell
!pip install pypiwin32
!python Scripts\pywin32_postinstall.py -install
```
```shell
# Activate the same Python environment, with conda:
conda activate xtime-training

# Go to the `xtime.training` subproject.
cd ./training

# Update python path to be able to import `xtime.training` modules.
export PYTHONPATH=$(pwd)

# Run Jupyter Lab.
jupyter lab --ip='0.0.0.0' --port=8888 --no-browser --allow-root
```
There are no special requirements, except that users may need to provide information about their proxy servers. The `xtime` project may fail to download some or all datasets if proxy servers are not configured properly. To configure proxy servers, export the two environment variables `HTTP_PROXY` and `HTTPS_PROXY`. In addition, the `Environment variables` section of this file briefly describes which environment variables users can or should use to enable or configure certain features of this project.
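For example, on Linux the proxy variables can be exported before running `xtime` commands (the proxy URL below is a placeholder; substitute your organization's actual proxy server):

```shell
# Placeholder proxy URL: replace with your organization's actual proxy server.
export HTTP_PROXY=http://proxy.example.com:8080
export HTTPS_PROXY=http://proxy.example.com:8080
```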
For a list of datasets, Jupyter notebooks, and preprocessing pipelines, see this README file. Jupyter notebooks with exploratory data analysis (EDA) are available in the `notebooks` directory.
The `xtime.estimators` package contains a number of machine learning models, including `CatBoost`, `LightGBM`, `XGBoost`, and some models from the Scikit-Learn ML library. The `xtime` project uses Ray Tune for running hyperparameter search experiments.
As of 2023.03.28, the following models are present (`python -m xtime.main models list`):
- catboost
- lightgbm
- dummy
- rf
- xgboost
In addition to these models, the `xtime.training` project supports random forest models from NVIDIA's cuML library. Due to dependency issues, this library is not installed by default and is not available as an extra dependency (so it cannot be installed in the same way as other libraries such as CatBoost). To enable cuML models (which have the `rapids-rf` name), the following conditions must be met (this is what we tested this library with):
- Python 3.9.x runtime.
- NVIDIA GPU with CUDA 12.x runtime.
We tested version 23.8.0 of the cuML libraries, which are available from NVIDIA's PyPI. Run the following command in the `xtime.training` Python environment to install these libraries:
```shell
pip install https://pypi.nvidia.com/rmm-cu12/rmm_cu12-23.8.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl \
    https://pypi.nvidia.com/cudf-cu12/cudf_cu12-23.8.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl \
    https://pypi.nvidia.com/dask-cudf-cu12/dask_cudf_cu12-23.8.0-py3-none-any.whl \
    https://pypi.nvidia.com/cuml-cu12/cuml_cu12-23.8.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl \
    https://pypi.nvidia.com/raft-dask-cu12/raft_dask_cu12-23.8.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl \
    https://pypi.nvidia.com/ucx-py-cu12/ucx_py_cu12-0.33.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl \
    https://pypi.nvidia.com/pylibraft-cu12/pylibraft_cu12-23.8.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
```
Once done, rerun the `python -m xtime.main models list` command. You should now see the `rapids-rf` model in the list.
Hyperparameters (HPs) are specified with the `--params` command-line option (as opposed to datasets and models, which are positional arguments). By default, no hyperparameters are used, which means ML models use the default values specified in their frameworks (such as `xgboost`). This also means that HP search requires users to provide HPs on the command line. There can be multiple sources of hyperparameters, and multiple sources can be used together (in this case, HPs are loaded and merged with previously loaded HPs in the order they are specified on the command line).
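As a rough sketch (this is illustrative, not the actual `xtime` implementation), the merge behavior can be thought of as a left-to-right dictionary update, where values from later sources override earlier ones:

```python
from typing import Any, Dict


def merge_hparams(*sources: Dict[str, Any]) -> Dict[str, Any]:
    """Merge HP dictionaries in the order given; later sources override earlier ones."""
    merged: Dict[str, Any] = {}
    for source in sources:
        merged.update(source)
    return merged


# E.g., defaults from an "auto:default:..." source, then a user override:
defaults = {"learning_rate": 0.3, "n_estimators": 100}
overrides = {"learning_rate": 0.4}
print(merge_hparams(defaults, overrides))
# {'learning_rate': 0.4, 'n_estimators': 100}
```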
In the context of this project, the following terms are used for hyperparameters:

- Hyperparameters (`HParams`): a generic term that refers to HP values, search spaces, or specifications (see below).
- Hyperparameter specification (`HParamsSpec`): defines a hyperparameter's default value, type, and the distribution to sample from, which is considered the prior distribution for this parameter.
- Hyperparameter space (`HParamsSpace`): a hyperparameter search space. See the Ray Tune introduction to search spaces.
In the source code (and, to some extent, on the command line (see below)), a collection of hyperparameters is always represented as a dictionary. Keys are names of hyperparameters. Values, depending on the representation, can be values, value specs, or prior value distributions.
The following is supported:

- `--params="auto:{NAME}:{P1}={V1};{P2}={V2};...;{PN}={VN}"`: Load hyperparameters from a source named `{NAME}` that is supported by the `xtime.training` project, providing additional context with the key-value pairs `(P1,V1)`, `(P2,V2)`, ..., `(PN,VN)`. This is reserved for internal hyperparameter recommender engines, where `{NAME}` selects the recommender engine to use and the key-value pairs serve as a filter string. The following source is supported: `default` is a default hyperparameter and hyperparameter search space recommender. Users must specify a query containing three mandatory keys: `model`, `task`, and `run_type`. Valid values for model are [`lightgbm`, `dummy`, `rf`, `catboost`, `xgboost`]; for task, [`binary_classification`, `multi_class_classification`, `regression`]; and for run_type, [`train`, `hpo`].
- `--params="mlflow:///MLFLOW_RUN_ID"`: Load HPs from an MLflow run. If this run is a hyperparameter search run, the hyperparameters associated with the best run are retrieved.
- `--params="params:lr=0.03;batch_size=128"`: Load hyperparameters from the provided string. This is a convenient way to specify HPs on the command line. The string is parsed as a dictionary: keys and values are separated by `=`, and key-value pairs are separated by `;`. Values may include functions from the `math` package, may use search spaces from the Ray `tune` library (see search spaces), and may also use the `ValueSpec` class to define a hyperparameter specification on the command line. When a value is a string, quote and escape it, e.g., `--params=params:tree_method=\"hist\"`. The implementation is not secure: it uses Python's `eval` function to parse parameter values.
- `--params="file:///mnt/space/ml/hparams.json"`: Load hyperparameters from a `JSON` file. The file should contain a dictionary with hyperparameters. Other file types (`YAML`, `YML`) are supported. The `file` scheme is optional.
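To make the `params:` syntax concrete, here is a minimal, hypothetical sketch of such a parser. The real `xtime` parser is richer (e.g., it understands Ray Tune search spaces and `ValueSpec`), but it shares the `eval` security caveat:

```python
import math


def parse_params(spec: str) -> dict:
    """Parse a 'params:key=value;...' string into a dictionary (illustrative sketch)."""
    assert spec.startswith("params:"), "expected the 'params:' prefix"
    result = {}
    for pair in spec[len("params:"):].split(";"):
        key, value = pair.split("=", 1)
        # NOTE: eval() is insecure; the real implementation shares this caveat.
        result[key] = eval(value, {"math": math})
    return result


print(parse_params("params:lr=0.03;batch_size=128"))
# {'lr': 0.03, 'batch_size': 128}
```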
The following examples demonstrate how to specify hyperparameters on a command line:

```shell
# Provide default hyperparameters
python -m xtime.main hparams query --params="auto:default:model=xgboost;task=binary_classification;run_type=hpo"

# Provide hyperparameters as a single argument
python -m xtime.main hparams query --params="params:lr=0.01;batch=tune.uniform(1, 128);n_estimators=ValueSpec(int, 100, tune.randint(100, 4001))"

# Provide hyperparameters as multiple arguments. The result is the same as above.
python -m xtime.main hparams query --params="params:lr=0.01;batch=tune.uniform(1, 128)" --params="params:n_estimators=ValueSpec(int, 100, tune.randint(100, 4001))"
```
There is one important limitation: default (`auto`) hyperparameters are used only when no hyperparameters are provided by the user. To continue using them along with user hyperparameters, provide the default HPs as the first source of HPs (observe the difference in output):

```shell
python -m xtime.main hparams query --params="auto:default:model=xgboost;task=binary_classification;run_type=hpo"
python -m xtime.main hparams query --params="params:learning_rate=0.4"
python -m xtime.main hparams query --params="auto:default:model=xgboost;task=binary_classification;run_type=hpo" --params="params:learning_rate=0.4"
```
XTIME uses MLflow to track the results of ML training and hyperparameter optimization experiments (but not data preprocessing stages; preprocessing is done on the fly at data load time). By default, MLflow uses a file-system-based store for tracking experiment artifacts and metadata. The path to this store is usually an `mlruns` subdirectory in the current working directory. The following example (for Linux, but easily adapted for Windows) shows how to start a custom instance of the MLflow tracking server:
```shell
# Maybe run in a separate screen session
screen -S mlflow_server

# This is where run artifacts (e.g., files) will be stored. The SQLite database for storing run metadata will be stored
# in the same directory. If this server will be used across multiple machines, make sure this path is mounted on all
# these machines under the same path. In addition, if multiple users will be using the same MLflow server, they all must
# have read and write access permissions.
# PS - you may need to change permissions / owner of this directory.
export ARTIFACT_STORE=/opt/mlflow

# Adjust host and port parameters as needed.
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
mlflow server --backend-store-uri sqlite:///${ARTIFACT_STORE}/mlruns.db --default-artifact-root=file://${ARTIFACT_STORE}/mlruns --host=0.0.0.0 --port=10000
```
Then, before running experiments, export the MLflow URI (or set it using the MLflow Python API):

```shell
# You may need to replace 127.0.0.1 with the actual server IP address when running the MLflow server and experiments on
# different machines.
export MLFLOW_TRACKING_URI=http://127.0.0.1:10000
```
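The same can be done from Python. A minimal sketch: MLflow honors the `MLFLOW_TRACKING_URI` environment variable, and, with the `mlflow` package installed, `mlflow.set_tracking_uri()` can be called directly instead.

```python
import os

# Equivalent of `export MLFLOW_TRACKING_URI=...`; MLflow picks this up when no
# tracking URI has been set explicitly via mlflow.set_tracking_uri().
os.environ["MLFLOW_TRACKING_URI"] = "http://127.0.0.1:10000"
```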
All entry points are implemented in the `xtime.main` module. Before running any scripts, make sure to add `${PROJECT_ROOT}/training` to your `PYTHONPATH`:

```shell
# Linux
export PYTHONPATH=${PROJECT_ROOT}/training

# Windows
set PYTHONPATH=${PROJECT_ROOT}/training
```

It is assumed in the following examples that all scripts are executed in the `${PROJECT_ROOT}/training` directory.
Start with studying the output of these commands:

```shell
# High-level commands
python -m xtime.main --help

# Xtime stages related to datasets.
python -m xtime.main dataset --help

# Xtime stages related to ML models.
python -m xtime.main models --help

# Xtime stages related to machine learning experiments.
python -m xtime.main experiment --help

# Explore hyperparameters and hyperparameter search spaces.
python -m xtime.main hparams query --help
```
The following example shows how to run a hyperparameter search experiment with random search (the command below runs a single trial; adjust `--num-search-trials`, e.g., to 100, to control the number of trials).

```shell
# This dataset contains categorical features, so we need to "override" the default tree method for XGBoost.
python -m xtime.main experiment search_hp telco_customer_churn:default xgboost random --params="auto:default:model=xgboost;task=binary_classification;run_type=hpo" --params='params:tree_method="hist"' --num-search-trials=1
```
Once the above command has executed, it prints the MLflow run ID. This ID can be used to query MLflow for the hyperparameters associated with the best run. The following commands print the best hyperparameters and describe the run:

```shell
# Assuming that an MLflow run with ID 8037062a660847edabfede3ee3f6f4dc exists (hyperparameter search run).
# Get best hyperparameters as a dictionary
python -m xtime.main hparams query --params=mlflow:///8037062a660847edabfede3ee3f6f4dc

# Describe the run (provides more detailed information)
python -m xtime.main experiment describe summary --run=mlflow:///8037062a660847edabfede3ee3f6f4dc
python -m xtime.main experiment describe best_trial --run=mlflow:///8037062a660847edabfede3ee3f6f4dc
```
The following list contains some of the environment variables, common across all experiments and runs, that may be useful for users:

- `MLFLOW_EXPERIMENT_NAME`: Set the MLflow experiment name for `xtime` runs. If this experiment does not exist, it will be created. Usage example: `export MLFLOW_EXPERIMENT_NAME=xgboost`.
- `MLFLOW_TAGS`: Provide additional tags for MLflow runs. These tags can later be used to search for specific runs. A common usage is to tag experiment rounds (e.g., round 1, round 2, round 3, etc.). The format for tags is the same as the format for hyperparameters with the `params:` protocol (but without the `params:` prefix). Usage example: `export MLFLOW_TAGS='round=9'`.
Other variables are specific to some datasets:

- `XTIME_DATASETS_WISDM`: Specify the directory location of the `wisdm` time series dataset. The dataset can be downloaded from here: https://www.cis.fordham.edu/wisdm/dataset.php. Usage example: `export XTIME_DATASETS_WISDM=/data/datasets/WISDM_ar_v1.1`.
- `NUMBA_DISABLE_CUDA`: Some of the time series dataset builders use the `tsfresh` library to compute time series features. This library depends on the `stumpy` library, which depends on `numba`, which detects the CUDA runtime and tries to use it if available. When there is a mismatch between the numba version and the CUDA runtime, an exception is raised. To instruct `numba` not to use CUDA, export this environment variable. Usage example: `export NUMBA_DISABLE_CUDA=1`.
- `XTIME_DISABLE_PATCH_MINIO`: At the time of development, we could not automatically download datasets hosted by OpenML in a corporate environment (behind a firewall) because the underlying Python library `minio` did not (still does not?) support setting proxy URLs. The dataset builders automatically patch `minio` whenever a dataset comes from OpenML. By setting this variable to 1, it is possible to disable this patching (see the `xtime.datasets.DatasetBuilder` class for more details). Usage example: `export XTIME_DISABLE_PATCH_MINIO=1`. When this patching is not disabled, `xtime` looks for a proxy server using the following ordered list of environment variables: `https_proxy`, `HTTPS_PROXY`, `http_proxy`, `HTTP_PROXY`. The first non-empty value is used.
- `XTIME_SEARCH_HP_PRE_LOAD_DATASET`: When running hyperparameter search experiments (`search_hp`), it is possible to preload the dataset before starting the search. The Ray object store is used to distribute the dataset to workers. Usage example: `export XTIME_SEARCH_HP_PRE_LOAD_DATASET=1`.
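The documented proxy lookup order can be sketched as follows (an illustrative snippet, not the actual `xtime` code):

```python
import os
from typing import Optional

# Ordered per the documentation: the first non-empty value wins.
PROXY_VARS = ["https_proxy", "HTTPS_PROXY", "http_proxy", "HTTP_PROXY"]


def find_proxy() -> Optional[str]:
    """Return the first non-empty proxy environment variable, or None."""
    for name in PROXY_VARS:
        value = os.environ.get(name, "")
        if value:
            return value
    return None
```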
Gradient boosting tree estimators (CatBoost, LightGBM, and XGBoost) will use GPUs when the `CUDA_VISIBLE_DEVICES` variable is set.

- `train`: When running a training experiment, export the `CUDA_VISIBLE_DEVICES` environment variable and point it to a single GPU to use.
- `search_hp`: When running hyperparameter optimization experiments, `CUDA_VISIBLE_DEVICES` can point to multiple devices to be used by the Ray Tune trial scheduler. In addition, to use GPUs, users must provide the `--gpu` option. Its value is a floating point number between 0 and 1 that defines how much of a GPU a single trial can potentially use. For instance, with `--gpu 0.1`, Ray Tune can schedule 10 parallel trials on a single GPU. When no value is provided (`--gpu`), it is automatically set to 1 (this is backward compatible with the previous implementation, where `--gpu` was a flag).
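The resulting trial parallelism is simple arithmetic over the per-trial GPU share; a small illustrative helper (not part of `xtime`):

```python
import math


def max_parallel_trials(num_gpus: int, gpu_per_trial: float) -> int:
    """How many trials Ray Tune could co-schedule given a per-trial GPU share."""
    return math.floor(num_gpus / gpu_per_trial)


print(max_parallel_trials(1, 0.1))   # e.g., --gpu 0.1 with one GPU -> 10 trials
print(max_parallel_trials(2, 0.25))  # two GPUs at a quarter-GPU each -> 8 trials
```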
This project can be specified as a dependency of another project using `poetry`:

```toml
[tool.poetry.dependencies]
# Adjust extra packages based on your needs.
# Remove or adjust the revision (commit).
xtime-training = { git = "https://github.com/HewlettPackard/X-TIME", rev = "69edb9e", subdirectory = "training", extras = ["datasets", "xgboost"] }
```

or using `pip` syntax (in `requirements.txt`):

```
"xtime-training[datasets,xgboost] @ git+https://github.com/HewlettPackard/X-TIME@69edb9e#subdirectory=training"
```
We will be glad to accept new features, machine learning models or datasets, documentation updates, and bug fixes. We use the following development guidelines:

- Fork this project.
- Clone your own fork and configure the Python environment by installing all dependencies (e.g., `poetry install --all-extras`).
- Create and check out a new branch (e.g., `bugfix/format-msg`, `feature/new-regression-metric`, or `docs/readme-update`).
- Fix a bug or implement a new feature, and add unit tests.
- Run `ruff` and `pyre` (fix warnings and/or errors):
  ```shell
  ruff format .
  ruff check --fix .
  pyre --search-path $(python -c 'import site; print(site.getsitepackages()[0])') check
  ```
- Run unit tests with `pytest` (or `python -X utf8 -m pytest` if pytest fails due to encoding issues; this is still to be resolved).
- Commit your changes, push to your fork, and create a pull request.

Note: on some systems, `pytest` fails for some tests (in particular, `test_cli.py -> test_dataset_describe`). There is an open issue to be resolved; meanwhile, running with `python -X utf8 -m pytest` seems to resolve it.