Came across a pistachio dataset for classification. The intent here is to use it to develop an ML pipeline (Kubeflow/Vertex AI) and deploy it on GCP.
Will develop a simple model in a JupyterLab notebook, and use that as a starting point for pipeline development.
Pistachio Image Dataset downloaded from Kaggle here.
Will use the 16-feature version, which contains 1,718 records across two pistachio types.
pandera is used for schema/data validation (a minimal schema sketch below).
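Roughly what the pandera validation looks like; the column names, ranges, and class labels here are illustrative, and the real schema covers all 16 features:

```python
import pandas as pd
import pandera as pa

# Illustrative schema: columns, checks, and class labels are assumptions;
# the real schema validates all 16 features.
schema = pa.DataFrameSchema(
    {
        "AREA": pa.Column(float, pa.Check.gt(0)),
        "ECCENTRICITY": pa.Column(float, pa.Check.in_range(0, 1)),
        "Class": pa.Column(str, pa.Check.isin(["Kirmizi_Pistachio", "Siit_Pistachio"])),
    },
    strict=False,  # let the remaining feature columns pass through
)

df = pd.DataFrame(
    {"AREA": [72563.0], "ECCENTRICITY": [0.79], "Class": ["Kirmizi_Pistachio"]}
)
validated = schema.validate(df)  # raises pa.errors.SchemaError on violations
```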
A Jupyter notebook containing the code used to train an XGBoost model for pistachio classification is here. The notebook loads, validates, and preprocesses the dataset, trains/tunes a model using Bayesian optimisation, and saves and evaluates the final model.
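A sketch of what the tuning step could look like, assuming scikit-optimize's `BayesSearchCV`; the notebook may use a different Bayesian optimisation library, and the search space below is made up:

```python
import numpy as np
from skopt import BayesSearchCV
from skopt.space import Integer, Real
from xgboost import XGBClassifier

X_train = np.random.rand(200, 16)        # stand-in for the 16 pistachio features
y_train = np.random.randint(0, 2, 200)   # stand-in for the two class labels

# Cross-validated Bayesian search over an illustrative hyperparameter space
search = BayesSearchCV(
    estimator=XGBClassifier(eval_metric="logloss", verbosity=0),
    search_spaces={
        "max_depth": Integer(2, 8),
        "learning_rate": Real(0.01, 0.3, prior="log-uniform"),
        "n_estimators": Integer(50, 300),
    },
    n_iter=25,
    cv=5,
    scoring="roc_auc",
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```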
The Kubeflow pipeline definition for model training is in training_pipeline.py. The training pipeline consists of the following steps (a rough KFP sketch follows the list):
- loads and splits data (train/test)
- validates train and test data against a defined schema
- preprocesses train and test data
- computes monitoring statistics (Population Stability Index) on the training data, storing the results as an artifact (a PSI sketch also follows the list)
- runs hyperparameter tuning using cross-validation/Bayesian optimisation
- trains the final model using the optimal parameters, storing the model in Cloud Storage and the model registry
- evaluates the final model on both the train and test datasets, storing results (metrics/plots) in Cloud Storage and as KFP ClassificationMetrics
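A hedged sketch of the pipeline shape; all component names and signatures below are illustrative stubs, with the real components in components.py and the pipeline in training_pipeline.py:

```python
from kfp import compiler, dsl

# Illustrative stub components; real implementations import functions
# baked into the Docker base image.
@dsl.component(base_image="python:3.11")
def load_and_split_data(train: dsl.Output[dsl.Dataset], test: dsl.Output[dsl.Dataset]):
    ...

@dsl.component(base_image="python:3.11")
def validate_data(train: dsl.Input[dsl.Dataset], test: dsl.Input[dsl.Dataset]):
    ...

@dsl.pipeline(name="pistachio-training")
def training_pipeline():
    split = load_and_split_data()
    validate_data(train=split.outputs["train"], test=split.outputs["test"])
    # ...followed by preprocess, PSI, tuning, final train, and evaluation steps

compiler.Compiler().compile(training_pipeline, "training_pipeline.json")
```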
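For reference, a minimal PSI computation; this assumes the monitoring component bins each feature and compares proportions, and the exact binning used in the repo may differ:

```python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI = sum over bins of (a - e) * ln(a / e), with e/a the bin proportions."""
    edges = np.histogram_bin_edges(expected, bins=bins)  # bins from the baseline data
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    # floor the proportions to avoid log(0) / division by zero in empty bins
    e = np.clip(e_counts / e_counts.sum(), 1e-6, None)
    a = np.clip(a_counts / a_counts.sum(), 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
baseline = rng.normal(size=1000)
print(population_stability_index(baseline, rng.normal(0.1, 1.1, 1000)))  # small drift
```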
Components are defined in components.py. Most of the core Python code is baked into the Docker image ("base_image" - see below), with the component definitions simply importing and invoking the relevant functions. Some components merely shuffle data/artifacts to/from GCP/Vertex AI services; these use the "gcp_aip_image". A minimal component sketch follows the image list.
- base_image - Python image holding all the ML functionality
- gcp_aip_image - Python image with libraries for interacting with GCP services
- serving_image - Python/FastAPI image for serving model predictions; loads the model artifact from Cloud Storage
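The component pattern, sketched with a hypothetical image URI, module, and function name standing in for the real base_image and the code inside it:

```python
from kfp import dsl

# Image URI and module/function names are hypothetical placeholders.
@dsl.component(
    base_image="europe-west1-docker.pkg.dev/my-project/pistachio/base_image:latest"
)
def preprocess_data(raw: dsl.Input[dsl.Dataset], prepped: dsl.Output[dsl.Dataset]):
    # the image already contains the ML code; the component just imports and calls it
    from pistachio.preprocessing import preprocess_file  # hypothetical module
    preprocess_file(input_path=raw.path, output_path=prepped.path)
```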
Scripts:
- build_images.sh - builds all of the above images (locally)
- tag_and_push.sh - creates Artifact Registry tags for each of the local images, and pushes them to Artifact Registry.
- test_images.sh - invokes all of the images, essentially running the training pipeline locally. The KFP SDK has a local runner/Docker setup for testing components; should look into that instead of test_images.sh (sketch below).
- XGBoost warnings - can be disabled in the container code via verbosity=0 or a similar flag (see below).
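On the local-testing note: recent KFP SDK versions ship a `kfp.local` module that can run components via Docker, which looks like the right replacement for test_images.sh. A minimal sketch:

```python
from kfp import dsl, local

# DockerRunner executes each component inside its declared container image,
# close to what test_images.sh does by hand (a SubprocessRunner also exists).
local.init(runner=local.DockerRunner())

@dsl.component
def add(a: int, b: int) -> int:
    return a + b

task = add(a=1, b=2)  # with local.init() active, this executes immediately
assert task.output == 3
```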
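For the warnings, both of these are real XGBoost knobs:

```python
import xgboost as xgb
from xgboost import XGBClassifier

# process-wide: silence all xgboost log output
xgb.set_config(verbosity=0)

# per-estimator: the verbosity constructor parameter does the same for one model
model = XGBClassifier(verbosity=0)
```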