Came across a pistachio dataset for classification. The intent here is to use it to develop an ML pipeline (Kubeflow/Vertex AI) and deploy it on GCP.
Will develop a simple model in a JupyterLab notebook, and use that as a starting point for pipeline development.
Pistachio Image Dataset downloaded from Kaggle here.
Will use the 16-feature version, which contains 1,718 records across two pistachio types.
pandera is used for schema/data validation (a minimal schema sketch below).
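Roughly what the pandera validation looks like; the column names, ranges, and class labels here are illustrative, and the real schema covers all 16 features:

```python
import pandas as pd
import pandera as pa

# Illustrative schema: columns, checks, and class labels are assumptions;
# the real schema validates all 16 features.
schema = pa.DataFrameSchema(
    {
        "AREA": pa.Column(float, pa.Check.gt(0)),
        "ECCENTRICITY": pa.Column(float, pa.Check.in_range(0, 1)),
        "Class": pa.Column(str, pa.Check.isin(["Kirmizi_Pistachio", "Siit_Pistachio"])),
    },
    strict=False,  # let the remaining feature columns pass through
)

df = pd.DataFrame(
    {"AREA": [72563.0], "ECCENTRICITY": [0.79], "Class": ["Kirmizi_Pistachio"]}
)
validated = schema.validate(df)  # raises pa.errors.SchemaError on violations
```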
A Jupyter notebook containing the code used to train an XGBoost model for pistachio classification is here. The notebook loads, validates, and preprocesses the dataset, trains/tunes a model using Bayesian optimisation, and saves and evaluates the final model.
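A sketch of what the tuning step could look like, assuming scikit-optimize's `BayesSearchCV`; the notebook may use a different Bayesian optimisation library, and the search space below is made up:

```python
import numpy as np
from skopt import BayesSearchCV
from skopt.space import Integer, Real
from xgboost import XGBClassifier

X_train = np.random.rand(200, 16)        # stand-in for the 16 pistachio features
y_train = np.random.randint(0, 2, 200)   # stand-in for the two class labels

# Cross-validated Bayesian search over an illustrative hyperparameter space
search = BayesSearchCV(
    estimator=XGBClassifier(eval_metric="logloss", verbosity=0),
    search_spaces={
        "max_depth": Integer(2, 8),
        "learning_rate": Real(0.01, 0.3, prior="log-uniform"),
        "n_estimators": Integer(50, 300),
    },
    n_iter=25,
    cv=5,
    scoring="roc_auc",
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```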
The Kubeflow pipeline definition for model training is in training_pipeline.py. The training pipeline consists of the following steps (a rough KFP sketch follows the list):
- loads and splits data (train/test)
- validates train and test data against a defined schema
- preprocesses train and test data
- computes monitoring statistics (Population Stability Index) on the training data, storing the results as an artifact (a PSI sketch also follows the list)
- runs hyperparameter tuning using cross-validation/Bayesian optimisation
- trains the final model using the optimal parameters, storing the model in Cloud Storage and the model registry
- evaluates the final model on both the train and test datasets, storing results (metrics/plots) in Cloud Storage and as KFP ClassificationMetrics
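A hedged sketch of the pipeline shape; all component names and signatures below are illustrative stubs, with the real components in components.py and the pipeline in training_pipeline.py:

```python
from kfp import compiler, dsl

# Illustrative stub components; real implementations import functions
# baked into the Docker base image.
@dsl.component(base_image="python:3.11")
def load_and_split_data(train: dsl.Output[dsl.Dataset], test: dsl.Output[dsl.Dataset]):
    ...

@dsl.component(base_image="python:3.11")
def validate_data(train: dsl.Input[dsl.Dataset], test: dsl.Input[dsl.Dataset]):
    ...

@dsl.pipeline(name="pistachio-training")
def training_pipeline():
    split = load_and_split_data()
    validate_data(train=split.outputs["train"], test=split.outputs["test"])
    # ...followed by preprocess, PSI, tuning, final train, and evaluation steps

compiler.Compiler().compile(training_pipeline, "training_pipeline.json")
```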
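For reference, a minimal PSI computation; this assumes the monitoring component bins each feature and compares proportions, and the exact binning used in the repo may differ:

```python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI = sum over bins of (a - e) * ln(a / e), with e/a the bin proportions."""
    edges = np.histogram_bin_edges(expected, bins=bins)  # bins from the baseline data
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    # floor the proportions to avoid log(0) / division by zero in empty bins
    e = np.clip(e_counts / e_counts.sum(), 1e-6, None)
    a = np.clip(a_counts / a_counts.sum(), 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
baseline = rng.normal(size=1000)
print(population_stability_index(baseline, rng.normal(0.1, 1.1, 1000)))  # small drift
```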
Components are defined in components.py. Most of the core Python code is baked into the Docker image ("base_image" - see below), with the component definitions simply importing and invoking the relevant functions. Some components merely shuffle data/artifacts to/from GCP/Vertex AI services; these use the "gcp_aip_image". A minimal component sketch follows the image list.
- base_image - Python image holding all the ML functionality
- gcp_aip_image - Python image with libraries for interacting with GCP services
- serving_image - Python/FastAPI image for serving model predictions; loads the model artifact from Cloud Storage
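The component pattern, sketched with a hypothetical image URI, module, and function name standing in for the real base_image and the code inside it:

```python
from kfp import dsl

# Image URI and module/function names are hypothetical placeholders.
@dsl.component(
    base_image="europe-west1-docker.pkg.dev/my-project/pistachio/base_image:latest"
)
def preprocess_data(raw: dsl.Input[dsl.Dataset], prepped: dsl.Output[dsl.Dataset]):
    # the image already contains the ML code; the component just imports and calls it
    from pistachio.preprocessing import preprocess_file  # hypothetical module
    preprocess_file(input_path=raw.path, output_path=prepped.path)
```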
Scripts:
- build_images.sh - builds all of the above images (locally)
- tag_and_push.sh - creates Artifact Registry tags for each of the local images, and pushes them to Artifact Registry.
- test_images.sh - invokes all of the images, essentially running the training pipeline locally. The KFP SDK has a local runner/Docker setup for testing components; should look into that instead of test_images.sh (sketch below).
- XGBoost warnings - can be disabled in the container code via verbosity=0 or a similar flag (see below).
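On the local-testing note: recent KFP SDK versions ship a `kfp.local` module that can run components via Docker, which looks like the right replacement for test_images.sh. A minimal sketch:

```python
from kfp import dsl, local

# DockerRunner executes each component inside its declared container image,
# close to what test_images.sh does by hand (a SubprocessRunner also exists).
local.init(runner=local.DockerRunner())

@dsl.component
def add(a: int, b: int) -> int:
    return a + b

task = add(a=1, b=2)  # with local.init() active, this executes immediately
assert task.output == 3
```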
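For the warnings, both of these are real XGBoost knobs:

```python
import xgboost as xgb
from xgboost import XGBClassifier

# process-wide: silence all xgboost log output
xgb.set_config(verbosity=0)

# per-estimator: the verbosity constructor parameter does the same for one model
model = XGBClassifier(verbosity=0)
```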