This is your new Kedro project, configured according to QuickStart ML principles. Modify this README as you develop your project; for now, it contains the basic information you need to get started. For more detailed assistance, please refer to the Kedro documentation and QuickStart ML Blueprints.
In addition to a blank Kedro template, it includes the technology stack used in the QuickStart ML approach:
- Poetry
- pre-commit hooks
- Dockerfile setup
- VSCode Dev Containers for ease of development
- MLflow integration
- Kedro integration with GCP Vertex AI (integrations with other platforms to be added)
Apart from that, there are no pre-implemented nodes or pipelines here. For blueprints showing different machine learning use cases, please go to the main QuickStart ML Blueprints repo and feel free to take as much as you need from our examples.
In order to get the best out of the template:
- Don't remove any lines from the `.gitignore` file we provide
- Make sure your results can be reproduced by following a data engineering convention
- Don't commit data to your repository
- Don't commit any credentials or your local configuration to your repository. Keep all your credentials and local configuration in `conf/local/`
Below are short instructions on how to get the environment for your new project up and running. A detailed version, with remarks and descriptions of specific cases, is available in the QuickStart ML Blueprints documentation.
- Create a service principal with the Contributor role and write down the `appId`, `password` and `tenant` values
az ad sp create-for-rbac --role="Contributor" --scopes="/subscriptions/b3427a92-a6bb-4354-9822-98b9a61bbe58"
- Use these values to fill in `arm_client_id`, `arm_secret_id` and `arm_tenant_id` in `secret.tfvars`, respectively
- Initialize Terraform
terraform init
- With `terraform` as the current directory, apply the Terraform script
terraform apply --var-file secret.tfvars
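The mapping from the service principal output to `secret.tfvars` can be sketched in Python. This is a minimal illustration, not part of the template: the helper name `render_tfvars` is hypothetical, and it assumes the default JSON output of `az ad sp create-for-rbac` with the `appId`, `password` and `tenant` fields.

```python
import json

# Maps fields printed by `az ad sp create-for-rbac` to the variable names
# this README lists for secret.tfvars.
FIELD_TO_TFVAR = {
    "appId": "arm_client_id",
    "password": "arm_secret_id",
    "tenant": "arm_tenant_id",
}


def render_tfvars(sp_json: str) -> str:
    """Render secret.tfvars content from the service principal JSON (hypothetical helper)."""
    sp = json.loads(sp_json)
    lines = []
    for field, tfvar in FIELD_TO_TFVAR.items():
        # json.dumps adds the surrounding double quotes and escapes the value
        lines.append(f"{tfvar} = {json.dumps(sp[field])}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values only; paste the real JSON printed by the az command.
    example = '{"appId": "<app-id>", "password": "<password>", "tenant": "<tenant>"}'
    print(render_tfvars(example), end="")
```

Since the result contains credentials, treat the generated file like any other secret and keep it out of the repository.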
This approach uses VSCode Dev Containers and is the easiest way to set up the development environment.
Prerequisites:
- VSCode with the Remote Development extension
- Docker with a `/workspaces` entry in Docker Desktop > Preferences > Resources > File Sharing
Setting up:
- Clone this repository and open it in a container.
- You're good to go!
The project uses pyenv for Python version management. It lets you easily install and switch between multiple versions of Python. To install pyenv, follow the installation steps for your operating system.
To install and activate a specific Python version, use these commands:
pyenv install 3.8.16
pyenv shell 3.8.16
It is recommended to create a virtual environment in your project:
python -m venv venv
source ./venv/bin/activate
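To confirm that the intended interpreter and the virtual environment are both active, a small check along these lines can help. It is only a sketch: the names `check_environment` and `EXPECTED_PYTHON` are hypothetical, and it assumes the 3.8 series installed above.

```python
import sys

EXPECTED_PYTHON = (3, 8)  # major.minor of the 3.8.16 interpreter installed above


def check_environment(version=None, prefix=None, base_prefix=None):
    """Return (version_ok, in_venv) for the given or the running interpreter."""
    version = tuple(sys.version_info[:2]) if version is None else tuple(version)
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    # Inside a venv, sys.prefix points at the venv while sys.base_prefix
    # still points at the interpreter the venv was created from.
    return version == EXPECTED_PYTHON, prefix != base_prefix


if __name__ == "__main__":
    version_ok, in_venv = check_environment()
    print(f"Python 3.8 active: {version_ok}, virtual environment active: {in_venv}")
```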
To install the libraries declared in `pyproject.toml`, you need to have Poetry installed. Install it by following the official Poetry installation instructions, then run this command:
poetry install
To add and install dependencies, use:
# dependencies
poetry add <package_name>
# dev dependencies
poetry add -D <package_name>
You can run your Kedro project with:
kedro run
To run a specific pipeline:
kedro run -p "<PIPELINE_NAME>"
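Because the template ships without nodes or pipelines, `kedro run` has nothing to execute until you add some. The following is a minimal sketch of what a first pipeline could look like, not code from the template: the function `double`, the dataset names `raw_values` and `doubled_values`, and the node name are all hypothetical, and `raw_values` would need a matching entry in the Data Catalog.

```python
def double(values):
    """Example node function: a plain Python function with no Kedro imports."""
    return [2 * v for v in values]


def create_pipeline(**kwargs):
    """Build a one-node pipeline.

    Kedro is imported lazily here so that the node function above stays
    importable (and unit-testable) without Kedro installed.
    """
    from kedro.pipeline import node, pipeline

    return pipeline(
        [
            node(
                func=double,
                inputs="raw_values",       # hypothetical dataset name
                outputs="doubled_values",  # hypothetical dataset name
                name="double_node",        # hypothetical node name
            )
        ]
    )
```

Assuming the pipeline is registered under some name in `register_pipelines()` in your project's `pipeline_registry.py`, that name is what you pass as `<PIPELINE_NAME>` to `kedro run -p`.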
- Kedro-Viz visualizes Kedro pipelines in an informative way
- to run it, execute `kedro viz --autoreload` inside the project's directory
- this will start a server on http://127.0.0.1:4141
- kedro-mlflow provides lightweight integration of `MLflow` inside `Kedro` projects
- configuration can be specified inside the `conf/<ENV>/mlflow.yml` file
- by default, experiments are saved inside the local `mlruns` directory
- to see all the local experiments, run `kedro mlflow ui`
- Log in and configure the workspace
az account set --subscription <subscription>
az configure --defaults workspace=<workspace> group=<resource-group> location=<location>
- You can get the tracking URI using the `az ml workspace` command
az ml workspace show --query mlflow_tracking_uri
- Place this URI in `mlflow.yml` under `server.mlflow_tracking_uri`
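The last two steps can be sketched as follows. This is an illustration under stated assumptions: the helper name `mlflow_server_config` is hypothetical, and the sketch assumes the Azure CLI's default JSON output, where the queried URI is printed as a quoted string. It only builds the nested mapping; writing it into `mlflow.yml` is left to you.

```python
import json


def mlflow_server_config(az_output: str) -> dict:
    """Build the mapping to merge into mlflow.yml (hypothetical helper)."""
    # With the default JSON output, the az CLI prints the queried string in quotes
    uri = json.loads(az_output)
    if not isinstance(uri, str) or not uri:
        raise ValueError("expected a non-empty tracking URI string")
    return {"server": {"mlflow_tracking_uri": uri}}


if __name__ == "__main__":
    # Placeholder value; use the real output of the `az ml workspace show` command above.
    print(mlflow_server_config('"azureml://<tracking-uri>"'))
```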
- Key Vault access policy
- storage container with project name (container name with `-`, not `_`)
- set up `az login` inside the container
- ask for subscription in the starter to fill in the README and `setup.sh` script
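The container naming note can be illustrated with a short sketch. The helper `container_name` is hypothetical and not part of the starter; it assumes the Azure Blob Storage container naming rules (3 to 63 characters, lowercase letters, digits and single hyphens, starting and ending with a letter or digit).

```python
import re


def container_name(project_name: str) -> str:
    """Derive a storage container name from a project name, using '-' not '_'."""
    name = project_name.lower().replace("_", "-")
    # Collapse repeated hyphens and trim them from both ends
    name = re.sub(r"-+", "-", name).strip("-")
    valid = re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", name) and 3 <= len(name) <= 63
    if not valid:
        raise ValueError(f"cannot derive a valid container name from {project_name!r}")
    return name


if __name__ == "__main__":
    print(container_name("my_kedro_project"))
```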
Setup flow:
- build devcontainer
.devcontainer/setup.sh
- azure is logged in
- azure subscription activated
- azure ml settings activated for mlflow
Need from starter:
- ask for subscription & location
- auto-fill terraform: use naming convention for workspace and resource group
- auto-align workspace/rg name in `.devcontainer/setup.sh`