Model training for the Metal Nut Data Set

Automating Visual Inspections with AI on Red Hat OpenShift Data Science - Hands-on tutorial

Prerequisites

OpenShift with GPU worker nodes

GPU worker nodes are not mandatory, but they are recommended if you want to train the model yourself.

Red Hatters can order the "NVIDIA GPU Operator Red Hat OpenShift Container Platform 4 Workshop". Please be aware of the costs and shut down the service when you no longer need it.
Upgrade OpenShift to 4.11.x
Install the Node Feature Discovery (NFD) Operator.
Install the NVIDIA GPU Operator.
Create the ClusterPolicy instance.
Verify the successful installation of the NVIDIA GPU Operator.

Deploy the RHODS Operator

Follow "Installing OpenShift Data Science on OpenShift Container Platform"

Provide S3 Storage

Model Serving requires an S3 bucket with an ACCESS_KEY and SECRET_KEY. If you don't already have S3 storage available (e.g. ODF or AWS), you can deploy Minio on your OpenShift cluster:

oc apply -f https://raw.githubusercontent.com/mamurak/os-mlops/master/manifests/minio/minio.yaml

Minio is deployed to the project/namespace minio.
Launch the Minio web UI (see its Route) and create a bucket (e.g. manu-vi).

Set up a RHODS workbench

Create new RHODS workbench for Ultralytics Pytorch Yolov5

Log in to the OpenShift web console
Launch RHODS via the application launcher (nine-dots) -> Red Hat OpenShift Data Science
Create a new Data Science project -> Create data science project.

If you have your own OpenShift cluster, you can name the project 'manuela-visual-inspection'. If not, add your initials, e.g. 'manu-vi-stb'. Don't choose names that are too long: project and model server names are concatenated internally, and long names can cause problems.
- Name: manu-vi
- Resource name: manu-vi
- -> Create.
Create a data connection with your S3 configuration
- Data connections -> Add Data connections.
- Name: manu-vi
- AWS_ACCESS_KEY_ID: minio
- AWS_SECRET_ACCESS_KEY: minio123
- AWS_S3_ENDPOINT: http://minio-service.minio.svc.cluster.local:9000
- AWS_DEFAULT_REGION: any value (it is ignored)
- AWS_S3_BUCKET: manu-vi
Create new RHODS workbench
- Workbenches -> Create workbench.
- Name: manu-vi
- Image: CUDA (assuming you have a cluster with an NVIDIA GPU)
- Deployment size: Small
- Number of GPUs: 1
- Cluster storage: Create new cluster storage
- Data connection: Use existing data connection -> manu-vi
- -> Create workbench.
Note: if the workbench does not start, check whether a LimitRange blocks the pod. Find the created project in the OpenShift console, navigate to Administration -> LimitRanges and delete the LimitRange that was created automatically.
Open the workbench and clone https://github.com/stefan-bergstein/manuela-visual-inspection.git (there are at least four ways to do this; pick the approach you like)
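
Inside the workbench, the fields of the manu-vi data connection are typically available as environment variables with the same names as in the dialog above. The following is a minimal sketch, assuming that behaviour and assuming boto3 is installed in the workbench image (neither is confirmed by this README). The helper names s3_settings_from_env and list_bucket_keys are hypothetical.

```python
import os


def s3_settings_from_env(env=None):
    """Collect S3 settings from data-connection style environment variables.

    Assumption: the workbench exposes the data connection fields as
    environment variables named exactly like the fields in the dialog.
    """
    env = os.environ if env is None else env
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY",
                "AWS_S3_ENDPOINT", "AWS_S3_BUCKET")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError("missing data connection variables: " + ", ".join(missing))
    client_kwargs = {
        "endpoint_url": env["AWS_S3_ENDPOINT"],
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
    }
    return client_kwargs, env["AWS_S3_BUCKET"]


def list_bucket_keys(env=None):
    """Return the object keys currently stored in the data connection's bucket."""
    import boto3  # assumption: available in the workbench image

    client_kwargs, bucket = s3_settings_from_env(env)
    s3 = boto3.client("s3", **client_kwargs)
    response = s3.list_objects_v2(Bucket=bucket)
    return [obj["Key"] for obj in response.get("Contents", [])]
```

Calling list_bucket_keys() in a notebook cell is a quick way to confirm that the workbench can reach the bucket before you start training.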

Model training

Optionally, extend shared memory for your notebook

PyTorch uses shared memory (/dev/shm) internally to exchange data between its worker processes. However, default container engine configurations limit this memory to the bare minimum, which can cause the process to exhaust it and crash. The solution is either to increase this memory manually by mounting an emptyDir volume, or to run the model training without PyTorch workers (which slows down the training).

Patch the Notebook as described here: README.md. You might have to adapt the namespace/name to match your setup.
If the workbench is not automatically restarted, stop and start your workbench in your Data Science Project.
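
To see whether the patch took effect, you can check the size of /dev/shm from a notebook cell. This is a minimal sketch using only the standard library. The 1024 MiB threshold and the default of 4 workers are illustrative assumptions, not values taken from the notebook, and both function names are hypothetical.

```python
import os
import shutil


def shm_size_mib(path="/dev/shm"):
    """Return the total size of the shared-memory mount in MiB, or None if absent."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).total // (1024 * 1024)


def suggested_num_workers(shm_mib, min_shm_mib=1024, workers=4):
    """Suggest a DataLoader worker count: 0 when shared memory looks too small.

    Both defaults are illustrative assumptions; tune them for your setup.
    """
    if shm_mib is None or shm_mib < min_shm_mib:
        return 0
    return workers


size = shm_size_mib()
print("/dev/shm size (MiB):", size)
print("suggested workers:", suggested_num_workers(size))
```

If the reported size is still very small after the restart, the emptyDir volume is probably not mounted, and training without workers is the safer fallback.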

Explore and run the model training notebook

Navigate to manuela-visual-inspection/ml/pytorch and open Manuela_Visual_Inspection_Yolov5_Model_Training.ipynb
Explore (or explain) and run the cells step by step
- Setup and test the Ultralytics Yolov5 toolkit
- Inspect training dataset (image and labels)
- Model training
  - Model training can take ~30 minutes or more even with GPUs. You could jump to Model Serving, use a pre-trained model and come back later.
- Model validation
- Convert the model to ONNX format and upload it to S3

Note: the notebook's cells contain output messages from a previous successful run. This lets you explain the demo without actually running anything (i.e. no GPUs required, etc.). To run the demo yourself, however, you need to run every cell successfully once, even if the outputs suggest it has already run.
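
When inspecting the training dataset, it helps to know how YOLOv5 label files are laid out: each line holds a class index followed by the box centre x, centre y, width and height, all normalised to the range 0 to 1. The following is a minimal sketch, assuming the dataset follows this standard layout, that converts one label line to pixel coordinates. The function name and the sample values are made up for illustration.

```python
def yolo_label_to_pixels(line, img_width, img_height):
    """Convert one YOLO label line to (class_id, x_min, y_min, x_max, y_max) in pixels."""
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    class_id = int(parts[0])
    x_c, y_c, w, h = (float(p) for p in parts[1:])
    # Move from centre/size to corner coordinates, then scale to pixels.
    x_min = (x_c - w / 2) * img_width
    y_min = (y_c - h / 2) * img_height
    x_max = (x_c + w / 2) * img_width
    y_max = (y_c + h / 2) * img_height
    return class_id, round(x_min), round(y_min), round(x_max), round(y_max)


# Made-up example: a box centred in a 640x640 image, a quarter of its width and height.
print(yolo_label_to_pixels("0 0.5 0.5 0.25 0.25", 640, 640))
# → (0, 240, 240, 400, 400)
```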

Model Serving

Optionally, download a pre-trained manu-vi model and upload it to S3

If you did not have the time or resources to train the model yourself, you can download a pre-trained manu-vi model and upload it to your S3 bucket.

Open your workbench (with your manu-vi data connection)
Navigate to manuela-visual-inspection/ml/pytorch and open Upload_pretrained_model.ipynb
Run the notebook to upload the model

Configure RHODS model serving

Create model server in your data science project
- Models and model servers -> Configure server
- Number of model server replicas to deploy: 1
- Model server size: Small
- Model route: -> Check/Enable 'Make deployed models available through an external route'
- Token authorization -> Uncheck/Disable 'Require token authentication'
- -> Configure
Deploy the trained model -> Deploy Model
- Model Name: manu-vi
- Model framework: onnx - 1
- Model location: Existing data connection
- Name: manu-vi
- Folder path: manu-vi-best.onnx
- -> Deploy
Wait until Status is green / loaded
- Copy and save the inference URL

Test inferencing with a REST API call

Show how an ML REST call could be integrated into your 'intelligent' Python application.

Return to the workbench
Navigate to manuela-visual-inspection/ml/pytorch and open Manuela_Visual_Inspection_Yolov5_Infer_Rest.ipynb
Study (or explain) and run the cells step by step
- Please don't forget to update the inferencing URL
Demonstrate cool inferencing with RHODS :-)
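
The REST call from the steps above can be sketched with the standard library alone. This sketch assumes that the model server speaks the KServe v2 inference protocol, that the exported model has a single input named images with shape [1, 3, 640, 640], and that the inference URL you copied already points at the model's infer endpoint. None of these details are stated in this README, so check them against the notebook and your deployed model. The function names are hypothetical.

```python
import json
import urllib.request


def build_infer_request(pixels, input_name="images", shape=(1, 3, 640, 640)):
    """Build a KServe v2 style request body from a flat list of float pixel values."""
    expected = 1
    for dim in shape:
        expected *= dim
    if len(pixels) != expected:
        raise ValueError(f"expected {expected} values, got {len(pixels)}")
    return {
        "inputs": [
            {
                "name": input_name,  # assumption: name of the ONNX input tensor
                "shape": list(shape),
                "datatype": "FP32",
                "data": pixels,
            }
        ]
    }


def post_infer(infer_url, body, timeout=30):
    """POST the request body to the inference URL and return the decoded JSON reply."""
    request = urllib.request.Request(
        infer_url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

In an application you would flatten the preprocessed image into pixels, call build_infer_request, and pass the result to post_infer together with the inference URL saved earlier. The reply contains raw model output that still needs post-processing, which the notebook walks through.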
