Implement model registry inference service reconciliation #135

Contributor

@lampajr lampajr commented Jan 4, 2024

Fixes https://github.com/opendatahub-io/model-registry/issues/249

Description

Implement the model registry InferenceService reconciliation. The implemented workflow is the opposite of the one proposed in #124: here the direction goes from the cluster to the model registry.

The new reconciler monitors InferenceService CRs that carry pre-defined labels and, based on those labels, syncs the model registry by keeping track of every deployment that occurs in the cluster.
It then updates the InferenceService CR, linking it to the model registry record through a specific label.

How Has This Been Tested?

e2e tests

Added e2e tests under the test/e2e/ folder. Currently I am running those tests on a Kind cluster:

  1. Setup cluster: kind create cluster --config test/config/kind-e2e-config.yaml
  2. Create temporary dev tag image: IMG=quay.io/$USER/odh-model-controller:$(git rev-parse HEAD)
  3. Build and tag dev image: make IMG=$IMG docker-build
  4. Add local image to cluster: kind load docker-image $IMG
  5. Deploy the controller: make IMG=${IMG} deploy-dev
  6. Run e2e tests: make e2e-test

NOTE: These steps could (hopefully) be easily converted into a GitHub Actions workflow if we decide to go with that.
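
The six steps above can be chained into a single script, which could also serve as a starting point for a CI job. This is only a sketch: dev_image and run_e2e are hypothetical helper names that are not part of the repository, and the commands are the ones listed above.

```shell
#!/bin/sh
# Sketch: chain the Kind-based e2e steps listed above.
# dev_image and run_e2e are hypothetical helper names, not part of the repo.

# Build the temporary dev image reference from a registry user and a commit SHA.
dev_image() {
  printf 'quay.io/%s/odh-model-controller:%s\n' "$1" "$2"
}

# Run the steps in order; the chain stops at the first failing command.
run_e2e() (
  IMG="$(dev_image "$USER" "$(git rev-parse HEAD)")" &&
  kind create cluster --config test/config/kind-e2e-config.yaml &&
  make IMG="$IMG" docker-build &&
  kind load docker-image "$IMG" &&
  make IMG="$IMG" deploy-dev &&
  make e2e-test
)

# Usage (from the repository root): run_e2e
```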

manual test

Setup cluster

  1. Install ODH 2.4.0
  2. Install the ODH components using the following DataScienceCluster CR
kind: DataScienceCluster
apiVersion: datasciencecluster.opendatahub.io/v1
metadata:
  name: default
  labels:
    app.kubernetes.io/name: datasciencecluster
    app.kubernetes.io/instance: default
    app.kubernetes.io/part-of: opendatahub-operator
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/created-by: opendatahub-operator
spec:
  components:
    codeflare:
      managementState: Removed
    dashboard:
      managementState: Managed
    datasciencepipelines:
      managementState: Removed
    kserve:
      managementState: Managed
      devFlags:
        manifests:
          - contextDir: config
            sourcePath: overlays/odh
            uri: https://github.com/opendatahub-io/kserve/tarball/master
          - contextDir: config
            uri: https://github.com/lampajr/odh-model-controller/tarball/lampajr20231219_gh249_reconciler_from_isvc_test
    modelmeshserving:
      managementState: Managed
      devFlags:
        manifests:
          - contextDir: config
            sourcePath: overlays/odh
            uri: https://github.com/opendatahub-io/modelmesh-serving/tarball/main
          - contextDir: config
            uri: https://github.com/lampajr/odh-model-controller/tarball/lampajr20231219_gh249_reconciler_from_isvc_test
    ray:
      managementState: Removed
    workbenches:
      managementState: Managed
    trustyai:
      managementState: Removed

NOTE: lampajr20231219_gh249_reconciler_from_isvc_test contains manifest changes to make use of the correct odh-model-controller image: alampare/odh-model-controller

By default this new model-registry reconciler/controller for model serving is disabled in odh-model-controller. To enable it, start the controller with the --model-registry-enabled flag.
https://github.com/lampajr/odh-model-controller/tarball/lampajr20231219_gh249_reconciler_from_isvc_test already contains this change in the configuration.

  1. Install model registry operator
  2. Clone model-registry-operator repository: git clone https://github.com/opendatahub-io/model-registry-operator.git
  3. Install CRDs: make KUBECTL=oc install
  4. Deploy the operator: make KUBECTL=oc deploy

Setup DS project

  1. Create a Data Science project using the ODH dashboard, e.g., demo-model-registry-20240104
  2. Set up a data connection (you could install a local minio instance in the newly created DS project)
  3. Create a model server using the ODH dashboard
  4. Apply the sample model registry CR from the model-registry-operator folder: kustomize build config/samples | oc apply -f - (make sure oc is pointing to the Data Science project created in step 1)
  5. Create a new model server using the ODH dashboard, in my case named model-server

Upload models

For the sake of simplicity let's upload some onnx files in the local minio instance (bucket name is models):

  • models/mnist/v12/mnist-12.onnx
  • models/mnist/v8/mnist-8.onnx

Fill up the model registry

NOTE: IDs could change depending on the order of operations and on the number of models you are going to register.

  1. Set the model registry proxy URL; you might have to set up a custom route to port 8080 of the model registry proxy
MR_HOSTNAME="<endpoint>"
  1. Register an empty model, e.g., MNIST
curl --silent -X 'POST' \
  "$MR_HOSTNAME/api/model_registry/v1alpha1/registered_models" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "description": "MNIST model for recognizing handwritten digits",
  "name": "MNIST",
  "state": "LIVE"
}'
  1. Add a new model version
curl --silent -X 'POST' \
  "$MR_HOSTNAME/api/model_registry/v1alpha1/model_versions" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "description": "MNIST model version n. 8",
  "name": "v8",
  "registeredModelID": "1",
  "state": "LIVE"
}'
  1. Add the model artifact to the created model version
curl --silent -X 'POST' \
  "$MR_HOSTNAME/api/model_registry/v1alpha1/model_versions/2/artifacts" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "description": "Model artifact for MNIST v8",
  "uri": "s3://models/mnist/v8/mnist-8.onnx",
  "state": "UNKNOWN",
  "name": "v8-model",
  "modelFormatName": "onnx",
  "modelFormatVersion": "1",
  "storageKey": "aws-connection-models",
  "storagePath": "mnist/v8",
  "artifactType": "model-artifact"
}'
  1. [Optional] If for some reason you need to update the artifact you can run:
curl --silent -X 'PATCH' \
  "$MR_HOSTNAME/api/model_registry/v1alpha1/model_artifacts/1" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '  {
    "description": "Model artifact for MNIST v8",
    "modelFormatName": "onnx",
    "modelFormatVersion": "1",
    "state": "UNKNOWN",
    "storageKey": "aws-connection-models",
    "storagePath": "mnist/v8",
    "uri": "s3://models/mnist/v8/mnist-8.onnx"
  }
'

NOTE: modelFormatName, modelFormatVersion, storageKey and storagePath are the pieces of information needed to create the InferenceService, so they must be valid values. E.g., the storageKey must match the name of an existing data connection (which is aws-connection-{{connection name}}).
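
Since the IDs depend on the order of operations, the registration steps above could be chained so that each call reuses the ID returned by the previous one. This is a sketch under assumptions: the helper names are hypothetical, jq is assumed to be installed, and each POST response is assumed to echo the created entity with an "id" field (this thread does not confirm that).

```shell
#!/bin/sh
# Sketch: register a model, a version and an artifact, reusing returned IDs
# instead of hardcoding them. All helper names are hypothetical.
# Assumes MR_HOSTNAME is set, jq is installed, and that each POST response
# contains the created entity with an "id" field (an assumption).

# Build the JSON body for a new model version of the given registered model.
model_version_payload() {
  printf '{"name": "%s", "registeredModelID": "%s", "state": "LIVE"}\n' "$1" "$2"
}

# POST a JSON body to a registry path and print the "id" of the response.
mr_create() {
  curl --silent -X POST "$MR_HOSTNAME/api/model_registry/v1alpha1/$1" \
    -H 'accept: application/json' -H 'Content-Type: application/json' \
    -d "$2" | jq -r '.id'
}

# Register MNIST, its v8 version and the v8 artifact, then print the two IDs
# needed for the InferenceService labels.
register_mnist_v8() (
  rm_id="$(mr_create registered_models '{"name": "MNIST", "state": "LIVE"}')" &&
  mv_id="$(mr_create model_versions "$(model_version_payload v8 "$rm_id")")" &&
  mr_create "model_versions/$mv_id/artifacts" '{
    "name": "v8-model",
    "uri": "s3://models/mnist/v8/mnist-8.onnx",
    "modelFormatName": "onnx",
    "modelFormatVersion": "1",
    "storageKey": "aws-connection-models",
    "storagePath": "mnist/v8",
    "artifactType": "model-artifact"
  }' >/dev/null &&
  echo "registered-model-id=$rm_id model-version-id=$mv_id"
)

# Usage: MR_HOSTNAME="<endpoint>" register_mnist_v8
```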

You can use the following curls to inspect model registry content:

# Get registered models
curl --silent -X 'GET' \
  "$MR_HOSTNAME/api/model_registry/v1alpha1/registered_models?pageSize=100&orderBy=ID&sortOrder=DESC&nextPageToken=" \
  -H 'accept: application/json' | jq '.items'

# Get model versions (across all registered models)
curl --silent -X 'GET' \
  "$MR_HOSTNAME/api/model_registry/v1alpha1/model_versions?pageSize=100&orderBy=ID&sortOrder=DESC&nextPageToken=" \
  -H 'accept: application/json' | jq '.items'

# Get model artifacts for a specific model version
curl --silent -X 'GET' \
  "$MR_HOSTNAME/api/model_registry/v1alpha1/model_versions/3/artifacts?pageSize=100&orderBy=ID&sortOrder=DESC&nextPageToken=" \
  -H 'accept: application/json' | jq '.items'

# Get all inference services stored in model registry
curl --silent -X 'GET' \
  "$MR_HOSTNAME/api/model_registry/v1alpha1/inference_services?pageSize=100&orderBy=ID&sortOrder=DESC&nextPageToken=" \
  -H 'accept: application/json' | jq '.items'
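
The four queries above differ only in the resource path, so they could be wrapped in one helper. A minimal sketch, where mr_list_url and mr_list are hypothetical names and the paging parameters are copied from the commands above:

```shell
#!/bin/sh
# Sketch: one helper for the repeated list queries above.
# mr_list_url and mr_list are hypothetical names.

# Build the list URL for a resource path, with the paging parameters used above.
mr_list_url() {
  printf '%s/api/model_registry/v1alpha1/%s?pageSize=100&orderBy=ID&sortOrder=DESC&nextPageToken=\n' "$1" "$2"
}

# Fetch a resource list and print its items (requires curl and jq).
mr_list() {
  curl --silent -X GET "$(mr_list_url "$MR_HOSTNAME" "$1")" \
    -H 'accept: application/json' | jq '.items'
}

# Usage:
#   mr_list registered_models
#   mr_list model_versions/3/artifacts
#   mr_list inference_services
```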

Test model controller workflow

As soon as you create an InferenceService CR in the project, the odh-model-controller creates a ServingEnvironment in the model registry whose name equals the namespace (i.e., the DS project). You can inspect its ID:

curl --silent -X 'GET' \
  "$MR_HOSTNAME/api/model_registry/v1alpha1/serving_environments?pageSize=100&orderBy=ID&sortOrder=DESC&nextPageToken=" \
  -H 'accept: application/json' | jq '.items'
  1. Apply the InferenceService with proper links to the registered model (IDs might be different):
oc apply -n demo-model-registry-20240104 -f - <<EOF
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "mnist-v8-model"
  annotations:
    "openshift.io/display-name": "mnist-v8-model"
    "serving.kserve.io/deploymentMode": "ModelMesh"
  labels:
    "modelregistry.opendatahub.io/registered-model-id": "1"
    "modelregistry.opendatahub.io/model-version-id": "2"
    "opendatahub.io/dashboard": "true"
spec:
  predictor:
    model:
      modelFormat:
        name: onnx
        version: "1"
      runtime: model-server
      storage:
        key: aws-connection-models
        path: mnist/v8/mnist-8.onnx
EOF
  • modelregistry.opendatahub.io/registered-model-id identifies the registered model we would like to deploy
  • modelregistry.opendatahub.io/model-version-id identifies the specific version to deploy; if omitted, the latest version of that registered model is deployed

NOTE: the IDs must match, so make sure you provide the correct IDs.
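
The two label rules can be captured in a small generator for the metadata.labels entries of the manifest. This is a sketch only: isvc_labels is a hypothetical helper, and the label keys are the ones used in the manifest above.

```shell
#!/bin/sh
# Sketch: print the metadata.labels entries for an InferenceService.
# isvc_labels is a hypothetical helper; the label keys are the ones used above.
# Arguments: registered model ID, then an optional model version ID.
# The version label is emitted only when a version ID is passed.
isvc_labels() {
  printf '    "modelregistry.opendatahub.io/registered-model-id": "%s"\n' "$1"
  if [ -n "${2:-}" ]; then
    printf '    "modelregistry.opendatahub.io/model-version-id": "%s"\n' "$2"
  fi
  printf '    "opendatahub.io/dashboard": "true"\n'
}

# Usage:
#   isvc_labels 1 2   # pin version 2 of registered model 1
#   isvc_labels 1     # no version label for registered model 1
```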

At this point, the odh-model-controller will monitor InferenceService CRs and, based on their labels, create the corresponding records in the model registry.

To check the model registry you can run:

curl --silent -X 'GET' \
  "$MR_HOSTNAME/api/model_registry/v1alpha1/inference_services?pageSize=100&orderBy=ID&sortOrder=DESC&nextPageToken=" \
  -H 'accept: application/json' | jq '.items'

Expected output:

[
  {
    "createTimeSinceEpoch": "1704383784228",
    "customProperties": {},
    "id": "4",
    "lastUpdateTimeSinceEpoch": "1704383784228",
    "modelVersionId": "3",
    "name": "mnist-v8-model/4a2b96b4-af76-4270-bd2d-1a5aed2cd6ad",
    "registeredModelId": "2",
    "runtime": "model-server",
    "servingEnvironmentId": "1",
    "desiredState": "DEPLOYED"
  }
]

Check the ISVC has been correctly linked to the newly created record in model registry:

oc get inferenceservices.serving.kserve.io/mnist-v8-model -o yaml

Expected output (note modelregistry.opendatahub.io/inference-service-id label):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  finalizers:
  - modelregistry.opendatahub.io/finalizer
  generation: 1
  labels:
    modelregistry.opendatahub.io/inference-service-id: "4"
    modelregistry.opendatahub.io/registered-model-id: "1"
    modelregistry.opendatahub.io/model-version-id: "2"
    opendatahub.io/dashboard: "true"
  name: mnist-v8-model
  namespace: demo-model-registry-20240104
spec:
  predictor:
    model:
      modelFormat:
        name: onnx
        version: "1"
      name: ""
      resources: {}
      runtime: model-server
      storage:
        key: aws-connection-models
        path: mnist/v8
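
Instead of reading the whole YAML, the linking label can be extracted directly with a jsonpath query. A minimal sketch, where jsonpath_key and isvc_registry_id are hypothetical helper names; the dots in the label key have to be escaped inside the jsonpath expression:

```shell
#!/bin/sh
# Sketch: read the label that links the ISVC to its model registry record.
# jsonpath_key and isvc_registry_id are hypothetical helper names.

# Escape the dots of a label key so it can be used inside an oc/kubectl jsonpath.
jsonpath_key() {
  printf '%s\n' "$1" | sed 's/\./\\./g'
}

# Print the inference-service-id label of an ISVC (empty while not yet linked).
# Arguments: ISVC name, namespace.
isvc_registry_id() {
  key="$(jsonpath_key modelregistry.opendatahub.io/inference-service-id)"
  oc get inferenceservices.serving.kserve.io "$1" -n "$2" \
    -o "jsonpath={.metadata.labels.$key}"
}

# Usage: isvc_registry_id mnist-v8-model demo-model-registry-20240104
```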
  1. Delete the InferenceService (you can do this using the ODH dashboard as well)
oc delete -n demo-model-registry-20240104 -f - <<EOF
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "mnist-v8-model"
  annotations:
    "openshift.io/display-name": "mnist-v8-model"
    "serving.kserve.io/deploymentMode": "ModelMesh"
  labels:
    "modelregistry.opendatahub.io/registered-model-id": "1"
    "modelregistry.opendatahub.io/model-version-id": "2"
    "opendatahub.io/dashboard": "true"
spec:
  predictor:
    model:
      modelFormat:
        name: onnx
        version: "1"
      runtime: model-server
      storage:
        key: aws-connection-models
        path: mnist/v8
EOF

Check that the record in model registry has been correctly updated; note the UNDEPLOYED state:

curl --silent -X 'GET' \
  "$MR_HOSTNAME/api/model_registry/v1alpha1/inference_services?pageSize=100&orderBy=ID&sortOrder=DESC&nextPageToken=" \
  -H 'accept: application/json' | jq '.items'

Expected output:

[
  {
    "createTimeSinceEpoch": "1704383784228",
    "customProperties": {},
    "desiredState": "UNDEPLOYED",
    "id": "4",
    "lastUpdateTimeSinceEpoch": "1704384020813",
    "modelVersionId": "3",
    "name": "mnist-v8-model/4a2b96b4-af76-4270-bd2d-1a5aed2cd6ad",
    "registeredModelId": "2",
    "runtime": "model-server",
    "servingEnvironmentId": "1"
  }
]
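
The record may take a moment to change after the delete, so a scripted check would probably poll rather than query once. A sketch under assumptions: retry_until and is_undeployed are hypothetical helpers, jq is assumed to be installed, and the check passes when any record reports UNDEPLOYED.

```shell
#!/bin/sh
# Sketch: poll until a command succeeds. retry_until is a hypothetical helper.
# Arguments: max attempts, delay in seconds, then the command to run.
retry_until() (
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && exit 0
    i=$((i + 1))
    sleep "$delay"
  done
  exit 1
)

# Hypothetical check: succeeds when any record reports desiredState UNDEPLOYED.
# Requires curl, jq and MR_HOSTNAME.
is_undeployed() {
  curl --silent "$MR_HOSTNAME/api/model_registry/v1alpha1/inference_services?pageSize=100" \
    -H 'accept: application/json' |
    jq -e '.items[] | select(.desiredState == "UNDEPLOYED")' >/dev/null
}

# Usage: retry_until 10 3 is_undeployed
```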

NOTE: the Upload models and Fill up the model registry steps could be automated using either notebooks or pipelines; for a more complete e2e demo see https://github.com/tarilabs/demo20231121#demo

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Contributor

openshift-ci bot commented Jan 4, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@lampajr lampajr force-pushed the lampajr20231219_gh249_reconciler_from_isvc branch 6 times, most recently from 837a40e to 15935b0 Compare January 11, 2024 16:50
@lampajr lampajr force-pushed the lampajr20231219_gh249_reconciler_from_isvc branch from 15935b0 to 5d1c108 Compare January 12, 2024 12:36
@lampajr
Contributor Author

lampajr commented Jan 12, 2024

As agreed, we decided to implement this workflow, where the model-controller has a more "passive" behavior with respect to the model registry reconciliation.

As described in the PR description, the new reconciler will monitor the InferenceService CRs that have proper labels defined and based on those values it will sync the deployments to model registry in the form of model registry InferenceService objects.

NOTE: under the test/ folder I set up some e2e tests that I have been running locally (I added step-by-step instructions in the description). If you think it makes sense, I can add a GitHub workflow to run them automatically using Kind.

PS: all commits must be squashed as soon as the PR is approved or directly during the merge

@lampajr lampajr force-pushed the lampajr20231219_gh249_reconciler_from_isvc branch from 5d1c108 to 6c60f51 Compare January 12, 2024 14:54
@lampajr lampajr marked this pull request as ready for review January 12, 2024 14:54
@openshift-ci openshift-ci bot requested review from israel-hdez and spolti January 12, 2024 14:54
@lampajr
Contributor Author

lampajr commented Jan 12, 2024

fyi @tarilabs @dhirajsb @rareddy

namespace: default
labels:
"mr-inference-service-id": "4"
finalizers:
Contributor

This finalizer will be added to the isvc created by mr controller? The finalizer of the isvc will not be removed by mr controller, right?

Contributor Author

@lampajr lampajr Jan 15, 2024

mmm not exactly, the finalizer will be added by the odh-model-controller itself (see https://github.com/opendatahub-io/odh-model-controller/pull/135/files#diff-e628c4483237f3dba685d802623c535845221a01b2bd2e3782a82ab9df48320aR120-R129) and then it will be removed by the model controller as well during the deletion.

The ISVCs are going to be created by the dashboard, as they are right now.

@lampajr lampajr force-pushed the lampajr20231219_gh249_reconciler_from_isvc branch from 29f6c8c to e368740 Compare February 8, 2024 13:37
tarilabs added a commit to tarilabs/model-registry-bf4-kf that referenced this pull request Feb 8, 2024
In the scope of testing of Model Registry in openshift-ci:
- make a Shell script which invokes some REST calls to MR,
- so to make sure the REST endpoint is responsive,
- then create a K8s ISVC on the cluster,
- and display the MR InferenceService entities.

Later, in a subsequent issue/PR, once:
opendatahub-io/odh-model-controller#135
is merged, the last bulletpoint can be automated and placed under test in the final part of this script so to make sure the K8s ISVC on the cluster reflected as a precise MR InferenceService entity.
Co-authored-by: Edgar Hernández <[email protected]>
Signed-off-by: Andrea Lamparelli <[email protected]>
@lampajr lampajr force-pushed the lampajr20231219_gh249_reconciler_from_isvc branch from e368740 to c9cbe07 Compare February 8, 2024 14:07
lampajr and others added 3 commits February 8, 2024 15:28
Co-authored-by: Edgar Hernández <[email protected]>
Signed-off-by: Andrea Lamparelli <[email protected]>
Signed-off-by: Andrea Lamparelli <[email protected]>
Signed-off-by: Andrea Lamparelli <[email protected]>
@israel-hdez
Contributor

/retest

Contributor

@israel-hdez israel-hdez left a comment

@lampajr I'm good with the changes.

I haven't tried this live. I'll try it on my Friday.

I think the only pending comment is the one from Vedant about the manifests. Let me know when you are ready. If I don't see any issue in my live testing, I'll merge once you confirm this is ready to go.

tarilabs added a commit to tarilabs/model-registry-bf4-kf that referenced this pull request Feb 9, 2024
In the scope of testing of Model Registry in openshift-ci:
- make a Shell script which invokes some REST calls to MR,
- so to make sure the REST endpoint is responsive,
- then create a K8s ISVC on the cluster,
- and display the MR InferenceService entities.

Later, in a subsequent issue/PR, once:
opendatahub-io/odh-model-controller#135
is merged, the last bulletpoint can be automated and placed under test in the final part of this script so to make sure the K8s ISVC on the cluster reflected as a precise MR InferenceService entity.
Signed-off-by: Andrea Lamparelli <[email protected]>
Contributor Author

@lampajr lampajr left a comment

I haven't tried this live. I'll try it on my Friday.

I think the only pending comment is the one from Vedant about the manifests. Let me know when you are ready. If I don't see any issue in my live testing, I'll merge once you confirm this is ready to go.

Thanks @israel-hdez, I hope I added all the details regarding "how to test" this feature in a real cluster.

@VedantMahabaleshwarkar I agree with all your comments and I think I was able to simplify the dev manifests with my latest commit, but unfortunately I wasn't able to fix the crd/external issue: I couldn't find a way to include all those CRDs just in overlays/dev because of

Error: accumulating resources: 2 errors occurred:
	* accumulateFile error: "accumulating resources from '../../crd/external/serving.kserve.io_inferenceservices.yaml': security; file '/home/alampare/repos/odh-model-controller/config/crd/external/serving.kserve.io_inferenceservices.yaml' is not in or below '/home/alampare/repos/odh-model-controller/config/overlays/dev'"
	* loader.New error: "error loading ../../crd/external/serving.kserve.io_inferenceservices.yaml with git: url lacks host: ../../crd/external/serving.kserve.io_inferenceservices.yaml, dir: got file 'serving.kserve.io_inferenceservices.yaml', but '/home/alampare/repos/odh-model-controller/config/crd/external/serving.kserve.io_inferenceservices.yaml' must be a directory to be a root, get: invalid source string: ../../crd/external/serving.kserve.io_inferenceservices.yaml"

On the other hand, I do not see where crd/external/kustomization is used, hence I don't see why keeping all those CRDs enabled would be an issue. Is it referenced from other repos?

- monitoring.coreos.com_servicemonitors.yaml
- maistra.io_servicemeshmemberrolls.yaml
- maistra.io_servicemeshmembers.yaml
- telemetry.istio.io_telemetries.yaml
Contributor Author

I would agree with you @VedantMahabaleshwarkar but I don't know how to reference crd/external from overlays/dev without directly changing crd/external/kustomization.yaml. I tried to add all the CRDs directly into overlays/dev/kustomization but I get:

Error: accumulating resources: 2 errors occurred:
	* accumulateFile error: "accumulating resources from '../../crd/external/serving.kserve.io_inferenceservices.yaml': security; file '/home/alampare/repos/odh-model-controller/config/crd/external/serving.kserve.io_inferenceservices.yaml' is not in or below '/home/alampare/repos/odh-model-controller/config/overlays/dev'"
	* loader.New error: "error loading ../../crd/external/serving.kserve.io_inferenceservices.yaml with git: url lacks host: ../../crd/external/serving.kserve.io_inferenceservices.yaml, dir: got file 'serving.kserve.io_inferenceservices.yaml', but '/home/alampare/repos/odh-model-controller/config/crd/external/serving.kserve.io_inferenceservices.yaml' must be a directory to be a root, get: invalid source string: ../../crd/external/serving.kserve.io_inferenceservices.yaml"

any idea?

tarilabs added a commit to opendatahub-io/model-registry-bf4-kf that referenced this pull request Feb 12, 2024
* add Shell script openshift-ci make some REST call to MR

In the scope of testing of Model Registry in openshift-ci:
- make a Shell script which invokes some REST calls to MR,
- so to make sure the REST endpoint is responsive,
- then create a K8s ISVC on the cluster,
- and display the MR InferenceService entities.

Later, in a subsequent issue/PR, once:
opendatahub-io/odh-model-controller#135
is merged, the last bulletpoint can be automated and placed under test in the final part of this script so to make sure the K8s ISVC on the cluster reflected as a precise MR InferenceService entity.

* omit OCP_CLUSTER_NAME to be valorized once on openshift-ci
@lampajr
Contributor Author

lampajr commented Feb 14, 2024

Hi @israel-hdez @VedantMahabaleshwarkar , thanks a lot for all your suggestions!!

I think I was able to address all of them, I also tested again in a real cluster and it worked for me (following the instructions I provided in the PR description).

@israel-hdez
Contributor

/hold

Contributor

@israel-hdez israel-hdez left a comment

I was able to try this and it runs fine.
This is low risk because it is turned off by default. So, I'll merge.

I left one comment for your consideration for later improvement.

}
Expect(grpcPort).ToNot(BeNil())

mlmdAddr = fmt.Sprintf("localhost:%d", *grpcPort)
Contributor

I had to change the code because of this connection to localhost.
This is somewhat specific to Kind. I was able to work around it by doing a port-forward, but given that the tests re-deploy the registry, the port-forward needs to be re-opened before each test.
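
One way to sketch that workaround is a wrapper that opens the forward, runs a single test command, and closes the forward again. This is not what was actually run for this review: with_port_forward is a hypothetical helper, PF_CMD and PF_WARMUP are made-up knobs, and the service name and ports are placeholders.

```shell
#!/bin/sh
# Sketch: open a port-forward, run one command, then close the forward.
# with_port_forward is a hypothetical helper; PF_CMD defaults to
# "oc port-forward" and PF_WARMUP (seconds) gives the forward time to start.
# Arguments: forward target, port mapping, then the command to run.
with_port_forward() (
  target="$1"; ports="$2"; shift 2
  ${PF_CMD:-oc port-forward} "$target" "$ports" >/dev/null 2>&1 &
  pf_pid=$!
  sleep "${PF_WARMUP:-2}"
  "$@"
  rc=$?
  # Close the forward and return the exit status of the wrapped command.
  kill "$pf_pid" 2>/dev/null
  wait "$pf_pid" 2>/dev/null
  exit "$rc"
)

# Usage (service name, ports and test command are placeholders):
#   with_port_forward svc/<registry-service> <local>:<remote> <single test command>
```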

Contributor Author

Interesting, thanks for pointing this out! I will keep track of this in a separate issue 💪

Contributor

openshift-ci bot commented Feb 15, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: israel-hdez, lampajr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@israel-hdez israel-hdez merged commit 518cfa2 into opendatahub-io:main Feb 15, 2024
4 of 5 checks passed
@lampajr lampajr deleted the lampajr20231219_gh249_reconciler_from_isvc branch February 15, 2024 13:38
Successfully merging this pull request may close these issues.

[model-controller] Investigate serving integration from cluster data
7 participants