-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Setup model registry serving reconciliation #124
Setup model registry serving reconciliation #124
Conversation
Skipping CI for Draft Pull Request. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: lampajr The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
8e5397e
to
f0e4560
Compare
aaa6e2d
to
d99d986
Compare
I am struggling finding a way to test this new reconciliation as part of the unit testing already setup, i.e., using I setup the new reconciler in the existing tests (this should guarantee all other reconcilers are not affected by this new one) but to correctly test the behavior I should be able to connect to an existing Model Registry, this is a sort of preconditions otherwise the reconcile loop won't execute anything at all. Therefore it would be more like an e2e test rather than a unit test. I was thinking to make use of a TestContainer to setup the Model Registry and connect to that, not sure if that is really feasible and if that is the correct approach here. Any suggestion or advice is really appreciated here!! |
Signed-off-by: Andrea Lamparelli <[email protected]>
Signed-off-by: Andrea Lamparelli <[email protected]>
Signed-off-by: Andrea Lamparelli <[email protected]>
Signed-off-by: Andrea Lamparelli <[email protected]>
Signed-off-by: Andrea Lamparelli <[email protected]>
d99d986
to
ca2d6a3
Compare
For the records I was able to properly setup test for the new controller by making use of TestContainer to simulate the model-registry (actually the ml-metadata) service. |
/test all |
Ok it looks like this is not working on openshift-ci as it seems there is no docker daemon running there:
Is that not running for a reason? does it could be enabled somehow? |
One thing that came to my mind was to mock the model registry library by providing static content, this way I can check reconciler behavior based on that content. This might solve the unit test part, wdyt? On the other hand setting up some ITs where there is a real connection to a MLMD service could be very helpful as in this case the real model registry library would be used. |
1671f36
to
1064261
Compare
With commit 1064261 I updated the unit testing for this new reconciliation loop, in particular I removed the TestContainer approach and I mocked the model registry service. Will evaluate/setup e2e test in a separate PR to avoid blocking this for too long. |
Signed-off-by: Andrea Lamparelli <[email protected]>
1064261
to
4e17ea0
Compare
Signed-off-by: Andrea Lamparelli <[email protected]>
With commit 177f9af I just updated the model registry service to the latest one which included some fixes. @spolti @Jooho the PR is ready for a review so whenever you guys have some time, do you mind to check it? Thanks a lot! As I also mentioned in the description this PR is actually the implementation of the workflow diagram described and agreed here opendatahub-io/model-registry#104 |
@@ -167,8 +167,11 @@ rules: | |||
resources: | |||
- inferenceservices | |||
verbs: | |||
- create |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this 2 additional verbs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question, that is because the new reconciliation will create/delete/update inferenceservices
objects based on the Model Registry content which is actually the main purpose of this new reconciler
Pull request drafted as we are still considering which direction we would like to implement first, this one or #135 - since these two PRs are conflicting each other, once we decide which one will go first I will adapt the other one as well. |
Implement the model registry InferenceService reconciliation, the implemented workflow is the opposite of the one proposed in #124. Here the direction goes from Cluster to ModelRegistry. The new reconciler will monitor InferenceService CRs having pre-defined labels, based on those labels will sync the model registry by keeping track of every deployment that occurred in the cluster. Then will update the InferenceService CR by linking it to the model registry record using a specific label. * Fix overlays/dev deployment * Setup InferenceService reconciliation to model registry * Setup e2e test for new reconciler * Enable new reconciler * Create serving env when not found * Retrieve MR namespace from ISVC label * Added Kind cluster configuration for local e2e tests * Check if IS already exists in MR * Improve e2e testing * Refactor dev manifests Signed-off-by: Andrea Lamparelli <[email protected]> Co-authored-by: Edgar Hernández <[email protected]>
Since #135 was merged, I would close this one and if we will need to implement again this workflow I'd suggest to start fresh new PR. |
Fixes https://github.com/opendatahub-io/model-registry/issues/174
Fixes https://github.com/opendatahub-io/model-registry/issues/175
Description
Implement the Model Registry <--> Model Serving interaction described in issue opendatahub-io/model-registry#104.
Setup a new Reconciler that is triggered whenever a
ServingRuntime
is create/updated/deleted and in a periodic fashion (i.e., every N seconds where N could be specified as startup flag, default60 seconds
).The summarized reconciliation workflow is the following:
<Namespace>/<ServingRuntime>
do:Runtime
the currentServingRuntime
name.<InferenceService>
do:ODH model controller keep track of those operation in the model registry itself, specifically as
ServeModel
recods (one per deployment).These records can have different states, for now I used the following:
How Has This Been Tested?
Setup cluster
By default this new model-registry for model-serving reconciler/controller is disabled in
odh-model-controller
, in order to properly enable it you should just start the controller with--model-registry-enabled
flag.https://github.com/lampajr/odh-model-controller/tarball/mr_serving_reconciler_test
already contains this change in the configuration.git clone https://github.com/opendatahub-io/model-registry-operator.git
make KUBECTL=oc install
make KUBECTL=oc deploy
Setup DS project
demo-model-registry-20231211
kustomize build config/samples | oc apply -f -
(make sureoc
is pointing to the data science project create in step 1.)model-server
Upload models
For the sake of simplicity let's upload some
onnx
files in the local minio instance (bucket name ismodels
):models/mnist/v12/mnist-12.onnx
models/mnist/v8/mnist-8.onnx
Fill up the model registry
:8080
of the model registry proxyMR_HOSTNAME="<endpoint>"
You can use the following curls to inspect model registry content:
Test model controller workflow
As soon as you create a model server in the DS project, the odh-model-controller will create the
ServingEnvironment
in the model registry having thename == namespace
(i.e., DS project), you can inspect its ID:The
runtime
must match the name of the model server create in the ODH platform for the current DS project.In this specific example we are going to store in the model registry the intention to deploy a model into
model-server
Openvino server:servingEnvironmentId
identifies the namespace where the deployment should occur, in this case id1
which identifies DS projectdemo-model-registry-20231211
registeredModelId
identifies the registered model we would like to deploymodelVersionId
identifies the specific version to deploy, if omitted (as in this case) let's deploy the latest version for that registered modelAt this point, the odh-model-controller will periodically inspect all existing
ServingRuntime
CRs (identifying configured model servers) and for each of them it will retrieve all Inference Services stored in model registry (using ServingEnv and Runtime as search criteria), then for each IS in model registry will do:Let's try to tag a new version and see if the reconciliation will automatically update the CR, triggering the atuomated deployment of the latest version (the new one)
The next reconciliation loop should find that there is a Delta since the latest version is different, therefore the desired CR does not match the existing one. Once th eCR is updated, the model server should automatically react to deploy the new model:
Form the odh-model-controller logs:
As a result the model deployment should disappear from the ODH dashboard too.
Here some logs from
ovms-adapter
:ServeModel
records (associated to a specific model registry inference service) that you can inspect as follows:Merge criteria: