Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: Ensure odh-model-controller Deployment Waits for ConfigMap #361

Conversation

trujillm
Copy link
Contributor

@trujillm trujillm commented Feb 3, 2025

Fix: Ensure odh-model-controller Deployment Waits for ConfigMap & Adjust NIM_STATE Handling

Issue

  • The Kustomization for odh-model-controller creates both the Deployment and the ConfigMap.
  • However, the Deployment is set with optional: true for the ConfigMap, meaning it can start before the ConfigMap is available.
  • If the Deployment starts first (as seen in QE), it runs without the NIM_STATE environment variable, causing the backend to ignore Accounts.

Fix 1: Ensure ConfigMap Availability

  • Set optional: false for the ConfigMap in the Deployment.
  • This ensures the Deployment waits for the ConfigMap before starting.
  • Since both resources are managed by the same Kustomization and the ConfigMap has a default value, this change is safe.

Fix 2: Adjust NIM_STATE Handling

  • Stop checking for an empty string when validating NIM_STATE.
  • Instead of:
    if !slices.Contains([]string{"removed", ""}, nimState) {
    Change to:
    if nimState != "removed" {
  • This ensures that if NIM_STATE is not specified, it defaults to enabled instead of off.

Impact

  • Ensures consistent deployment behavior.

  • Prevents the backend from ignoring Accounts due to missing NIM_STATE.

  • Enables NIM by default when not explicitly set.

  • The commits are squashed in a cohesive manner and have meaningful messages.

  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).

  • The developer has manually tested the changes and verified that the changes work

Copy link
Contributor

openshift-ci bot commented Feb 3, 2025

Hi @trujillm. Thanks for your PR.

I'm waiting for a opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@trujillm trujillm marked this pull request as draft February 3, 2025 16:08
@trujillm trujillm changed the title initial commit to fix NIM and deployment issues Fix: Ensure odh-model-controller Deployment Waits for ConfigMap Feb 3, 2025
@Jooho
Copy link
Contributor

Jooho commented Feb 3, 2025

@trujillm I know this is draft but could you please add unit ?

@trujillm
Copy link
Contributor Author

trujillm commented Feb 3, 2025

@trujillm I know this is draft but could you please add unit ?

Can you clarify what you mean @Jooho ?

@trujillm trujillm marked this pull request as ready for review February 4, 2025 14:30
@@ -105,7 +105,6 @@ spec:
configMapKeyRef:
name: odh-model-controller-parameters
key: nim-state
optional: true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we keep it as optional?
I mean, the os.GetEnv is ok if the env is not set( with the suggested change), however, if for some reason it is not set in during the startup, it would prevent odh to start.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@spolti I believe the intent is to block the deployment until the configmap is ready which I believe is the expected behavior for having the optional value defaulted to false

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@spolti

shouldn't we keep it as optional?
I mean, the os.GetEnv is ok if the env is not set( with the suggested change)

We have two different default behaviours, for IBM the default is "removed", and for everyone else, it's "managed". So we need the ConfigMap to be configured properly by the operator-controller and we cannot decide a default value without the ConfigMap.

however, if for some reason it is not set in during the startup, it would prevent odh to start.

Is this feasible? Both the Deployment and the ConfigMap are created by the same kustomization file. The ConfigMap is used and probably required by other parts too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for IBM the default is "removed",

How is this achieved?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomerFi I understand why you choose removed for the condition. I am ok with it.
However, I have a question about the IBM case. How can you set the default value removed?
The configmap manifests will be reside in RHOAI operator so it can not be editable after release.

Copy link
Contributor

@TomerFi TomerFi Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagin the pipelins running for IBM set the DataScienceCluster.spec.kserve.nim.managedState to "removed", this gets translated into the params.env before executing the kustomization file that creates both the ConfigMap and the Deployment.

Copy link
Contributor

@Jooho Jooho Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if Opendatahub operator have the logic, it makes sense then.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading back my comment, I think I was unclear. When I wrote: We have two different default behaviours, of course, we don't actually have two defaults, sorry about that. My point was, we can't decide what default value to use, hence we must rely on the ConfigMap that always has the correct value for us.

cmd/main.go Show resolved Hide resolved
@trujillm
Copy link
Contributor Author

trujillm commented Feb 4, 2025

@Jooho @spolti Updates have been made please review at your earliest convenience.

Thanks

@trujillm
Copy link
Contributor Author

trujillm commented Feb 4, 2025

/retest

Copy link
Contributor

openshift-ci bot commented Feb 4, 2025

@trujillm: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link
Contributor

@Jooho Jooho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not related to this PR but I have a question about IBM case

@@ -105,7 +105,6 @@ spec:
configMapKeyRef:
name: odh-model-controller-parameters
key: nim-state
optional: true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomerFi I understand why you choose removed for the condition. I am ok with it.
However, I have a question about the IBM case. How can you set the default value removed?
The configmap manifests will be reside in RHOAI operator so it can not be editable after release.

@Jooho
Copy link
Contributor

Jooho commented Feb 4, 2025

/rerun-all

@israel-hdez
Copy link
Contributor

/ok-to-test

@israel-hdez
Copy link
Contributor

/retest

@Jooho
Copy link
Contributor

Jooho commented Feb 4, 2025

/approve
/lgtm

@openshift-ci openshift-ci bot added the lgtm label Feb 4, 2025
Copy link
Contributor

openshift-ci bot commented Feb 4, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jooho, trujillm

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Feb 4, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 7806732 into opendatahub-io:incubating Feb 4, 2025
7 checks passed
TomerFi pushed a commit to TomerFi/odh-model-controller that referenced this pull request Feb 6, 2025
…datahub-io#361)

* initial commit to fix NIM and deployment issues

Signed-off-by: mtrujillo <[email protected]>

* update to set default on nimState

Signed-off-by: mtrujillo <[email protected]>

* updated optinal to false

Signed-off-by: mtrujillo <[email protected]>

---------

Signed-off-by: mtrujillo <[email protected]>
(cherry picked from commit 7806732)
TomerFi pushed a commit to TomerFi/odh-model-controller that referenced this pull request Feb 6, 2025
…datahub-io#361)

* initial commit to fix NIM and deployment issues

Signed-off-by: mtrujillo <[email protected]>

* update to set default on nimState

Signed-off-by: mtrujillo <[email protected]>

* updated optinal to false

Signed-off-by: mtrujillo <[email protected]>

---------

Signed-off-by: mtrujillo <[email protected]>
TomerFi pushed a commit to TomerFi/odh-model-controller that referenced this pull request Feb 6, 2025
TomerFi pushed a commit to TomerFi/odh-model-controller that referenced this pull request Feb 6, 2025
…datahub-io#361)

* initial commit to fix NIM and deployment issues

Signed-off-by: Marcus Trujillo <[email protected]>

* update to set default on nimState

Signed-off-by: Marcus Trujillo <[email protected]>

* updated optinal to false

Signed-off-by: Marcus Trujillo <[email protected]>

---------

Signed-off-by: Marcus Trujillo <[email protected]>
(cherry picked from commit 7806732)
openshift-merge-bot bot pushed a commit that referenced this pull request Feb 7, 2025
* fix: added accept header to nim manifest pull requests accepting oci (#362)

* fix: added accept header to nim manifest pull requests accepting oci

Signed-off-by: Tomer Figenblat <[email protected]>

* chore: removed left-over debugging message

Signed-off-by: Tomer Figenblat <[email protected]>

---------

Signed-off-by: Tomer Figenblat <[email protected]>
(cherry picked from commit f7fdf0a)

* Fix: Ensure odh-model-controller Deployment Waits for ConfigMap (#361)

* initial commit to fix NIM and deployment issues

Signed-off-by: Marcus Trujillo <[email protected]>

* update to set default on nimState

Signed-off-by: Marcus Trujillo <[email protected]>

* updated optinal to false

Signed-off-by: Marcus Trujillo <[email protected]>

---------

Signed-off-by: Marcus Trujillo <[email protected]>
(cherry picked from commit 7806732)

* fix: nim containers terminated prematurely (#363)

Signed-off-by: Tomer Figenblat <[email protected]>
Co-authored-by: Mikhail Mikhailitchenko <[email protected]>
(cherry picked from commit c92b19f)

* chore: added gocyclo no linting comment on the main function

Signed-off-by: Tomer Figenblat <[email protected]>

---------

Signed-off-by: Tomer Figenblat <[email protected]>
Co-authored-by: Marcus Trujillo <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants