Clean-up ctrl result and error consideration for GlanceAPI #684

Merged

Conversation

abays
Contributor

@abays abays commented Jan 21, 2025

I was looking at the GlanceAPI controller today to compare it with some new code being added in the nascent Watcher Operator. While doing so, I noticed what seemed to be inconsistencies in how we handle ctrl.Result and error returns in some places. I don't think any of this is necessarily crucial, but this PR cleans it up for consistency.
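
For context, the convention I was comparing against looks roughly like the sketch below. This is only an illustration of the general controller-runtime pattern; the reconciler type and helper names are made up for the example and are not the actual GlanceAPI controller code.

// Illustrative sketch only -- not the actual GlanceAPI controller code.
package controllers

import (
    "context"
    "time"

    ctrl "sigs.k8s.io/controller-runtime"
)

// exampleReconciler is a stand-in type for this sketch only.
type exampleReconciler struct{}

func (r *exampleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // A real error: return it with an empty Result. controller-runtime
    // already requeues with backoff on a non-nil error, so also setting
    // RequeueAfter here would be redundant.
    if err := r.ensureDeployment(ctx, req); err != nil {
        return ctrl.Result{}, err
    }

    // Waiting on a condition that is not an error: requeue explicitly
    // and return a nil error.
    ready, err := r.deploymentReady(ctx, req)
    if err != nil {
        return ctrl.Result{}, err
    }
    if !ready {
        return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
    }

    // Reconciliation complete: empty Result, nil error.
    return ctrl.Result{}, nil
}

// Placeholder helpers standing in for real reconcile steps.
func (r *exampleReconciler) ensureDeployment(ctx context.Context, req ctrl.Request) error {
    return nil
}

func (r *exampleReconciler) deploymentReady(ctx context.Context, req ctrl.Request) (bool, error) {
    return true, nil
}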

@openshift-ci openshift-ci bot requested review from fultonj and konan-abhi January 21, 2025 12:31
Contributor

openshift-ci bot commented Jan 21, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abays

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@abays abays requested a review from fmount January 21, 2025 12:40
@abays
Contributor Author

abays commented Jan 21, 2025

@abays: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Test name                                    Commit   Details  Required  Rerun command
ci/prow/glance-operator-build-deploy-kuttl   af494e5  link     true      /test glance-operator-build-deploy-kuttl

Full PR test history. Your PR dashboard.

{  error occurred handling build src-amd64: could not get build src-amd64: builds.build.openshift.io "src-amd64" not found}

😆

/test glance-operator-build-deploy-kuttl

@abays
Contributor Author

abays commented Jan 21, 2025

@abays: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Test name                                    Commit   Details  Required  Rerun command
ci/prow/glance-operator-build-deploy-kuttl   af494e5  link     true      /test glance-operator-build-deploy-kuttl
ci/prow/functional                           af494e5  link     true      /test functional

Full PR test history. Your PR dashboard.

{  error occurred handling build src-amd64: could not get build src-amd64: builds.build.openshift.io "src-amd64" not found}

😡

/retest

@abays
Contributor Author

abays commented Jan 21, 2025

@abays: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Test name                                    Commit   Details  Required  Rerun command
ci/prow/glance-operator-build-deploy-kuttl   af494e5  link     true      /test glance-operator-build-deploy-kuttl

Full PR test history. Your PR dashboard.

error: timed out waiting for the condition on openstackcontrolplanes/openstack-galera

/test glance-operator-build-deploy-kuttl

@abays
Contributor Author

abays commented Jan 21, 2025

@abays: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Test name                                    Commit   Details  Required  Rerun command
ci/prow/glance-operator-build-deploy-kuttl   af494e5  link     true      /test glance-operator-build-deploy-kuttl

Full PR test history. Your PR dashboard.

Somehow the Swift operator is getting multiple k8s services for webhook traffic [1]. As a result, at least some of them seem to be pointing to the wrong pod, so creation of Swift CRs fails and the job fails with it:

calling webhook "mswift.kb.io": failed to call webhook: Post "https://swift-operator-controller-manager-service.openstack-operators.svc:443/mutate-swift-openstack-org-v1beta1-swift?timeout=10s":
      dial tcp 10.130.0.197:9443: connect: connection refused'

I don't see other service operators having the same problem with k8s service pollution.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openstack-k8s-operators_glance-operator/684/pull-ci-openstack-k8s-operators-glance-operator-main-glance-operator-build-deploy-kuttl/1881723118666387456/artifacts/glance-operator-build-deploy-kuttl/openstack-k8s-operators-gather/artifacts/must-gather/quay-io-openstack-k8s-operators-openstack-must-gather-sha256-e8859d2563069511071235c64bbd90ad638782ed6f3d8a6ca1750d7be5c967db/namespaces/openstack-operators/services/

@abays
Contributor Author

abays commented Jan 21, 2025

Ah, I think I might understand what is happening. First, the glance_kuttl testing installs the Swift operator independently, and its webhooks get installed alongside it via OLM. That testing then seems to leave the Swift operator at least partially installed, so the webhooks linger. After that task, the OpenStack operator is installed and we deploy OpenStackControlPlane. But the new RHOSO install paradigm removes webhook enablement from the individual service operators, so the lingering MutatingWebhookConfiguration from the previous standalone Swift operator installation interferes: it tries to direct creation of the OpenStackControlPlane's derivative Swift resource to the now-nonexistent Swift operator webhook, and that's when you hit the error.

@fmount
Contributor

fmount commented Jan 22, 2025

Ok, thank you for all the details. I have to review the logs, but it should be easier given you already did that. I'm wondering why OpenStackControlPlane is installed in this context: kuttl should rely (at least for Glance) on a testing mechanism that is based on OLM for independent operators, with a control plane out of the picture.

@abays
Contributor Author

abays commented Jan 22, 2025

It looks like Prow does both an OpenStack operator install with an accompanying OpenStackControlPlane deploy [1], and then also runs the KUTTL tests [2].

[1] https://github.com/openshift/release/blob/0e26e4ceb3d8642eb19dd306f9c2a630d09deb41/ci-operator/config/openstack-k8s-operators/glance-operator/openstack-k8s-operators-glance-operator-main.yaml#L66-L78
[2] https://github.com/openshift/release/blob/0e26e4ceb3d8642eb19dd306f9c2a630d09deb41/ci-operator/config/openstack-k8s-operators/glance-operator/openstack-k8s-operators-glance-operator-main.yaml#L106-L117

...but it seems like the KUTTL [2] runs after the basic OpenStack operator deployment [1]. So I'm confused as to how [1] is running into the Swift webhooks from [2].

@abays
Contributor Author

abays commented Jan 22, 2025

Maybe it's actually the openstack-k8s-operators-test-build-deploy-kuttl workflow itself that does both KUTTL and then a standard OpenStack operator deploy [1]. In that case, KUTTL goes first and then is followed by the generic OpenStack operator and OpenStackControlPlane deploy, which fits my hypothesis above.

[1] https://github.com/openshift/release/blob/0e26e4ceb3d8642eb19dd306f9c2a630d09deb41/ci-operator/step-registry/openstack-k8s-operators/test/build-deploy-kuttl/openstack-k8s-operators-test-build-deploy-kuttl-workflow.yaml#L10-L11

@fmount
Contributor

fmount commented Jan 22, 2025

Ok, sorry, I'll review them carefully. My understanding of [1] was that build-deploy and kuttl were attached to two different workflows, but you're right, as we have the first deployment here [2].
However, the Prow execution flow looks weird, because [3] reports working kuttl tests (grep for PASS: in [3]).
I'm wondering whether those are flaky failures or whether we have a bug in the resource cleanup process between the workflows defined in [1].

[1] https://github.com/openshift/release/blob/0e26e4ceb3d8642eb19dd306f9c2a630d09deb41/ci-operator/config/openstack-k8s-operators/glance-operator/openstack-k8s-operators-glance-operator-main.yaml#L66-L78
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openstack-k8s-operators_glance-operator/684/pull-ci-openstack-k8s-operators-glance-operator-main-glance-operator-build-deploy-kuttl/1881723118666387456/artifacts/glance-operator-build-deploy-kuttl/openstack-k8s-operators-deploy/build-log.txt
[3] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openstack-k8s-operators_glance-operator/684/pull-ci-openstack-k8s-operators-glance-operator-main-glance-operator-build-deploy-kuttl/1881723118666387456/artifacts/glance-operator-build-deploy-kuttl/openstack-k8s-operators-kuttl/build-log.txt

@fmount
Contributor

fmount commented Jan 22, 2025

/test glance-operator-build-deploy-kuttl

@fmount
Contributor

fmount commented Jan 22, 2025

Maybe it's actually the openstack-k8s-operators-test-build-deploy-kuttl workflow itself that does both KUTTL and then a standard OpenStack operator deploy [1]. In that case, KUTTL goes first and then is followed by the generic OpenStack operator and OpenStackControlPlane deploy, which fits my hypothesis above.

[1] https://github.com/openshift/release/blob/0e26e4ceb3d8642eb19dd306f9c2a630d09deb41/ci-operator/step-registry/openstack-k8s-operators/test/build-deploy-kuttl/openstack-k8s-operators-test-build-deploy-kuttl-workflow.yaml#L10-L11

I'm wondering whether we really need it, as we have a tempest job that covers that part.

@fmount
Contributor

fmount commented Jan 22, 2025

@abays as per your theory openstack-k8s-operators/install_yamls#999 should solve the webhook resource cleanup. We might want to land that patch first and re-kick this job.

@abays
Contributor Author

abays commented Jan 22, 2025

@abays as per your theory openstack-k8s-operators/install_yamls#999 should solve the webhook resource cleanup. We might want to land that patch first and re-kick this job.

I think that will clean up the webhooks installed by the OpenStack operator, but it wouldn't help with the lingering Swift webhooks created by make glance_kuttl, because they lack the requisite labels [1]. And openstack_cleanup wouldn't be called until the very end of the OpenStack operator deploy, unless somehow glance_kuttl calls it when it finishes.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openstack-k8s-operators_glance-operator/684/pull-ci-openstack-k8s-operators-glance-operator-main-glance-operator-build-deploy-kuttl/1881723118666387456/artifacts/glance-operator-build-deploy-kuttl/openstack-k8s-operators-gather/artifacts/must-gather/quay-io-openstack-k8s-operators-openstack-must-gather-sha256-e8859d2563069511071235c64bbd90ad638782ed6f3d8a6ca1750d7be5c967db/webhooks/mutating/mswift.kb.io-gh2nm.yaml

@fmount
Contributor

fmount commented Jan 22, 2025

Ok, I see. I think we basically need to clean up resources as part of the glance_kuttl target, where I see we only clean up Glance and then run make kuttl_common_cleanup.

@abays
Contributor Author

abays commented Jan 22, 2025

It passed this time because the Swift operator somehow persisted beyond the make glance_kuttl execution, such that it was available during the subsequent OpenStack operator deploy workflow. The Swift Deployment lingered and was not replaced by the OpenStack operator's version:

  ownerReferences:
  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: false
    controller: false
    kind: ClusterServiceVersion
    name: swift-operator.v0.0.1
    uid: 802ed280-5757-4a00-9e7b-7b188cb50330

@fmount
Contributor

fmount commented Jan 22, 2025

/lgtm

@fmount
Contributor

fmount commented Jan 22, 2025

Thank you @abays. I'm merging this one as it improves the error handling. Hopefully, with openstack-k8s-operators/install_yamls#1000, we should be able to see the resource cleanup from a Swift perspective, and we can check it from other (still pending) PRs.

@openshift-merge-bot openshift-merge-bot bot merged commit 72c189b into openstack-k8s-operators:main Jan 22, 2025
8 checks passed