Clean-up ctrl result and error consideration for GlanceAPI #684

Merged

Conversation

abays
Contributor

@abays abays commented Jan 21, 2025

I was looking at the GlanceAPI controller today to compare it with some new code being added in the nascent Watcher Operator. While doing so, I noticed what seemed to be inconsistencies in how we handle ctrl.Result and error returns in some places. I don't think any of this is necessarily crucial, but this PR cleans it up for consistency.
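
For context, the convention I was comparing against looks roughly like the sketch below. This is only an illustration of the general controller-runtime pattern; the reconciler type and helper names are made up for the example and are not the actual GlanceAPI controller code.

// Illustrative sketch only -- not the actual GlanceAPI controller code.
package controllers

import (
    "context"
    "time"

    ctrl "sigs.k8s.io/controller-runtime"
)

// exampleReconciler is a stand-in type for this sketch only.
type exampleReconciler struct{}

func (r *exampleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // A real error: return it with an empty Result. controller-runtime
    // already requeues with backoff on a non-nil error, so also setting
    // RequeueAfter here would be redundant.
    if err := r.ensureDeployment(ctx, req); err != nil {
        return ctrl.Result{}, err
    }

    // Waiting on a condition that is not an error: requeue explicitly
    // and return a nil error.
    ready, err := r.deploymentReady(ctx, req)
    if err != nil {
        return ctrl.Result{}, err
    }
    if !ready {
        return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
    }

    // Reconciliation complete: empty Result, nil error.
    return ctrl.Result{}, nil
}

// Placeholder helpers standing in for real reconcile steps.
func (r *exampleReconciler) ensureDeployment(ctx context.Context, req ctrl.Request) error {
    return nil
}

func (r *exampleReconciler) deploymentReady(ctx context.Context, req ctrl.Request) (bool, error) {
    return true, nil
}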

@openshift-ci openshift-ci bot requested review from fultonj and konan-abhi January 21, 2025 12:31
Contributor

openshift-ci bot commented Jan 21, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abays

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@abays abays requested a review from fmount January 21, 2025 12:40
@abays
Contributor Author

abays commented Jan 21, 2025

@abays: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Test name                                    Commit   Details  Required  Rerun command
ci/prow/glance-operator-build-deploy-kuttl   af494e5  link     true      /test glance-operator-build-deploy-kuttl

Full PR test history. Your PR dashboard.

{  error occurred handling build src-amd64: could not get build src-amd64: builds.build.openshift.io "src-amd64" not found}

😆

/test glance-operator-build-deploy-kuttl

@abays
Contributor Author

abays commented Jan 21, 2025

@abays: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Test name                                    Commit   Details  Required  Rerun command
ci/prow/glance-operator-build-deploy-kuttl   af494e5  link     true      /test glance-operator-build-deploy-kuttl
ci/prow/functional                           af494e5  link     true      /test functional

Full PR test history. Your PR dashboard.

{  error occurred handling build src-amd64: could not get build src-amd64: builds.build.openshift.io "src-amd64" not found}

😡

/retest

@abays
Contributor Author

abays commented Jan 21, 2025

@abays: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Test name                                    Commit   Details  Required  Rerun command
ci/prow/glance-operator-build-deploy-kuttl   af494e5  link     true      /test glance-operator-build-deploy-kuttl

Full PR test history. Your PR dashboard.

error: timed out waiting for the condition on openstackcontrolplanes/openstack-galera

/test glance-operator-build-deploy-kuttl

@abays
Contributor Author

abays commented Jan 21, 2025

@abays: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Test name                                    Commit   Details  Required  Rerun command
ci/prow/glance-operator-build-deploy-kuttl   af494e5  link     true      /test glance-operator-build-deploy-kuttl

Full PR test history. Your PR dashboard.

Somehow the Swift operator is getting multiple k8s services for webhook traffic [1]. As a result, at least some of them seem to be pointing to the wrong pod, so creation of Swift CRs fails and the job fails with it:

calling webhook "mswift.kb.io": failed to call webhook: Post "https://swift-operator-controller-manager-service.openstack-operators.svc:443/mutate-swift-openstack-org-v1beta1-swift?timeout=10s":
      dial tcp 10.130.0.197:9443: connect: connection refused'

I don't see other service operators having the same problem with k8s service pollution.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openstack-k8s-operators_glance-operator/684/pull-ci-openstack-k8s-operators-glance-operator-main-glance-operator-build-deploy-kuttl/1881723118666387456/artifacts/glance-operator-build-deploy-kuttl/openstack-k8s-operators-gather/artifacts/must-gather/quay-io-openstack-k8s-operators-openstack-must-gather-sha256-e8859d2563069511071235c64bbd90ad638782ed6f3d8a6ca1750d7be5c967db/namespaces/openstack-operators/services/

@abays
Contributor Author

abays commented Jan 21, 2025

Ah, I think I might understand what is happening. First, the glance_kuttl testing installs the Swift operator independently, and its webhooks get installed alongside it via OLM. That testing then seems to leave the Swift operator at least partially installed, so the webhooks linger. After that task, the OpenStack operator is installed and we deploy OpenStackControlPlane. But the new RHOSO install paradigm removes webhook enablement from the individual service operators, so the lingering MutatingWebhookConfiguration from the previous standalone Swift operator installation interferes: it tries to direct creation of the OpenStackControlPlane's derivative Swift resource to the now-nonexistent Swift operator webhook, and that's when you hit the error.

@fmount
Contributor

fmount commented Jan 22, 2025

Ok, thank you for all the details. I have to review the logs, but it should be easier given you already did that. I'm wondering why OpenStackControlPlane is installed in this context: kuttl should rely (at least for Glance) on a testing mechanism that is based on OLM for independent operators, with a control plane out of the picture.

@abays
Contributor Author

abays commented Jan 22, 2025

It looks like Prow does both an OpenStack operator install with an accompanying OpenStackControlPlane deploy [1], and then also runs the KUTTL tests [2].

[1] https://github.com/openshift/release/blob/0e26e4ceb3d8642eb19dd306f9c2a630d09deb41/ci-operator/config/openstack-k8s-operators/glance-operator/openstack-k8s-operators-glance-operator-main.yaml#L66-L78
[2] https://github.com/openshift/release/blob/0e26e4ceb3d8642eb19dd306f9c2a630d09deb41/ci-operator/config/openstack-k8s-operators/glance-operator/openstack-k8s-operators-glance-operator-main.yaml#L106-L117

...but it seems like the KUTTL [2] runs after the basic OpenStack operator deployment [1]. So I'm confused as to how [1] is running into the Swift webhooks from [2].

@abays
Contributor Author

abays commented Jan 22, 2025

Maybe it's actually the openstack-k8s-operators-test-build-deploy-kuttl workflow itself that does both KUTTL and then a standard OpenStack operator deploy [1]. In that case, KUTTL goes first and then is followed by the generic OpenStack operator and OpenStackControlPlane deploy, which fits my hypothesis above.

[1] https://github.com/openshift/release/blob/0e26e4ceb3d8642eb19dd306f9c2a630d09deb41/ci-operator/step-registry/openstack-k8s-operators/test/build-deploy-kuttl/openstack-k8s-operators-test-build-deploy-kuttl-workflow.yaml#L10-L11

@fmount
Contributor

fmount commented Jan 22, 2025

Ok, sorry, I'll review them carefully. My understanding of [1] was that build-deploy and kuttl were attached to two different workflows, but you're right, as we have the first deployment here [2].
However, the Prow execution flow looks weird, because [3] reports working kuttl tests (grep for PASS: in [3]).
I'm wondering whether those are flaky failures or whether we have a bug in the resource cleanup process between the workflows defined in [1].

[1] https://github.com/openshift/release/blob/0e26e4ceb3d8642eb19dd306f9c2a630d09deb41/ci-operator/config/openstack-k8s-operators/glance-operator/openstack-k8s-operators-glance-operator-main.yaml#L66-L78
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openstack-k8s-operators_glance-operator/684/pull-ci-openstack-k8s-operators-glance-operator-main-glance-operator-build-deploy-kuttl/1881723118666387456/artifacts/glance-operator-build-deploy-kuttl/openstack-k8s-operators-deploy/build-log.txt
[3] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openstack-k8s-operators_glance-operator/684/pull-ci-openstack-k8s-operators-glance-operator-main-glance-operator-build-deploy-kuttl/1881723118666387456/artifacts/glance-operator-build-deploy-kuttl/openstack-k8s-operators-kuttl/build-log.txt

@fmount
Contributor

fmount commented Jan 22, 2025

/test glance-operator-build-deploy-kuttl

@fmount
Contributor

fmount commented Jan 22, 2025

Maybe it's actually the openstack-k8s-operators-test-build-deploy-kuttl workflow itself that does both KUTTL and then a standard OpenStack operator deploy [1]. In that case, KUTTL goes first and then is followed by the generic OpenStack operator and OpenStackControlPlane deploy, which fits my hypothesis above.

[1] https://github.com/openshift/release/blob/0e26e4ceb3d8642eb19dd306f9c2a630d09deb41/ci-operator/step-registry/openstack-k8s-operators/test/build-deploy-kuttl/openstack-k8s-operators-test-build-deploy-kuttl-workflow.yaml#L10-L11

I'm wondering whether we really need it, as we have a tempest job that covers that part.

@fmount
Contributor

fmount commented Jan 22, 2025

@abays as per your theory openstack-k8s-operators/install_yamls#999 should solve the webhook resource cleanup. We might want to land that patch first and re-kick this job.

@abays
Contributor Author

abays commented Jan 22, 2025

@abays as per your theory openstack-k8s-operators/install_yamls#999 should solve the webhook resource cleanup. We might want to land that patch first and re-kick this job.

I think that will clean up the webhooks installed by the OpenStack operator, but it wouldn't help with the lingering Swift webhooks created by make glance_kuttl, because they lack the requisite labels [1]. And openstack_cleanup wouldn't be called until the very end of the OpenStack operator deploy, unless somehow glance_kuttl calls it when it finishes.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openstack-k8s-operators_glance-operator/684/pull-ci-openstack-k8s-operators-glance-operator-main-glance-operator-build-deploy-kuttl/1881723118666387456/artifacts/glance-operator-build-deploy-kuttl/openstack-k8s-operators-gather/artifacts/must-gather/quay-io-openstack-k8s-operators-openstack-must-gather-sha256-e8859d2563069511071235c64bbd90ad638782ed6f3d8a6ca1750d7be5c967db/webhooks/mutating/mswift.kb.io-gh2nm.yaml

@fmount
Contributor

fmount commented Jan 22, 2025

Ok, I see. I think we basically need to clean up resources as part of the glance_kuttl target, where I see we only clean up Glance and then run make kuttl_common_cleanup.

@abays
Contributor Author

abays commented Jan 22, 2025

It passed this time because the Swift operator somehow persisted beyond the make glance_kuttl execution, such that it was available during the subsequent OpenStack operator deploy workflow. The Swift Deployment lingered and was not replaced by the OpenStack operator's version:

  ownerReferences:
  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: false
    controller: false
    kind: ClusterServiceVersion
    name: swift-operator.v0.0.1
    uid: 802ed280-5757-4a00-9e7b-7b188cb50330

@fmount
Contributor

fmount commented Jan 22, 2025

/lgtm

@fmount
Contributor

fmount commented Jan 22, 2025

Thank you @abays. I'm merging this one as it improves the error handling. Hopefully, with openstack-k8s-operators/install_yamls#1000, we should be able to see the resource cleanup from a Swift perspective, and we can check it from other (still pending) PRs.

@openshift-merge-bot openshift-merge-bot bot merged commit 72c189b into openstack-k8s-operators:main Jan 22, 2025
8 checks passed