Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(fix) Resolver: list CatSrc using client, instead of referring to registry-server cache #3349

Merged
merged 1 commit into from
Aug 7, 2024

Conversation

anik120
Copy link
Contributor

@anik120 anik120 commented Jul 26, 2024

Using "available CatalogSources" information from the registry-client cache was causing cache inconsistency problems.

This has showed up multiple times in production environments over the years, manifesting itself in the form of the all subscriptions in a namespace being transitioned into an error state when a Catalogsource that the cache claims to exist, has actually been deleted from the cluster, but the cache was not updated.

The Subscriptions are transitioned to an error state because of the deleted catalogsource with the follwing error message:

"message": "failed to populate resolver cache from source : failed to list
bundles: rpc error: code = Unavailable desc = connection error: desc = "transport:
Error while dialing dial tcp: lookup ..svc on 172.....: no such host"",
"reason": "ErrorPreventedResolution",
"status": "True",
"type": "ResolutionFailed"

This PR switches the information lookup from the cache, to using a client to list the CatalogSources present in the cluster.

Description of the change:

Motivation for the change:

Architectural changes:

Testing remarks:

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Bug fixes are accompanied by regression test(s)
  • e2e tests and flake fixes are accompanied evidence of flake testing, e.g. executing the test 100(0) times
  • tech debt/todo is accompanied by issue link(s) in comments in the surrounding code
  • Tests are comprehensible, e.g. Ginkgo DSL is being used appropriately
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive
  • Tests marked as [FLAKE] are truly flaky and have an issue
  • Code is properly formatted

@anik120
Copy link
Contributor Author

anik120 commented Jul 26, 2024

Picking up #2779

@openshift-ci openshift-ci bot requested review from awgreene and dinhxuanvu July 26, 2024 20:32
@anik120
Copy link
Contributor Author

anik120 commented Jul 26, 2024

/hold

I want to test this a little more before I open it up for reviews.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 26, 2024
@anik120 anik120 requested review from dtfranz, kevinrizza, ankitathomas and tmshort and removed request for awgreene and dinhxuanvu July 26, 2024 20:33
@anik120 anik120 self-assigned this Jul 26, 2024
@benluddy
Copy link
Contributor

This PR switches the information lookup from the cache, to using a client to list the CatalogSources present in the cluster.

Bolded to emphasize that the list of available catalogs is coming from a cache of registry-server clients, not a cache of CatalogSources. In other words, it wasn't reading stale cache entries, it was reading the wrong cache. Fixing this to actually be based on CatalogSources is a huge improvement.

@anik120 anik120 force-pushed the fix-stale-cache branch 2 times, most recently from f7fea70 to b8da86f Compare July 29, 2024 17:50
@anik120 anik120 force-pushed the fix-stale-cache branch 5 times, most recently from 0d51401 to 488fd2c Compare August 6, 2024 15:20
@anik120 anik120 changed the title (fix) List catalogsource using client, instead of referring to cache (fix) Resolver: list CatSrc using client, instead of referring to registry-server cache Aug 6, 2024
@anik120 anik120 force-pushed the fix-stale-cache branch 2 times, most recently from 1d98c49 to 24a8fbb Compare August 6, 2024 17:52
@anik120
Copy link
Contributor Author

anik120 commented Aug 6, 2024

@benluddy thanks for chiming in! It's been a fun read for a few days, but I think I have a clearer understanding of what's going on now. Many different caches at play here. I'm thinking it's probably a good idea to revisit #2337 and the efforts around it (probably #2632 too ). Will behoove us to devote some effort in reeling down the code base, for ease of maintenance going forward.

A very simple: 1. list all the resources needed from the cluster, 2. convert them to the desired formats (Installables, or whatever the latest term is), 3. "Solve", without too many caches at play sounds much more maintainable.

That's not for this PR though 😬
Opening up the PR for reviews.

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 6, 2024
@bentito
Copy link
Contributor

bentito commented Aug 7, 2024

/approve /lgtm

deprecationUpdated, err = c.updateDeprecatedStatus(ctx, s.Subscription())
if err != nil {
return next, err
}
if healthUpdated || deprecationUpdated {
if deprecationUpdated {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this case we don't check for error... but the health case we do now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Daniel and I had a conversation about this here #3349 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you mean to say that this error is unchecked, I noticed it too, but left it alone since it isn't part of the PR. But since I'm in this area I could totally just add that in if that's what you were asking for.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably not clear because the diff cuts off the func definition, but the error is unchecked because this func uses named returns, and this is the last potential error-causing statement, so that error will be returned if there is one. I'm not defending it, I really don't like named returns, just explaining why it looks like that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah no that's sounds fine to me. It's one of those "there's 2 ways of doing this and both are right", but it's not related to this PR.

)

const (
BundleLookupConditionPacked v1alpha1.BundleLookupConditionType = "BundleLookupNotPersisted"
exclusionAnnotation string = "olm.operatorframework.io/exclude-global-namespace-resolution"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this is now part of the "API"? Is this documented?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's coming from this PR that added the feature (see ref).

So it's an existing feature, I'm just moving the usage to a different area in this PR. However, now that you bring it up, fyi I also have an experiment going on here to essentially redo the PR that added this feature, using only the change I made in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoided removing the og-source-provider in this PR because it needs to be back ported and wanted to keep the scope small. But I think that makes a lot of sense as a follow up immediately.

…istry-server cache

Using "available CatalogSources" information from the registry-client cache was causing
cache inconsistency problems.

This has showed up multiple times in production environments over the years, manifesting
itself in the form of the all subscriptions in a namespace being transitioned into an error
state when a Catalogsource that the cache claims to exist, has actually been deleted from the
cluster, but the cache was not updated.

The Subscriptions are transitioned to an error state because of the deleted catalogsource
with the follwing error message:

"message": "failed to populate resolver cache from source <deleted-catalogsource>: failed to list
bundles: rpc error: code = Unavailable desc = connection error: desc = \"transport:
Error while dialing dial tcp: lookup <deleted-catalogsource>.<ns>.svc on 172.....: no such host\"",
                "reason": "ErrorPreventedResolution",
                "status": "True",
                "type": "ResolutionFailed"

This PR switches the information lookup from the cache, to using a client to list the CatalogSources
present in the cluster.
Copy link
Contributor

@dtfranz dtfranz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 7, 2024
@dtfranz dtfranz added this pull request to the merge queue Aug 7, 2024
Merged via the queue into operator-framework:master with commit ff9084a Aug 7, 2024
12 checks passed
anik120 added a commit to anik120/operator-lifecycle-manager that referenced this pull request Aug 8, 2024
[2788](operator-framework#2788) introduced the `OgSourceProvider`
that allows for the exclusion of Catalogs in the global namespace. This `OgSourceProvider`
was used by the resolver as the `SourceProvider` implementation.

[3349](operator-framework#3349) modified the resolver, to list the CatalogSources
in the cluster by itself on the basis of the annotation in the `OperatorGroup`
in the resolving namespace, so that that information can be passed to the `RegistrySourceProvider`,
which now list the CatalogSources using a client  instead of reaching out to the
registry-server cache.

This PR removes the `OgSourceProvider`, since the feature is now being directly implemented
in the resolver, and the `OgSourceProvider` implementation has become redundant.

Note that the feature parity is still maintained, as attested by the successful execution
of [these tests](https://github.com/operator-framework/operator-lifecycle-manager/blob/master/test/e2e/catalog_exclusion_test.go) introduced in 2788.
@anik120 anik120 mentioned this pull request Aug 8, 2024
11 tasks
github-merge-queue bot pushed a commit that referenced this pull request Sep 12, 2024
[2788](#2788) introduced the `OgSourceProvider`
that allows for the exclusion of Catalogs in the global namespace. This `OgSourceProvider`
was used by the resolver as the `SourceProvider` implementation.

[3349](#3349) modified the resolver, to list the CatalogSources
in the cluster by itself on the basis of the annotation in the `OperatorGroup`
in the resolving namespace, so that that information can be passed to the `RegistrySourceProvider`,
which now list the CatalogSources using a client  instead of reaching out to the
registry-server cache.

This PR removes the `OgSourceProvider`, since the feature is now being directly implemented
in the resolver, and the `OgSourceProvider` implementation has become redundant.

Note that the feature parity is still maintained, as attested by the successful execution
of [these tests](https://github.com/operator-framework/operator-lifecycle-manager/blob/master/test/e2e/catalog_exclusion_test.go) introduced in 2788.
openshift-bot pushed a commit to openshift-bot/operator-framework-olm that referenced this pull request Sep 13, 2024
[2788](operator-framework/operator-lifecycle-manager#2788) introduced the `OgSourceProvider`
that allows for the exclusion of Catalogs in the global namespace. This `OgSourceProvider`
was used by the resolver as the `SourceProvider` implementation.

[3349](operator-framework/operator-lifecycle-manager#3349) modified the resolver, to list the CatalogSources
in the cluster by itself on the basis of the annotation in the `OperatorGroup`
in the resolving namespace, so that that information can be passed to the `RegistrySourceProvider`,
which now list the CatalogSources using a client  instead of reaching out to the
registry-server cache.

This PR removes the `OgSourceProvider`, since the feature is now being directly implemented
in the resolver, and the `OgSourceProvider` implementation has become redundant.

Note that the feature parity is still maintained, as attested by the successful execution
of [these tests](https://github.com/operator-framework/operator-lifecycle-manager/blob/master/test/e2e/catalog_exclusion_test.go) introduced in 2788.

Upstream-repository: operator-lifecycle-manager
Upstream-commit: 183a7f24c1cbb9a94e7c612ab62689519c001f7f
openshift-bot pushed a commit to openshift-bot/operator-framework-olm that referenced this pull request Sep 21, 2024
[2788](operator-framework/operator-lifecycle-manager#2788) introduced the `OgSourceProvider`
that allows for the exclusion of Catalogs in the global namespace. This `OgSourceProvider`
was used by the resolver as the `SourceProvider` implementation.

[3349](operator-framework/operator-lifecycle-manager#3349) modified the resolver, to list the CatalogSources
in the cluster by itself on the basis of the annotation in the `OperatorGroup`
in the resolving namespace, so that that information can be passed to the `RegistrySourceProvider`,
which now list the CatalogSources using a client  instead of reaching out to the
registry-server cache.

This PR removes the `OgSourceProvider`, since the feature is now being directly implemented
in the resolver, and the `OgSourceProvider` implementation has become redundant.

Note that the feature parity is still maintained, as attested by the successful execution
of [these tests](https://github.com/operator-framework/operator-lifecycle-manager/blob/master/test/e2e/catalog_exclusion_test.go) introduced in 2788.

Upstream-repository: operator-lifecycle-manager
Upstream-commit: 183a7f24c1cbb9a94e7c612ab62689519c001f7f
openshift-bot pushed a commit to openshift-bot/operator-framework-olm that referenced this pull request Sep 25, 2024
[2788](operator-framework/operator-lifecycle-manager#2788) introduced the `OgSourceProvider`
that allows for the exclusion of Catalogs in the global namespace. This `OgSourceProvider`
was used by the resolver as the `SourceProvider` implementation.

[3349](operator-framework/operator-lifecycle-manager#3349) modified the resolver, to list the CatalogSources
in the cluster by itself on the basis of the annotation in the `OperatorGroup`
in the resolving namespace, so that that information can be passed to the `RegistrySourceProvider`,
which now list the CatalogSources using a client  instead of reaching out to the
registry-server cache.

This PR removes the `OgSourceProvider`, since the feature is now being directly implemented
in the resolver, and the `OgSourceProvider` implementation has become redundant.

Note that the feature parity is still maintained, as attested by the successful execution
of [these tests](https://github.com/operator-framework/operator-lifecycle-manager/blob/master/test/e2e/catalog_exclusion_test.go) introduced in 2788.

Upstream-repository: operator-lifecycle-manager
Upstream-commit: 183a7f24c1cbb9a94e7c612ab62689519c001f7f
openshift-bot pushed a commit to openshift-bot/operator-framework-olm that referenced this pull request Sep 26, 2024
[2788](operator-framework/operator-lifecycle-manager#2788) introduced the `OgSourceProvider`
that allows for the exclusion of Catalogs in the global namespace. This `OgSourceProvider`
was used by the resolver as the `SourceProvider` implementation.

[3349](operator-framework/operator-lifecycle-manager#3349) modified the resolver, to list the CatalogSources
in the cluster by itself on the basis of the annotation in the `OperatorGroup`
in the resolving namespace, so that that information can be passed to the `RegistrySourceProvider`,
which now list the CatalogSources using a client  instead of reaching out to the
registry-server cache.

This PR removes the `OgSourceProvider`, since the feature is now being directly implemented
in the resolver, and the `OgSourceProvider` implementation has become redundant.

Note that the feature parity is still maintained, as attested by the successful execution
of [these tests](https://github.com/operator-framework/operator-lifecycle-manager/blob/master/test/e2e/catalog_exclusion_test.go) introduced in 2788.

Upstream-repository: operator-lifecycle-manager
Upstream-commit: 183a7f24c1cbb9a94e7c612ab62689519c001f7f
perdasilva pushed a commit to perdasilva/operator-framework-olm that referenced this pull request Jan 10, 2025
[2788](operator-framework/operator-lifecycle-manager#2788) introduced the `OgSourceProvider`
that allows for the exclusion of Catalogs in the global namespace. This `OgSourceProvider`
was used by the resolver as the `SourceProvider` implementation.

[3349](operator-framework/operator-lifecycle-manager#3349) modified the resolver, to list the CatalogSources
in the cluster by itself on the basis of the annotation in the `OperatorGroup`
in the resolving namespace, so that that information can be passed to the `RegistrySourceProvider`,
which now list the CatalogSources using a client  instead of reaching out to the
registry-server cache.

This PR removes the `OgSourceProvider`, since the feature is now being directly implemented
in the resolver, and the `OgSourceProvider` implementation has become redundant.

Note that the feature parity is still maintained, as attested by the successful execution
of [these tests](https://github.com/operator-framework/operator-lifecycle-manager/blob/master/test/e2e/catalog_exclusion_test.go) introduced in 2788.

Upstream-repository: operator-lifecycle-manager
Upstream-commit: 183a7f24c1cbb9a94e7c612ab62689519c001f7f
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants