-
Notifications
You must be signed in to change notification settings - Fork 546
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(fix) Resolver: list CatSrc using client, instead of referring to registry-server cache #3349
Conversation
Picking up #2779 |
/hold I want to test this a little more before I open it up for reviews. |
Bolded to emphasize that the list of available catalogs is coming from a cache of registry-server clients, not a cache of CatalogSources. In other words, it wasn't reading stale cache entries, it was reading the wrong cache. Fixing this to actually be based on CatalogSources is a huge improvement. |
f7fea70
to
b8da86f
Compare
0d51401
to
488fd2c
Compare
1d98c49
to
24a8fbb
Compare
@benluddy thanks for chiming in! It's been a fun read for a few days, but I think I have a clearer understanding of what's going on now. Many different caches at play here. I'm thinking it's probably a good idea to revisit #2337 and the efforts around it (probably #2632 too ). Will behoove us to devote some effort in reeling down the code base, for ease of maintenance going forward. A very simple: That's not for this PR though 😬 /hold cancel |
24a8fbb
to
6168fc8
Compare
/approve /lgtm |
deprecationUpdated, err = c.updateDeprecatedStatus(ctx, s.Subscription()) | ||
if err != nil { | ||
return next, err | ||
} | ||
if healthUpdated || deprecationUpdated { | ||
if deprecationUpdated { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, this case we don't check for error... but the health case we do now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Daniel and I had a conversation about this here #3349 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you mean to say that this error is unchecked, I noticed it too, but left it alone since it isn't part of the PR. But since I'm in this area I could totally just add that in if that's what you were asking for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably not clear because the diff cuts off the func definition, but the error is unchecked because this func uses named returns, and this is the last potential error-causing statement, so that error will be returned if there is one. I'm not defending it, I really don't like named returns, just explaining why it looks like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah no that's sounds fine to me. It's one of those "there's 2 ways of doing this and both are right", but it's not related to this PR.
) | ||
|
||
const ( | ||
BundleLookupConditionPacked v1alpha1.BundleLookupConditionType = "BundleLookupNotPersisted" | ||
exclusionAnnotation string = "olm.operatorframework.io/exclude-global-namespace-resolution" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, this is now part of the "API"? Is this documented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's coming from this PR that added the feature (see ref).
So it's an existing feature, I'm just moving the usage to a different area in this PR. However, now that you bring it up, fyi I also have an experiment going on here to essentially redo the PR that added this feature, using only the change I made in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoided removing the og-source-provider
in this PR because it needs to be back ported and wanted to keep the scope small. But I think that makes a lot of sense as a follow up immediately.
6168fc8
to
95ec27c
Compare
…istry-server cache Using "available CatalogSources" information from the registry-client cache was causing cache inconsistency problems. This has showed up multiple times in production environments over the years, manifesting itself in the form of the all subscriptions in a namespace being transitioned into an error state when a Catalogsource that the cache claims to exist, has actually been deleted from the cluster, but the cache was not updated. The Subscriptions are transitioned to an error state because of the deleted catalogsource with the follwing error message: "message": "failed to populate resolver cache from source <deleted-catalogsource>: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp: lookup <deleted-catalogsource>.<ns>.svc on 172.....: no such host\"", "reason": "ErrorPreventedResolution", "status": "True", "type": "ResolutionFailed" This PR switches the information lookup from the cache, to using a client to list the CatalogSources present in the cluster.
95ec27c
to
65bb4b3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
ff9084a
[2788](operator-framework#2788) introduced the `OgSourceProvider` that allows for the exclusion of Catalogs in the global namespace. This `OgSourceProvider` was used by the resolver as the `SourceProvider` implementation. [3349](operator-framework#3349) modified the resolver, to list the CatalogSources in the cluster by itself on the basis of the annotation in the `OperatorGroup` in the resolving namespace, so that that information can be passed to the `RegistrySourceProvider`, which now list the CatalogSources using a client instead of reaching out to the registry-server cache. This PR removes the `OgSourceProvider`, since the feature is now being directly implemented in the resolver, and the `OgSourceProvider` implementation has become redundant. Note that the feature parity is still maintained, as attested by the successful execution of [these tests](https://github.com/operator-framework/operator-lifecycle-manager/blob/master/test/e2e/catalog_exclusion_test.go) introduced in 2788.
[2788](#2788) introduced the `OgSourceProvider` that allows for the exclusion of Catalogs in the global namespace. This `OgSourceProvider` was used by the resolver as the `SourceProvider` implementation. [3349](#3349) modified the resolver, to list the CatalogSources in the cluster by itself on the basis of the annotation in the `OperatorGroup` in the resolving namespace, so that that information can be passed to the `RegistrySourceProvider`, which now list the CatalogSources using a client instead of reaching out to the registry-server cache. This PR removes the `OgSourceProvider`, since the feature is now being directly implemented in the resolver, and the `OgSourceProvider` implementation has become redundant. Note that the feature parity is still maintained, as attested by the successful execution of [these tests](https://github.com/operator-framework/operator-lifecycle-manager/blob/master/test/e2e/catalog_exclusion_test.go) introduced in 2788.
[2788](operator-framework/operator-lifecycle-manager#2788) introduced the `OgSourceProvider` that allows for the exclusion of Catalogs in the global namespace. This `OgSourceProvider` was used by the resolver as the `SourceProvider` implementation. [3349](operator-framework/operator-lifecycle-manager#3349) modified the resolver, to list the CatalogSources in the cluster by itself on the basis of the annotation in the `OperatorGroup` in the resolving namespace, so that that information can be passed to the `RegistrySourceProvider`, which now list the CatalogSources using a client instead of reaching out to the registry-server cache. This PR removes the `OgSourceProvider`, since the feature is now being directly implemented in the resolver, and the `OgSourceProvider` implementation has become redundant. Note that the feature parity is still maintained, as attested by the successful execution of [these tests](https://github.com/operator-framework/operator-lifecycle-manager/blob/master/test/e2e/catalog_exclusion_test.go) introduced in 2788. Upstream-repository: operator-lifecycle-manager Upstream-commit: 183a7f24c1cbb9a94e7c612ab62689519c001f7f
[2788](operator-framework/operator-lifecycle-manager#2788) introduced the `OgSourceProvider` that allows for the exclusion of Catalogs in the global namespace. This `OgSourceProvider` was used by the resolver as the `SourceProvider` implementation. [3349](operator-framework/operator-lifecycle-manager#3349) modified the resolver, to list the CatalogSources in the cluster by itself on the basis of the annotation in the `OperatorGroup` in the resolving namespace, so that that information can be passed to the `RegistrySourceProvider`, which now list the CatalogSources using a client instead of reaching out to the registry-server cache. This PR removes the `OgSourceProvider`, since the feature is now being directly implemented in the resolver, and the `OgSourceProvider` implementation has become redundant. Note that the feature parity is still maintained, as attested by the successful execution of [these tests](https://github.com/operator-framework/operator-lifecycle-manager/blob/master/test/e2e/catalog_exclusion_test.go) introduced in 2788. Upstream-repository: operator-lifecycle-manager Upstream-commit: 183a7f24c1cbb9a94e7c612ab62689519c001f7f
[2788](operator-framework/operator-lifecycle-manager#2788) introduced the `OgSourceProvider` that allows for the exclusion of Catalogs in the global namespace. This `OgSourceProvider` was used by the resolver as the `SourceProvider` implementation. [3349](operator-framework/operator-lifecycle-manager#3349) modified the resolver, to list the CatalogSources in the cluster by itself on the basis of the annotation in the `OperatorGroup` in the resolving namespace, so that that information can be passed to the `RegistrySourceProvider`, which now list the CatalogSources using a client instead of reaching out to the registry-server cache. This PR removes the `OgSourceProvider`, since the feature is now being directly implemented in the resolver, and the `OgSourceProvider` implementation has become redundant. Note that the feature parity is still maintained, as attested by the successful execution of [these tests](https://github.com/operator-framework/operator-lifecycle-manager/blob/master/test/e2e/catalog_exclusion_test.go) introduced in 2788. Upstream-repository: operator-lifecycle-manager Upstream-commit: 183a7f24c1cbb9a94e7c612ab62689519c001f7f
[2788](operator-framework/operator-lifecycle-manager#2788) introduced the `OgSourceProvider` that allows for the exclusion of Catalogs in the global namespace. This `OgSourceProvider` was used by the resolver as the `SourceProvider` implementation. [3349](operator-framework/operator-lifecycle-manager#3349) modified the resolver, to list the CatalogSources in the cluster by itself on the basis of the annotation in the `OperatorGroup` in the resolving namespace, so that that information can be passed to the `RegistrySourceProvider`, which now list the CatalogSources using a client instead of reaching out to the registry-server cache. This PR removes the `OgSourceProvider`, since the feature is now being directly implemented in the resolver, and the `OgSourceProvider` implementation has become redundant. Note that the feature parity is still maintained, as attested by the successful execution of [these tests](https://github.com/operator-framework/operator-lifecycle-manager/blob/master/test/e2e/catalog_exclusion_test.go) introduced in 2788. Upstream-repository: operator-lifecycle-manager Upstream-commit: 183a7f24c1cbb9a94e7c612ab62689519c001f7f
[2788](operator-framework/operator-lifecycle-manager#2788) introduced the `OgSourceProvider` that allows for the exclusion of Catalogs in the global namespace. This `OgSourceProvider` was used by the resolver as the `SourceProvider` implementation. [3349](operator-framework/operator-lifecycle-manager#3349) modified the resolver, to list the CatalogSources in the cluster by itself on the basis of the annotation in the `OperatorGroup` in the resolving namespace, so that that information can be passed to the `RegistrySourceProvider`, which now list the CatalogSources using a client instead of reaching out to the registry-server cache. This PR removes the `OgSourceProvider`, since the feature is now being directly implemented in the resolver, and the `OgSourceProvider` implementation has become redundant. Note that the feature parity is still maintained, as attested by the successful execution of [these tests](https://github.com/operator-framework/operator-lifecycle-manager/blob/master/test/e2e/catalog_exclusion_test.go) introduced in 2788. Upstream-repository: operator-lifecycle-manager Upstream-commit: 183a7f24c1cbb9a94e7c612ab62689519c001f7f
Using "available CatalogSources" information from the registry-client cache was causing cache inconsistency problems.
This has showed up multiple times in production environments over the years, manifesting itself in the form of the all subscriptions in a namespace being transitioned into an error state when a Catalogsource that the cache claims to exist, has actually been deleted from the cluster, but the cache was not updated.
The Subscriptions are transitioned to an error state because of the deleted catalogsource with the follwing error message:
"message": "failed to populate resolver cache from source : failed to list
bundles: rpc error: code = Unavailable desc = connection error: desc = "transport:
Error while dialing dial tcp: lookup ..svc on 172.....: no such host"",
"reason": "ErrorPreventedResolution",
"status": "True",
"type": "ResolutionFailed"
This PR switches the information lookup from the cache, to using a client to list the CatalogSources present in the cluster.
Description of the change:
Motivation for the change:
Architectural changes:
Testing remarks:
Reviewer Checklist
/doc
[FLAKE]
are truly flaky and have an issue