Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Readiness doesn't account for metadata.observedGeneration #3474

Closed
tombentley opened this issue Sep 17, 2021 · 10 comments
Closed

Readiness doesn't account for metadata.observedGeneration #3474

tombentley opened this issue Sep 17, 2021 · 10 comments

Comments

@tombentley
Copy link
Contributor

Readiness and OpenshiftReadiness don't check, when using the status of resources, whether the status.observedGeneration==metadata.generation. Therefore there is the risk that, if the resource is updated and then readiness-checked before a controller is able to act on the update, the readiness will be evaluated incorrectly.

@shawkins
Copy link
Contributor

@tombentley and others, I don't quite follow what is needed here. The current Readiness check is just a point in time check - there's no expectation that it will wait for a controller to act. Can you provide additional context about where you are using the readiness - and if you need that to act over time could that be satisfied instead with waitUntilReady or waitUntilCondition?

@tombentley
Copy link
Contributor Author

@shawkins sorry, your reply got lost in my inbox... The problem is a resource might:

  1. be "ready"
  2. then be modified (i.e. spec changed)
  3. before the controller has reacted to the modification (thus before the controller has updated the status), readiness is checked again.

Kubernetes uses the status.observedGeneration to provide a way of telling if the status is out of date with respect to the spec (not all built-in resources have an observedGeneration in their status, but some do and it's a recommended pattern for operators): If status.observedGeneration == metadata.generation then you know the status applies to the current spec. If they're different then you know the status is stale and can't be trusted (presumably it will change at some time in the future once the relevant controller has acted and updated the status).

Fabric8's Readiness class doesn't do this check right now (on those resources which support observedGeneration), so in step 3 Readiness will decide the resource is "ready" when it merely was ready, but isn't ready with respect to the most recent change, and therefore is not actually "ready".

@shawkins
Copy link
Contributor

Sorry, I think most of my confusion is in terms of marking this as a bug. It sounds more like you are looking for an enhancement or should be implemented with your own condition. To make sure we're on the same page there are three states you are interested in:

var resource = client.get();
if (resource.getMetadata().getGeneration() != resource.getStatus().getObservedGeneration()) {
  // neither ready nor not ready
} else if (Readiness.getInstance().isReady(resource)) {
  // ready
} else {
  // not ready
}

However this additional state may not be meaningful depending on your design. If you are instead reacting to the events from the resource in question, you will get an event once it updates the status based upon the spec modification. That new event will cause you to re-evaluate the readiness.

The difficulty with putting something like this into the Readiness utility is:

  1. it's currently two valued - arguable there should be an unknown result to match status condition semantics, but in your case it sounds like it's also meaningful why it's unknown.
  2. observed generation is not a standard field, nor is there a base class for status. The status object would need to be dealt with in the raw form, or a new marker interface etc. would need to be introduced. It looks like Automatically handle .status.observedGeneration  operator-framework/java-operator-sdk#563 already moved down the latter path - which we probably need to coordinate on.

This issue could be highly related to #2718 - in that making the Readiness check pluggable might give you the dsl support you want.

@tombentley
Copy link
Contributor Author

@shawkins I guess I'm wondering why, for those resource kinds where Readiness and OpenshiftReadiness apply and where the status has an observedGeneration (I think that's Deployment, ReplicationController, StatefulSet, ReplicaSet and DeploymentConfig, but not, for example, Pod or Endpoints), this check isn't being done already. I.e. how do you know Readiness and OpenshiftReadiness are not returning false positives?

But then I don't really think of readiness as a three-valued quality. To me, a resource is only ready when status.observedGeneration == metadata.generation && <resource-specific-condition>. I.e. weird indeterminate states are just not "ready" (I don't care whether the resource was ready).

The fact that observedGeneration is not a standard field is precisely why it should be included in Readiness: If it's not encapsulated in the Fabric8 API then you're forcing people to a) know that Readiness doesn't do this itself and b) implement it themselves.

@shawkins
Copy link
Contributor

shawkins commented Nov 1, 2021

this check isn't being done already. I.e. how do you know Readiness and OpenshiftReadiness are not returning false positives?

It's not a false positive - readiness in general is independent of spec generations. For example when a resource provides a status condition for readiness, there is no guideline that a user should also update that status condition when they modify the spec. The resource remains in its given readiness state until the controller determines a change is needed - and just because there is a spec modification doesn't mean that there will be a transition. I'm also trying to convey that from an eventually consistent perspective it shouldn't matter if there are possible pending transitions in readiness or other checks - you know that it will be re-evaluated once the appropriate event is processed or the next timed reconciliation loop occurs. @manusa @rohanKanojia can you weigh in here in case I'm off base?

Can you elaborate on how your logic needs to handle this situation? What does your logic do if something is not ready? And if it would be sufficient to support #2718 to allow for extensions / customizations to the Readiness checks?

The fact that observedGeneration is not a standard field is precisely why it should be included in Readiness: If it's not encapsulated in the Fabric8 API then you're forcing people to a) know that Readiness doesn't do this itself and b) implement it themselves.

I agree that observedGeneration should be handled in fabric8 - we need to get with the operator sdk folks on a shared solution. I'm just not sure about how or whether it belongs in the readiness check.

@tombentley
Copy link
Contributor Author

you know that it will be re-evaluated once the appropriate event is processed or the next timed reconciliation loop occurs.

In our use case we've just updated the resource and we want to know that the change has taken effect, and the resource is still ready after the change has been applied, before proceeding with our operator logic. We initially just checked for the ready status, but that was racy for precisely the reasons discussed here, so we added our own check against the observed generation (e.g. strimzi/strimzi-kafka-operator#1814), and it was when we were looking at refactoring that in strimzi/strimzi-kafka-operator#5525 that I started with wonder why we were having to do this ourselves.

I take your point about whether a change which doesn't actually change the readiness should be counted as a transition. For our purposes we don't care -- it's the eventual state that we're waiting for, but other use cases might differ.

@manusa
Copy link
Member

manusa commented Nov 2, 2021

From my point of view I agree with this statement:

To me, a resource is only ready when status.observedGeneration == metadata.generation && .

I'm not sure about this, but it's my understanding that there's no standard defining the "readiness" of a specific resource. This is a subjective concept.

So IMHO a resource is ready when the subjective condition is matched and the status where that condition is checked is consistent with the current generation/version/lock of the resource (reconciled). (again I'm assuming there's no spec defining this).

@manusa manusa removed the bug label Nov 2, 2021
@manusa
Copy link
Member

manusa commented Nov 2, 2021

After a meeting with @shawkins the following point was clarified:

It's not a false positive - readiness in general is independent of spec generations. For example when a resource provides a status condition for readiness, there is no guideline that a user should also update that status condition when they modify the spec. The resource remains in its given readiness state until the controller determines a change is needed - and just because there is a spec modification doesn't mean that there will be a transition.

A spec change will imply a status change once the controller processes the resource. As stated, the fact that the spec has changed doesn't imply that the resource is unready (maybe the change doesn't affect the readiness condition for that given resource). It's only after the controller has acted that we should then verify the readiness (stale status !== unready state).

As a future step, in order to provide better support for the observedGeneration field, we could add the interface to our Status classes. The Golang generators could analyze the status structs and mark the status with the new interface in case the field exists. This would be helpful for downstream projects (e.g. JOSDK #563 #628) (@csviri @metacosm)

@metacosm
Copy link
Collaborator

metacosm commented Nov 4, 2021

Readiness is a tricky issue and I have been trying to figure out what the best practices are around the topic at some point and promptly gave up because the information I could find was contradictory or outdated… I have read https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties but these have changed over time so it's difficult to get an accurate picture of what people are supposed to do… then you see this: kubernetes/community#4521 and comments such as this one: https://github.com/kubernetes/community/pull/4521/files#r413087880… in essence, if you're not following the issue closely, good luck figuring things out :)

@stale
Copy link

stale bot commented Feb 2, 2022

This issue has been automatically marked as stale because it has not had any activity since 90 days. It will be closed if no further activity occurs within 7 days. Thank you for your contributions!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants