Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Condition guidance #4521

Merged
merged 9 commits into from
Sep 24, 2020
Merged

Conversation

evankanderson
Copy link
Contributor

Which issue(s) this PR fixes:
Fixes #4471

I took a first stab at updating these conventions based on kubernetes-sigs/cluster-api#1658 (comment) and kubernetes-sigs/gateway-api#47 (comment).

/assign @smarterclayton @liggitt

I'm happy to recommend different top-level conditions type names. Note that I recommended Succeeded over Complete as a top-level condition, because Complete doesn't indicate whether the Job succeeded or failed.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 14, 2020
@k8s-ci-robot
Copy link
Contributor

Hi @evankanderson. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. area/developer-guide Issues or PRs related to the developer guide sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. labels Feb 14, 2020
@youngnick
Copy link

When looking at this across a bunch of types, I wondered how we could possibly reconcile the history of Conditions in a way that would be machine-parseable easily. The best I could come up with was recommending that new Conditions add a polarity field (or whatever the name ends up being) that indicates if Status True or False is the healthy condition. If that is assumed to be False when not present, then a library could make a fair guess at what the Status means regardless of when it was created.

In no way intended to block this PR, which I fully support.

@mrbobbytables
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 23, 2020
@evankanderson
Copy link
Contributor Author

evankanderson commented Feb 24, 2020

(I thought I'd sent an update from mobile while I was on vacation, but I either hallucinated it or technology ate it)

I'd like to defer the discussion of how to migrate existing controllers into the new pattern to a subsequent PR, possibly as a KEP or introduction of common infrastructure for managing Conditions. However, since you brought it up, my preference would be to introduce a severity field which could be used for both ErrorIfTrue and ErrorIfFalse signalling, but also indicate that some Conditions provide information without signalling a fatal error.

@danehans
Copy link

@evankanderson thank you for putting this PR together. kubernetes-sigs/gateway-api#104 is on hold until this merges. @smarterclayton @liggitt is anything needed to move this PR forward?

@therealmitchconnors
Copy link

These arguments are compelling, but run opposite of previous guidance used for projects like kstatus. Can we get sig-architecture clarification on whether healthy conditions should be true or false?

@erictune
Copy link
Member

Big fan of this change.

Comment on lines 394 to 395
An example of an oscillating condition type is `Ready` (despite it running
afoul of current guidance), which indicates the object was believed to be fully
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this changes the guidance around positive/negative conditions, we can drop the (despite it running afoul of current guidance) parenthetical

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this, updated.

Copy link
Contributor Author

@evankanderson evankanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you approve this, or is there someone else I should chase?

Comment on lines 394 to 395
An example of an oscillating condition type is `Ready` (despite it running
afoul of current guidance), which indicates the object was believed to be fully
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this, updated.

@youngnick
Copy link

@therealmitchconnors This PR is codifying the answers you'll get from people if you ask around into the official docs, so we can all be on the same page from now on. Any stuff that's already done will stay how it is, but when I asked sig-arch and sig-api-machinery, I was told this was the new way. I agree that having someone confirm that here would be ideal though.

@liggitt
Copy link
Member

liggitt commented Apr 1, 2020

Comment on lines 334 to 336
* Conditions should have a consistent polarity. **A "positive" polarity where
"True" indicates normal operation and "False" indicates a failure is
recommended.** (Note that this is an explicit reversal from earlier
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarify this is a guide for choosing condition names and polarity, not a guide for consuming arbitrary conditions. As written, it sounds like I can write a client that assumes all conditions with a False status represent failure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to reach that point sometime in the future, but you're correct that we can't do this for a long time, if ever.

Adjusted this to suggest that consistency within a resource was most important consideration, and positive polarity should be chosen if there isn't already a consistent polarity.

may require adjusting condition type names, for example changing the type
name from "QuotaExhausted" to "ResourcesAvailable".

* The `Unknown` status is the default, and typically indicates that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The term default is overloaded... this makes it sound as though a condition that omits the status field has its status defaulted to Unknown, which is not the case. The phrase later is clearer:

The absence of a condition should be interpreted the same as Unknown.

Comment on lines 348 to 351
* It's helpful to have a common top-level condition which summarizes more
detailed conditions. Simple consumers may simply query the top-level
condition. The `Ready` and `Succeeded` condition types may be used for
long-running and bounded-execution objects, respectively.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this also needs to be clearly oriented at the condition producer; as phrased it could be misunderstood by a simple consumer to mean that they can consistently expect to be able to use Ready / Succeeded conditions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this is the long-term situation that I'd like to steer the world towards, but you're correct that this could be read by many consumers to demand more from API implementers than is likely in the short term.

Amended, PTAL.

Comment on lines 367 to 373
* For state transitions which take a long period of time, use a condition
which indicates that the transition succeeded ("Provisioned" or
"SizeAllocated") with an intermediate state of `Unknown`. Note that this
pattern makes an explicit distinction between "condition not set" (which
means the reconciler hasn't seen the latest state) and "condition set to
Unknown" (which means that the reconciler has seen the state but hasn't
finished the reconciliation).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess your goal is that some watcher can observe progress and know that it's done? You should document how long after the operation the condition should be left there.

But I'm not convinced this is the way to do it, what if there are two resizes back to back? I don't think you can really have conditions 1:1 with operations.

I think "Resizing=true" that is present during the resize and removed after the resize is also a legitimate--and possibly less confusing--way to represent this. Seeing "ResizingSucceeded=Unknown" is disconcerting, even if it's technically accurate.

Maybe the way to strengthen your point is to talk about how the failure of the operation will be represented. I think you can't really do that with a "Reszing" condition, but if you have another way of representing failure, it's OK. If you need a way to represent failure, then "ResizingSucceeded=false" works better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, my concern is that the watcher can determine whether progress is being made in a generic way. (Particularly in the presence of CRDs and a variety of types, detecting "controller is stuck" vs "backend system is slow" can be quite difficult.)

I'd suggest not removing the condition when the operation is done (hence, Provisioned or SizeAllocated). The goal is twofold:

  1. Presence of the condition along with LastTransitionTime helps to debug whether the controller has seen the most recent spec.

  2. Ability to signal success and failure of these operations in a generic way that will stick with the resource seems useful for the "hey, could you help debug this with me" case -- a resize failed condition that disappears after 2 hours might be gone by the time an expert takes a look at the correct resource in the cluster.

As for the human ergonomics, I was hoping that the Reason and Message fields would suffice:

status:
  conditions:
  - name: Resizing
    status: True
    Reason: FileSystemExtensionInProgress
    Message: The system is currently extending the filesystem on the disk after resize.
status:
  conditions:
  - name: SizeAvailable
    status: Unknown
    Reason: FileSystemExtensionInProgress
    Message: The system is currently extending the filesystem on the disk after resize.

(Note that I changed this to SizeAvailable because there's probably a distinction between allocation failures and filesystem extension failures.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a way to say this is something like "conditions should not be transient, they should more or less always be on the object with the appropriate value".

But this doesn't rule out something like "Resizing", you can just set it False when the object is the correct size. At first it seems like people would need to look at the Reason to know if it's a "good" false (no resize needed) or a "bad" false (resize failed), but in the latter case we can just leave it in "resizing" (Reason is e.g. "stuck", message is "resize operation can't complete due to error foo").

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That works for me -- my concern was more specifically that the set of signalled Conditions not be transient, which this addresses. PTAL!

@evankanderson
Copy link
Contributor Author

As a followup to this PR, I notice that Eric's analysis highlights a number of cases where developers are prefixing their Condition types, ala AppDBInstanceReady for a resource named AppDBOperator. Are we tired of swinging at this, or would a subsequent PR along the lines of smarterclayton's comment be a good follow-on?

Conditions summarize characteristics of object lifecycle that must be generic across multiple types, have a need to convey both state and human message, and can be extended by third parties to add more nuance to existing lifecycle.

An object in Kube is a distributed state machine. The state specified to that object should always be fields. Some objects have consistent fields that describe status of the state machine. One of those is observedGeneration. Another one is conditions. Some conditions define generic, multi-object state machine characteristics like Available, Progressing, Ready.

(I don't want to derail this PR with extra load at this point, but it's been bugging me since I came back to this.)

could be `Failed`. A `True` status for `Failed` would imply failure with no
retry. An object that was still active would generally not have a `Failed`
condition.
An example of an oscillating condition type is `Ready`, which indicates the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resizing is another natural example, but no need to add it here. :)

@lavalamp
Copy link
Member

/lgtm
/approve

Let's get this in before anyone thinks of another thing to change. There can always be a follow-up!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 24, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: evankanderson, lavalamp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 24, 2020
@lavalamp
Copy link
Member

Are we tired of swinging at this, or would a subsequent PR along the lines of smarterclayton's comment be a good follow-on?

I bet we can do a followup in less time than this PR took :)

@k8s-ci-robot k8s-ci-robot merged commit 04531ba into kubernetes:master Sep 24, 2020
Copy link
Member

@thockin thockin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still super torn on this. This PR has been reduced almost beyond arguability, but still...

I don't think we have a clear rule for when some datum wants to be a field or a condition. Half of the examples I see here would, IMO, be better as fields. So why are we suggesting conditions for them?

@lavalamp
Copy link
Member

I agree with that, but this PR has been pending for more than half a year and is an improvement over the existing text.

@dims
Copy link
Member

dims commented Sep 25, 2020

@thockin if you want to throw some outline of what you want to see, i can try to convince @evankanderson to do a follow up :)

@evankanderson
Copy link
Contributor Author

I'd be happy to schedule a chat with whomever is interested to try to work out some better / additional guidance.

I can say that we've found the structure of Conditions (short code plus human-readable text and the tri-state Status) very useful for communicating to humans, so I'd like to preserve that.

@dims
Copy link
Member

dims commented Sep 26, 2020

@evankanderson let's line it up for next sig-arch?

@thockin
Copy link
Member

thockin commented Sep 26, 2020 via email

@lavalamp
Copy link
Member

Practical reasons not to apply this to all fields: that go doesn't do templates (yet), and objects would get very large.

One practical reason to keep it a bool is that it's simple to express what you want to watch for with kubectl tools.

My current guidance would be, conditions are good for:

  • Top-level facts about the object,
  • Which can change over the lifetime of the object (i.e., could be a valid target to wait for),
  • And are not overly detailed (i.e., don't make a separate condition for the version of every image cached on the node),
  • And have a unique owner (i.e., each condition has a single setter, but it's ok to have different setters for different conditions).

I think the "too detailed" and "valid target to wait for" tests are subjective and require some taste.

I think there's a bit of an open question when multiple instances of the same binary want to vote/set a condition; apiserver does this in multiple places and I'm not sure we're super happy about it. @deads2k

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/developer-guide Issues or PRs related to the developer guide cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API conventions give incorrect Conditions advice