Making Gateway Status more descriptive #47

robscott · 2020-01-21T18:50:05Z

This adds more detail to the Gateway Status implementation. It uses the Conditions pattern as suggested earlier by @youngnick. It could result in some duplication of status since both listener and route resources could potentially have their own status fields. Since those can all be different resources with different status attributes, it seems to make sense to provide a consistent set of status fields at this level.

/cc @bowei @youngnick

api/v1alpha1/gateway_types.go

jpeach · 2020-01-24T06:41:13Z

I agree that moving towards Conditions is an improvement to the API. I'd go further and suggest that the Conditions slice should be a direct field of the gateway status and contain both route and listener conditions:

type GatewayCondition struct {
        Type               ConditionType                    `json:"type"`
        Status             core.ConditionStatus             `json:"status"`
        Message            string                           `json:"message"`
        Reason             string                           `json:"reason"`
        LastTransitionTime metav1.Time                      `json:"lastTransitionTime"`
        Ref                []core.TypedLocalObjectReference `json:"ref"`
}

// GatewayStatus defines the observed state of Gateway
type GatewayStatus struct {
        Conditions []GatewayCondition `json:"conditions"`
}

I had thought that kapp inspected .status.conditions, but it turns out to not do that for unknown CRD types. However, it looks like kustomize and octant will benefit if we follow the .status.conditions convention here.

I think that we can reasonably represent Conditions for both routes and listeners in the same slice, though tools that want to inspect the gateway state in detail would need to switch on the Condition type. Compared to the separate array of Conditions, this suggestion could be more compact (we don't need a slice element for each route and listener), but it may be harder to determine which route or listener is problematic. It is likely that a single GatewayCondition would end up having fields that are only used for certain condition types (e.g. a ConditionInvalidAddress would hopefully populate an address field), which makes it harder to validate and reason about.

The Ref field in the sketch above is intended to indicate the object that raised the Condition. For a Route, it would be the ref to the route, for a listener, possibly the ref from the extension field. I haven't yet searched for prior art around this kind of reference, but hopefully there is some.
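To make the trade-off concrete, here is a minimal sketch of how a tool might consume a single shared conditions slice. It assumes the negative polarity used in this PR (a status of "True" means a problem). `ObjectRef` is a hypothetical stand-in for `core.TypedLocalObjectReference`, and the `HTTPRoute` and `Listener` kinds are illustrative only.

```go
package main

import "fmt"

// ObjectRef is a hypothetical stand-in for core.TypedLocalObjectReference.
type ObjectRef struct {
	Kind string
	Name string
}

// GatewayCondition mirrors the sketch above, trimmed to the fields used here.
type GatewayCondition struct {
	Type    string
	Status  string // "True", "False" or "Unknown"
	Message string
	Ref     []ObjectRef
}

// problemsByKind collects the names of referenced objects for every condition
// whose status is "True", grouped by the kind of object that raised it.
func problemsByKind(conds []GatewayCondition) map[string][]string {
	out := map[string][]string{}
	for _, c := range conds {
		if c.Status != "True" {
			continue
		}
		for _, r := range c.Ref {
			out[r.Kind] = append(out[r.Kind], r.Name)
		}
	}
	return out
}

func main() {
	conds := []GatewayCondition{
		{Type: "InvalidRoute", Status: "True", Ref: []ObjectRef{{Kind: "HTTPRoute", Name: "web"}}},
		{Type: "InvalidAddress", Status: "True", Ref: []ObjectRef{{Kind: "Listener", Name: "https"}}},
		{Type: "NoSuchRoute", Status: "False", Ref: []ObjectRef{{Kind: "HTTPRoute", Name: "api"}}},
	}
	fmt.Println(problemsByKind(conds))
}
```

Without a field like Ref, the same tool would have to parse Message to find the offending object.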

See also:

danehans · 2020-01-24T17:40:47Z

@robscott as mentioned during yesterday's call, we need a way to summarize route status for scalability purposes. The docs should also be explicit about mandatory/optional fields.

api/v1alpha1/gateway_types.go

danehans · 2020-01-28T18:38:40Z

api/v1alpha1/gateway_types.go

@@ -179,8 +178,6 @@ type ListenerTLS struct {

 // GatewayStatus defines the observed state of Gateway
 type GatewayStatus struct {
-	// TODO overall status


I think it's important that overall GatewayStatus is defined in this PR to gain a full understanding of status. I'm using Pod status as a reference model and it feels like we currently have the scope of status reversed. Instead of surfacing conditions in the subresources, maybe conditions should be in GatewayStatus. For example:

status:
  conditions:
  - lastTransitionTime: "2020-01-27T17:13:49Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2020-01-27T17:13:49Z"
    status: "True"
    type: ListenersReady
  - lastTransitionTime: "2020-01-27T17:13:10Z"
    status: "True"
    type: GatewayScheduled
  listenerStatuses:
  - <listener[0] status>
  routeStatuses:
  - <route[0] status>

Should ListenerStatus and GatewayRouteStatus follow a similar approach to ContainerStatus? For example:

// RouteStatus represents the status of a route
type RouteStatus struct {
        // Each route in a namespace must have a unique name.
        NamespacedName types.NamespacedName
        // +optional
        // This is where ConditionInvalidRoute and ConditionNoSuchRoute can be used.
        State RouteState
        // Should a route have some type of readiness check?
        Ready bool
        // +optional
        RouteID string
}

// ListenerStatus represents the status of a listener.
type ListenerStatus struct {
        // Address bound on this listener.
        Address *ListenerAddress `json:"address"`
        // State reflects if the listener is good or bad (bad address, port, protocol, cert ref, etc..)
        // This is where ConditionInvalidAddress etc. can be used.
        // +optional
        State ListenerState
        // Should a listener have some type of readiness check?
        Ready bool
}

I agree with the general idea of having a status.Conditions set that indicates if there are problems with listeners and routes, but I think that, following the API conventions, they should probably be ListenersNotReady, RoutesNotReady, GatewayClassInvalid, and GatewayNotScheduled.

Edit: I used NotReady on purpose, since that means there is something wrong with at least one listener or route. It may be the case that some are working, but some are not. It's intended to be a note that you should check the specific stanzas for details. Naming could probably use some work though.

This is great feedback, thanks! I think my latest update incorporates most of this, but let me know if I've missed anything.

Note that for things that are continuously reconciled, the "well-known" types from Deployments and ReplicaSets (and stateful sets and daemonsets) should be considered - primarily "Progressing" and "Available" (available may be more relevant here).

In general the pattern of kubectl wait --for=available is specifically designed to deal with the problem of "how do I know when my changes have been made available after an update" in concert with ObservedGeneration. I would recommend at least having an Available condition that matches the requested semantics (when an ingress controller has accepted or begun filling traffic) and ObservedGeneration int64 set by an ingress controller to indicate that the requested spec has been observed and that status conditions apply to that generation of the ingress.
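As a rough illustration of how a client could combine the two signals described here, the sketch below treats a Gateway as available only when the controller has observed the current generation and reports Available as "True". The `Gateway` struct is a hypothetical, heavily trimmed stand-in, not the type from this PR.

```go
package main

import "fmt"

// Condition is a trimmed condition with only the fields used here.
type Condition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

// Gateway is a hypothetical, flattened view of the fields that matter for
// this check; the real resource keeps them under metadata and status.
type Gateway struct {
	Generation         int64 // metadata.generation, bumped on spec changes
	ObservedGeneration int64 // status.observedGeneration, set by the controller
	Conditions         []Condition
}

// isAvailable reports whether the controller has observed the latest spec
// and reports Available=True for it.
func isAvailable(gw Gateway) bool {
	if gw.ObservedGeneration < gw.Generation {
		return false // status still describes an older spec
	}
	for _, c := range gw.Conditions {
		if c.Type == "Available" {
			return c.Status == "True"
		}
	}
	return false // no Available condition reported yet
}

func main() {
	stale := Gateway{Generation: 3, ObservedGeneration: 2,
		Conditions: []Condition{{Type: "Available", Status: "True"}}}
	current := Gateway{Generation: 3, ObservedGeneration: 3,
		Conditions: []Condition{{Type: "Available", Status: "True"}}}
	fmt.Println(isAvailable(stale), isAvailable(current))
}
```

The generation check is what keeps a leftover Available=True from a previous spec from being read as success for the new one.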

I like the idea of reusing these types, as well as any other relevant ones.

robscott · 2020-01-30T01:40:39Z

Thanks to everyone for the great feedback! I've pushed an update that makes some significant changes:

1. Listeners have a new required name field.
2. ListenerStatus is now linked using that name.
3. RouteStatus is gone.
4. A new top level Conditions list has been added.

For 1 and 2, I realized that linking by index would be very fragile. Listeners can always be added or removed, and that would make it difficult to keep status meaningful. The best example I can find of this pattern is that of ContainerStatus for Pods which also relies on a name attribute to link the status and the container. I know requiring a name on listeners is unfortunate, but it is important to provide a status for each listener that can be reliably tied together. I can't think of a better way to accomplish this, but I'm open to ideas.
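A small sketch of why index-based linking is fragile, using a hypothetical trimmed `ListenerStatus` (only a name and a readiness flag, which is an assumption and not the type in this PR). After a listener is removed from the spec, index 0 of the old status no longer describes index 0 of the spec, while a lookup by name still finds the right entry.

```go
package main

import "fmt"

// ListenerStatus is a hypothetical, trimmed status entry keyed by listener name.
type ListenerStatus struct {
	Name  string
	Ready bool
}

// statusFor returns the status entry whose Name matches, if any.
func statusFor(name string, statuses []ListenerStatus) (ListenerStatus, bool) {
	for _, s := range statuses {
		if s.Name == name {
			return s, true
		}
	}
	return ListenerStatus{}, false
}

func main() {
	// Status written while the spec still listed "http" and then "https".
	statuses := []ListenerStatus{{Name: "http", Ready: true}, {Name: "https", Ready: false}}

	// The spec is then edited so that "https" is the only (first) listener.
	specListeners := []string{"https"}

	byIndex := statuses[0] // index 0 still describes "http"
	byName, ok := statusFor(specListeners[0], statuses)
	fmt.Println(byIndex.Name, byName.Name, ok)
}
```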

For 3, it became clear that keeping status per route would be hard to scale. Unlike listeners, routes just refer to other API resources, so granular status can live on those resources.

4 helps some with 3 by providing a new top level set of conditions, including a way to indicate if there are any listeners or routes that are invalid. It also allows for new condition types that are specific to the Gateway as a whole.

Thanks again to everyone who left feedback on the first proposal here, please take another look whenever you get a chance.

youngnick · 2020-01-30T03:11:49Z

/lgtm

Nice work. I agree that forcing listeners to have a name is a little annoying, but as you say, it's similar to the container in Pods, so it's not unknown. And allowing the ListenerStatus to be retrievable via name is excellent.

I think those Conditions will be great for tool-based parsing of the status.

jpeach

I think that the new .status.conditions is a good improvement.

I did poke a bit at duck-typing the .status.conditions field so that it could contain different types of Condition (that all shared the common fields), but whilst this is do-able in raw YAML/JSON, I could not find a way for the Kubernetes code generators to know about anything other than the base Condition type. Possibly the unions KEP could help.

api/v1alpha1/gateway_types.go

jpeach · 2020-01-30T05:14:58Z

api/v1alpha1/gateway_types.go

+	// listeners os not ready.
+	ConditionListenersNotReady GatewayConditionType = "ListenersNotReady"
+	// ConditionInvalidListeners indicates that at least one of the specified
+	// routes is invalid.


s/routes/listeners/

If the ConditionInvalidListeners is raised, then there must also be a corresponding entry in .status.conditions.listeners.conditions. We should document that here.

Good call, thanks! I also added a similar requirement for the ListenersNotReady condition along with a new ListenerNotReady condition that can be set for specific listeners.
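The requirement discussed in this thread could be checked mechanically. The sketch below is one possible form of that check, assuming a hypothetical trimmed `GatewayStatus` with a top-level conditions list and per-listener conditions: if ListenersNotReady is "True" at the top level, at least one listener must carry ListenerNotReady.

```go
package main

import "fmt"

// Condition is a trimmed condition with only the fields used here.
type Condition struct {
	Type   string
	Status string
}

// ListenerStatus and GatewayStatus are hypothetical, trimmed shapes.
type ListenerStatus struct {
	Name       string
	Conditions []Condition
}

type GatewayStatus struct {
	Conditions []Condition
	Listeners  []ListenerStatus
}

// hasTrue reports whether conds contains condition type t with status "True".
func hasTrue(conds []Condition, t string) bool {
	for _, c := range conds {
		if c.Type == t && c.Status == "True" {
			return true
		}
	}
	return false
}

// checkListenerConsistency returns an error when the summary condition is
// raised without any listener-level condition to explain it.
func checkListenerConsistency(s GatewayStatus) error {
	if !hasTrue(s.Conditions, "ListenersNotReady") {
		return nil
	}
	for _, l := range s.Listeners {
		if hasTrue(l.Conditions, "ListenerNotReady") {
			return nil
		}
	}
	return fmt.Errorf("ListenersNotReady is set but no listener reports ListenerNotReady")
}

func main() {
	bad := GatewayStatus{Conditions: []Condition{{Type: "ListenersNotReady", Status: "True"}}}
	fmt.Println(checkListenerConsistency(bad))
}
```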

jpeach · 2020-01-30T05:30:15Z

api/v1alpha1/gateway_types.go

+	ConditionRoutesNotReady GatewayConditionType = "RoutesNotReady"
+	// ConditionInvalidRoutes indicates that at least one of the specified
+	// routes is invalid.
+	ConditionInvalidRoutes GatewayConditionType = "InvalidRoutes"


How does the operator/tooling know which route was invalid? There probably should be some way to link this to the broken route that is more deterministic than scanning the Message. Maybe the same approach as with Listeners (specific route conditions in .status.routes.conditions) could work.

I think that it is fine to leave a TODO and defer this to a separate issue.

Yeah routes are tough to get right here. Theoretically all status for an individual route could exist on that API resource. The potential quantity of routes that could be contained within a Gateway makes me hesitant to add status for each one within the Gateway resource. It seems like we'd be at risk of hitting object size limits here. I think how we handle route status here should be dependent on what we determine is a reasonable limit for the number of routes that should be contained/referenced within a Gateway. I do agree that it would be nice to have status for all the things in one place, so if we can accomplish that, it would be good. Like you suggested, this may be better for a TODO and/or separate issue.

danehans · 2020-01-30T18:41:56Z

@howardjohn thanks for iterating on the PR. I had a few nits, otherwise it looks good to me.

docs-src/concepts.md

api/v1alpha1/gateway_types.go

docs-src/concepts.md

Miciah · 2020-01-30T21:05:50Z

docs-src/concepts.md

+Within GatewayStatus, Listeners will have status entries corresponding to their
+name. Both GatewayStatus and ListenerStatus follow the conditions pattern used
+elsewhere in Kubernetes. This is a list that includes a type of condition, the
+status of that condition, and the last time this condition changed.


"GatewayStatus", "Listener", and "ListenerStatus" need mark-up for consistency with the rest of the document: `GatewayStatus`, `Listener`, `ListenerStatus`.

Suggested rewording for this sentence: "Namely, the conditions field of GatewayStatus or ListenerStatus has a list of conditions, where each condition has a unique type, a status, the last time the condition changed, and optionally a reason and message with more details about the condition."

I think I got all of these, but there were some I wasn't quite sure on. It looks like we have some inconsistency throughout around where we use backticks or not.

evankanderson · 2020-01-30T21:11:38Z

(I found this from some discussion elsewhere about conditions.)

If you expect end-users to read your status.conditions, I'd suggest switching the polarity of your conditions to indicate positive statuses. This seems to be easier for users to understand, particularly when they are debugging issues. In Knative, we found that users tended to get upset when they read that their Revision was "Failed" (with value "False"); it's possible that they found the double-negative confusing.

When standardizing our Conditions, we deliberately switched to a "True" polarity because users won't read the value when they're upset that something may not be working. Remember, users may be in a charged emotional state and only processing about 1/4 of what they see on the screen, especially during critical times:

They're in a big demo
Their site is down
There's a security fix that needs to go out
It's 3pm and they're late to pick up their kids from daycare

If you're curious, we also wrote a duck-typing library for managing Conditions:

https://godoc.org/knative.dev/pkg/apis#ConditionManager

And documentation is here:

https://knative.dev/docs/serving/spec/knative-api-specification-1.0/#error-signalling

Another place where we ended up extending the Kubernetes convention is that we added a "Severity" field to allow passing back non-error warning information; this is useful for conditions which may be sub-optimal but not explicitly failed (e.g. a Gateway that uses a DNS name, but the DNS name is pointed to an unexpected IP).
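To illustrate the two ideas in this comment together, here is a minimal sketch of positive-polarity conditions with a severity. The type, constant, and condition names are hypothetical and only loosely follow the description above; they are not Knative's actual API.

```go
package main

import "fmt"

// Severity is a hypothetical field describing how serious a non-True condition is.
type Severity string

const (
	SeverityError   Severity = "Error"
	SeverityWarning Severity = "Warning"
)

// Condition uses positive polarity: Status "True" means the thing named by
// Type is working as intended.
type Condition struct {
	Type     string
	Status   string // "True", "False" or "Unknown"
	Severity Severity
	Message  string
}

// describe renders a condition roughly the way a user would read it.
func describe(c Condition) string {
	switch {
	case c.Status == "True":
		return c.Type + ": ok"
	case c.Status == "Unknown":
		return c.Type + ": unknown"
	case c.Severity == SeverityWarning:
		return c.Type + ": warning: " + c.Message
	default:
		return c.Type + ": failed: " + c.Message
	}
}

func main() {
	fmt.Println(describe(Condition{Type: "Ready", Status: "True"}))
	fmt.Println(describe(Condition{Type: "DNSResolved", Status: "False",
		Severity: SeverityWarning, Message: "name points to an unexpected IP"}))
}
```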

bowei · 2020-01-30T21:25:22Z

Yes -- we have found the advice counterintuitive to follow. This should be raised to sig-arch as it seems like many resources are already non-compliant.

smarterclayton · 2020-01-30T21:43:06Z

The advice is wrong, we just haven't consolidated the case law down. Use positive conditions that mirror fundamental states of your resource, ideally matching patterns established in workload controllers (which have received the most TLC on this).

ironcladlou · 2020-01-30T21:59:32Z

api/v1alpha1/gateway_types.go

+const (
+	// ConditionNoSuchGatewayClass indicates that the specified GatewayClass
+	// does not exist.
+	ConditionNoSuchGatewayClass GatewayConditionType = "NoSuchGatewayClass"


For some of these that apply to Gateway itself and not some subset of related things (e.g. Listener, Route) I'm just looking for some help understanding conventions...

It seems to be enumerating possible states which abstractly represent unavailability. How does this approach of representing them as condition types compare to representing them as reasons associated with a general "Unavailable" condition type? One reason might be because the Gateway is unavailable for multiple reasons.

However, even if using multiple condition types is preferable, does there still need to be a higher level condition that represents, in aggregate, "the gateway is unavailable"? Otherwise, which permutation of condition types on the Gateway represents total/general unavailability and how would users know that?

I think it's that by defining types we are making it clear what the list of conditions should be. The types are just a proxy for defining the Reason field anyway. Having the types means that we are constructing a canonical list of Reasons that the API defines. I don't know if this is the best way to do it, but sounds like this is pretty up in the air (from Clayton's comment earlier), so we will be also defining the best practice to an extent in this repo.

How does that interact with 'Use positive conditions that mirror fundamental states of your resource'? I'm not sure.

Thanks, Nick. To be clear, it sounds like this doesn't translate into any specific requirement/contract for implementations (i.e. "implementations must publish [condition set]"), but they should implement them consistently?

I think so.
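One way to picture the aggregate condition raised in this thread: keep the specific condition types, and derive a single summary from them. The sketch below is an assumption about how that could look, not something the PR defines. The `Available` type, the `AllChecksPassed` reason, and the message format are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Condition is a trimmed condition with only the fields used here.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// aggregate folds negative-polarity conditions (Status "True" means a
// problem) into one hypothetical positive "Available" condition. The first
// failing type becomes the Reason; all failing types go into the Message.
func aggregate(conds []Condition) Condition {
	var failing []string
	for _, c := range conds {
		if c.Status == "True" {
			failing = append(failing, c.Type)
		}
	}
	if len(failing) == 0 {
		return Condition{Type: "Available", Status: "True", Reason: "AllChecksPassed"}
	}
	return Condition{
		Type:    "Available",
		Status:  "False",
		Reason:  failing[0],
		Message: "failing checks: " + strings.Join(failing, ", "),
	}
}

func main() {
	conds := []Condition{
		{Type: "NoSuchGatewayClass", Status: "True"},
		{Type: "ListenersNotReady", Status: "True"},
		{Type: "RoutesNotReady", Status: "False"},
	}
	fmt.Printf("%+v\n", aggregate(conds))
}
```

This keeps the answer to "is the gateway unavailable?" in one place while still allowing several causes to be reported at once.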

jpeach · 2020-01-30T23:12:37Z

On Jan 31, 2020, at 8:43 AM, Clayton Coleman wrote:

The advice is wrong, we just haven't consolidated the case law down. Use positive conditions that mirror fundamental states of your resource, ideally matching patterns established in workload controllers (which have received the most TLC on this).

This seems like a serious issue. Everyone has to agree on how to handle the polarity of Conditions for them to be interoperable. How are projects supposed to know they should do the opposite of the official advice? I filed kubernetes/community#4471.

youngnick · 2020-01-30T23:13:09Z

@smarterclayton, it seemed to me like part of the intent with having conditions be negative was that if the conditions slice was empty, then everything was okay. I can see that this is a slightly weird way of doing things, and I understand @evankanderson's point about the UX. Is this something that needs more discussion, a KEP or something, or something decided already?

If it's all decided, does the api conventions document need updating? I can have a crack at making it less confusing if need be. That will be useful for this project in that it will prevent arguments about matching the conventions doc before they happen.

robscott · 2020-01-31T21:27:45Z

Thanks to everyone for the great feedback here! I think this PR has reached a "good enough for now" state, with the idea that we can make changes in the future as we find the shortcomings of this approach. A change that will likely come in the future is a transition to positive conditions. My personal preference is to wait for sig-arch to formalize their updated opinion on this in the docs before we make that switch here.

youngnick · 2020-02-03T01:58:42Z

As I said on the call, if positive conditions are the go, I would like to switch as soon as possible. But, I'm in agreement that we should get this merged first, and switch them later. This is all alpha anyway.

/lgtm

smarterclayton · 2020-02-03T21:28:23Z

@smarterclayton, it seemed to me like part of the intent with having conditions be negative was that if the conditions slice was empty, then everything was okay. I can see that this is a slightly weird way of doing things, and I understand @evankanderson's point about the UX. Is this something that needs more discussion, a KEP or something, or something decided already?

In general, if you can have a positive or negative condition, try to focus on positive conditions first, then if you need additional special case stuff negative conditions are ok, but not necessarily encouraged. Trying to find the most top level generic "positive condition" is a useful exercise of ensuring you understand what data your user cares about (exceptional cases are by definition not what users care about).

smarterclayton · 2020-02-03T21:30:21Z

In general an empty conditions slice is what I would consider a bug. You should report at least one positive condition that indicates when you are meeting user intent and when not. Certainly in api review I would hold approval on an empty conditions array.
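A minimal sketch of how a controller might avoid ever publishing an empty conditions slice. It assumes a hypothetical positive `Ready` condition that starts as "Unknown" and is updated in place once the controller has reconciled. The reason strings are illustrative only.

```go
package main

import "fmt"

// Condition is a trimmed condition with only the fields used here.
type Condition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
	Reason string
}

// setCondition replaces the condition with the same Type, or appends it.
func setCondition(conds []Condition, c Condition) []Condition {
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

// ensureReady guarantees that a Ready condition is always present, using
// "Unknown" until the controller has something definite to report.
func ensureReady(conds []Condition) []Condition {
	for _, c := range conds {
		if c.Type == "Ready" {
			return conds
		}
	}
	return append(conds, Condition{Type: "Ready", Status: "Unknown", Reason: "NotReconciled"})
}

func main() {
	conds := ensureReady(nil)
	fmt.Println(conds)
	conds = setCondition(conds, Condition{Type: "Ready", Status: "True", Reason: "Reconciled"})
	fmt.Println(conds)
}
```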

bowei · 2020-02-03T22:10:48Z

/approve
/lgtm

k8s-ci-robot · 2020-02-03T22:11:01Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bowei, robscott

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [bowei]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot requested review from bowei and youngnick January 21, 2020 18:50

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 21, 2020

robscott force-pushed the gateway-status branch 3 times, most recently from 841f1b4 to 2b8bd7e Compare January 21, 2020 19:21

howardjohn reviewed Jan 22, 2020

robscott force-pushed the gateway-status branch from 2b8bd7e to 2bbba23 Compare January 22, 2020 20:10

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 24, 2020

danehans reviewed Jan 24, 2020

danehans reviewed Jan 24, 2020

danehans reviewed Jan 24, 2020

danehans reviewed Jan 28, 2020

danehans reviewed Jan 28, 2020

danehans reviewed Jan 28, 2020

danehans reviewed Jan 28, 2020

danehans reviewed Jan 28, 2020

robscott force-pushed the gateway-status branch from 2bbba23 to 6bcd52a Compare January 30, 2020 01:23

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 30, 2020

robscott force-pushed the gateway-status branch from 6bcd52a to 20532fa Compare January 30, 2020 01:45

k8s-ci-robot assigned youngnick Jan 30, 2020

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 30, 2020

jpeach reviewed Jan 30, 2020

robscott force-pushed the gateway-status branch from 20532fa to 5bc6b0e Compare January 30, 2020 18:29

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 30, 2020

robscott force-pushed the gateway-status branch from 5bc6b0e to b9a88a3 Compare January 30, 2020 19:00

Miciah reviewed Jan 30, 2020

ironcladlou reviewed Jan 30, 2020

jpeach mentioned this pull request Jan 30, 2020

API conventions give incorrect Conditions advice kubernetes/community#4471

Closed

robscott force-pushed the gateway-status branch from b9a88a3 to 6aec3b1 Compare January 31, 2020 20:24

Making Gateway Status more descriptive

e4d3627

robscott force-pushed the gateway-status branch from 6aec3b1 to e4d3627 Compare January 31, 2020 20:35

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 3, 2020

k8s-ci-robot assigned bowei Feb 3, 2020

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 3, 2020

k8s-ci-robot merged commit 12ca3b8 into kubernetes-sigs:master Feb 3, 2020

evankanderson mentioned this pull request Feb 14, 2020

Update Condition guidance kubernetes/community#4521

Merged

danehans mentioned this pull request Feb 17, 2020

Reverses Gateway Condition Polarity #104

Closed

robscott deleted the gateway-status branch January 8, 2022 01:02
