Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Making Gateway Status more descriptive #47

Merged
merged 1 commit into from
Feb 3, 2020

Conversation

robscott
Copy link
Member

This adds more detail to the Gateway Status implementation. It uses the Conditions pattern as suggested earlier by @youngnick. It could result in some duplication of status since both listener and route resources could potentially have their own status fields. Since those can all be different resources with different status attributes, it seems to make sense to provide a consistent set of status fields at this level.

/cc @bowei @youngnick

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 21, 2020
@robscott robscott force-pushed the gateway-status branch 3 times, most recently from 841f1b4 to 2b8bd7e Compare January 21, 2020 19:21
@jpeach
Copy link
Contributor

jpeach commented Jan 24, 2020

I agree that moving towards Conditions is an improvement to the API. I'd go further and suggest that the Conditions slice should be a direct field of the gateway status and contain both route and listener conditions:

type GatewayCondition struct {
        Type               ConditionType                    `json:"type"`
        Status             core.ConditionStatus             `json:"status"`
        Message            string                           `json:"message"`
        Reason             string                           `json:"reason"`
        LastTransitionTime metav1.Time                      `json:"lastTransitionTime"`
        Ref                []core.TypedLocalObjectReference `json:"ref"`
}

// GatewayStatus defines the observed state of Gateway
type GatewayStatus struct {
        Conditions []GatewayConditions `json:"conditions"`
}

I had thought that kapp inspected .status.conditions, but it turns out to not do that for unknown CRD types. However, it looks like kustomize and octant will benefit if we follow the .status.conditions convention here.

I think that we can reasonably represent Conditions for both routes and listeners in the same slice, though tools that want to inspect the gateway state in detail would need to switch on the Condition type. Compared to the separate array of Conditions, this suggestion could be more compact (we don't need a slice element for each route and listener), but it may be harder to determine which route or listener is problematic. It is likely that a single GatewayCondition would end up having fields that are only used for certain condition types (e.g. a ConditionInvalidAddress would hopefully populate an address field), which makes it harder to validate and reason about.

The Ref field in the sketch above is intended to indicate the object that raised the Condition. For a Route, it would be the ref to the route, for a listener, possibly the ref from the extension field. I haven't yet searched for prior art around this kind of reference, but hopefully there is some.

See also:

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 24, 2020
@danehans
Copy link
Contributor

@robscott as mentioned during yesterday's call, we need a way to summarize route status for scalability purposes. The docs should also be explicit about mandatory/optional fields.

@@ -179,8 +178,6 @@ type ListenerTLS struct {

// GatewayStatus defines the observed state of Gateway
type GatewayStatus struct {
// TODO overall status
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's important that overall GatewayStatus is defined in this PR to gain a full understanding of status. I'm using Pod status as a reference model and it feels like we currently have the scope of status reversed. Instead of surfacing conditions in the subresources, maybe conditions should be inGatewayStatus. For example:

status:
  conditions:
  - lastTransitionTime: "2020-01-27T17:13:49Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2020-01-27T17:13:49Z"
    status: "True"
    type: ListenersReady
  - lastTransitionTime: "2020-01-27T17:13:10Z"
    status: "True"
    type: GatewayScheduled
  listenerStatuses:
    - <listener[0] status>
  routeStatuses:
    - <route[0] status>

Should ListenerStatus and GatewayRouteStatus follow a similar approach to ContainerStatus? For example:

// RouteStatus represents the status of a route
type RouteStatus struct {
	// Each route in a namespace must have a unique name.
	NamespacedName types.NamespacedName
	// +optional
        // This is where ConditionInvalidRoute and ConditionNoSuchRoute can be used.
	State RouteState
	// Should a route have some type of readiness check?
	Ready bool
	// +optional
	RouteID string
}

// ListenerStatus represents the status of a listener.
type ListenerStatus struct {
	// Address bound on this listener.
	Address *ListenerAddress `json:"address"`
        // State reflects if the listener is good or bad (bad address, port, protocol, cert ref, etc..)
        // This is where ConditionInvalidAddress etc. can be used.
	// +optional
	State ListenerState
	// Should a listener have some type of readiness check?
	Ready bool
}

Copy link
Contributor

@youngnick youngnick Jan 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the general idea of having a status.Conditions set that indicates if there are problems with listeners and routes, but I think that, following the API conventions, they should probably be ListenersNotReady, RoutesNotReady, GatewayClassInvalid, and GatewayNotScheduled.

Edit: I used NotReady on purpose, since that means there is something wrong with at least one listener or route. It may be the case that some are working, but some are not. It's intended to be a note that you should check the specific stanzas for details. Naming could probably use some work though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great feedback, thanks! I think my latest update incorporates most of this, but let me know if I've missed anything.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that for things that are continuously reconciled, the "well-known" types from Deployments and ReplicaSets (and stateful sets and daemonsets) should be considered - primarily "Progressing" and "Available" (available may be more relevant here).

In general the pattern of kubectl wait --for=available is specifically designed to deal with the problem of "how do I know when my changes have been made available after an update" in concert with ObservedGeneration. I would recommend at least having an Available condition that matches the requested semantics (when an ingress controller has accepted or begun filling traffic) and ObservedGeneration int64 set by an ingress controller to indicate that the requested spec has been observed and that status conditions apply to that generation of the ingress.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of reusing these types, as well as any other relevant ones.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 30, 2020
@robscott
Copy link
Member Author

Thanks to everyone for the great feedback! I've pushed an update that makes some significant changes:

  1. Listeners have a new required name field.
  2. ListenerStatus is now linked using that name.
  3. RouteStatus is gone.
  4. A new top level Conditions list has been added.

For 1 and 2, I realized that linking by index would be very fragile. Listeners can always be added or removed, and that would make it difficult to keep status meaningful. The best example I can find of this pattern is that of ContainerStatus for Pods which also relies on a name attribute to link the status and the container. I know requiring a name on listeners is unfortunate, but it is important to provide a status for each listener that can be reliably tied together. I can't think of a better way to accomplish this, but I'm open to ideas.

For 3, it became clear that keeping status per route would be hard to scale. Unlike listeners, routes are just referring to other API resources, granular status can live on those resources.

4 helps some with 3 by providing a new top level set of conditions, including a way to indicate if there are any listeners or routes that are invalid. It also allows for new condition types that are specific to the Gateway as a whole.

Thanks again to everyone who left feedback on the first proposal here, please take another look whenever you get a chance.

@youngnick
Copy link
Contributor

/lgtm

Nice work. I agree that forcing listeners to have a name is a little annoying, but as you say, it's similar to the container in Pods, so it's not unknown. And allowing the ListenerStatus to be retrievable via name is excellent.

I think those Conditions will be great for tool-based parsing of the status.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 30, 2020
Copy link
Contributor

@jpeach jpeach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the new .status.conditions is a good improvement.

I did poke a bit at duck-typing the .status.conditions field so that it could contain a different types of Condition (that all shared the common fields), but whilst this is do-able in raw YAML/JSON, I could not find a way for the Kubernetes code generators to know about anything other than the base Condition type. Possibly the unions KEP could help.

api/v1alpha1/gateway_types.go Outdated Show resolved Hide resolved
// listeners os not ready.
ConditionListenersNotReady GatewayConditionType = "ListenersNotReady"
// ConditionInvalidListeners indicates that at least one of the specified
// routes is invalid.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/routes/listeners/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the ConditionInvalidListeners is raised, then there must also be a corresponding entry in .status.conditions.listeners.conditions. We should document that here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, thanks! I also added a similar requirement for the ListenersNotReady condition along with a new ListenerNotReady condition that can be set for specific listeners.

ConditionRoutesNotReady GatewayConditionType = "RoutesNotReady"
// ConditionInvalidRoutes indicates that at least one of the specified
// routes is invalid.
ConditionInvalidRoutes GatewayConditionType = "InvalidRoutes"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does the operator/tooling know which route was invalid? There probably should be some way to link this to the broken route that is more deterministic than scanning the Message. Maybe the same approach as with Listeners (specific route conditions in .status.routes.conditions) could work.

I think that it is fine to leave a TODO and defer this to a separate issue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah routes are tough to get right here. Theoretically all status for an individual route could exist on that API resource. The potential quantity of routes that could be contained within a Gateway makes me hesitant to add status for each one within the Gateway resource. It seems like we'd be at risk of hitting object size limits here. I think how we handle route status here should be dependent on what we determine is a reasonable limit for the number of routes that should be contained/referenced within a Gateway. I do agree that it would be nice to have status for all the things in one place, so if we can accomplish that, it would be good. Like you suggested, this may be better for a TODO and/or separate issue.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 30, 2020
@danehans
Copy link
Contributor

@howardjohn thanks for iterating on the PR. I had a few nits, otherwise it looks good to me.

docs-src/concepts.md Outdated Show resolved Hide resolved
api/v1alpha1/gateway_types.go Outdated Show resolved Hide resolved
api/v1alpha1/gateway_types.go Outdated Show resolved Hide resolved
api/v1alpha1/gateway_types.go Outdated Show resolved Hide resolved
api/v1alpha1/gateway_types.go Outdated Show resolved Hide resolved
api/v1alpha1/gateway_types.go Outdated Show resolved Hide resolved
api/v1alpha1/gateway_types.go Outdated Show resolved Hide resolved
api/v1alpha1/gateway_types.go Outdated Show resolved Hide resolved
docs-src/concepts.md Outdated Show resolved Hide resolved
Within GatewayStatus, Listeners will have status entries corresponding to their
name. Both GatewayStatus and ListenerStatus follow the conditions pattern used
elsewhere in Kubernetes. This is a list that includes a type of condition, the
status of that condition, and the last time this condition changed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"GatewayStatus", "Listener", and "ListenerStatus" need mark-up for consistency with the rest of the document: `GatewayStatus`, `Listener`, `ListenerStatus`.

Suggested rewording for this sentence: "Namely, the conditions field of GatewayStatus or ListenerStatus has a list of conditions, where each condition has a unique type, a status, the last time the condition changed, and optionally a reason and message with more details about the condition."

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I got all of these, but there were some I wasn't quite sure on. It looks like we have some inconsistency throughout around where we use backticks or not.

@evankanderson
Copy link
Contributor

(I found this from some discussion elsewhere about conditions.)

If you expect end-users to read your status.conditions, I'd suggest switching the polarity of your conditions to indicate positive statuses. This seems to be easier for users to understand, particularly when they are debugging issues. In Knative, we found that users tended to get upset when they read that their Revision was "Failed" (with value "False"); it's possible that they found the double-negative confusing.

When standardizing our Conditions, we deliberately switched to a "True" polarity because users won't read the value when they're upset that something may not be working. Remember, users may be in a charged emotional state and only processing about 1/4 of what they see on the screen, especially during critical times:

  • They're in a big demo
  • Their site is down
  • There's a security fix that needs to go out
  • It's 3pm and they're late to pick up their kids from daycare

If you're curious, we also wrote a duck-typing library for managing Conditions:

https://godoc.org/knative.dev/pkg/apis#ConditionManager

And documentation is here:

https://knative.dev/docs/serving/spec/knative-api-specification-1.0/#error-signalling

Another place where we ended up extending the Kubernetes convention is that we added a "Severity" field to allow passing back non-error warning information; this is useful for conditions which may be sub-optimal but not explicitly failed (e.g. a Gateway that uses a DNS name, but the DNS name is pointed to an unexpected IP).

@bowei
Copy link
Contributor

bowei commented Jan 30, 2020

Yes -- we have found the advice counter intuitive to follow. This should be raised to sig-arch as it seems like many resources are already non-compliant.

@smarterclayton
Copy link

The advice is wrong, we just haven't consolidated the case law down. Use positive conditions that mirror fundamental states of your resource, ideally matching patterns established in workload controllers (which have received the most TLC on this).

const (
// ConditionNoSuchGatewayClass indicates that the specified GatewayClass
// does not exist.
ConditionNoSuchGatewayClass GatewayConditionType = "NoSuchGatewayClass"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some of these that apply to Gateway itself and not some subset of related things (e.g. Listener, Route) I'm just looking for some help understanding conventions...

It seems to be enumerating possible states which abstractly represent unavailability. How does this approach of representing them as condition types compare to representing them as reasons associated with a general "Unavailable" condition type? One reason might be because the Gateway is unavailable for multiple reasons.

However, even if using multiple condition types is preferable, does there still need to be a higher level condition that represents, in aggregate, "the gateway is unavailable"? Otherwise, which permutation of condition types on the Gateway represents total/general unavailability and how would users know that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's that by defining types we are making it clear what the list of conditions should be. The types are just a proxy for defining the Reason field anyway. Having the types means that we are constructing a canonical list of Reasons that the API defines. I don't know if this is the best way to do it, but sounds like this is pretty up in the air (from Clayton's comment earlier), so we will be also defining the best practice to an extent in this repo.

How does that interact with 'Use positive conditions that mirror fundamental states of your resource'? I'm not sure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Nick. To be clear, it sounds like this doesn't translate into any specific requirement/contract for implementations (i.e. "implementations must publish [condition set]"), but they should implement them consistently?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so.

@jpeach
Copy link
Contributor

jpeach commented Jan 30, 2020 via email

@youngnick
Copy link
Contributor

@smarterclayton, it seemed to me like part of the intent with having conditions be negative was that if the conditions slice was empty, then everything was okay. I can see that this is a slightly weird way of doing things, and I understand @evankanderson's point about the UX. Is this something that needs more discussion, a KEP or something, or something decided already?

If it's all decided, does the api conventions document need updating? I can have a crack at making it less confusing if need be. That will be useful for this project in that it will prevent arguments about matching the conventions doc before they happen.

@robscott
Copy link
Member Author

Thanks to everyone for the great feedback here! I think this PR has reached a "good enough for now" state, with the idea that we can make changes in the future as we find the shortcomings of this approach. A change that will likely come in the future is a transition to positive conditions. My personal preference is to wait for sig-arch to formalize their updated opinion on this in the docs before we make that switch here.

@youngnick
Copy link
Contributor

As I said on the call, if positive conditions are the go, I would like to switch as soon as possible. But, I'm in agreement that we should get this merged first, and switch them later. This is all alpha anyway.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 3, 2020
@smarterclayton
Copy link

@smarterclayton, it seemed to me like part of the intent with having conditions be negative was that if the conditions slice was empty, then everything was okay. I can see that this is a slightly weird way of doing things, and I understand @evankanderson's point about the UX. Is this something that needs more discussion, a KEP or something, or something decided already?

In general, if you can have a positive or negative condition, try to focus on positive conditions first, then if you need additional special case stuff negative conditions are ok, but not necessarily encourage. Trying to find the most top level generic "positive condition" is a useful exercise of ensuring you understand what data your user cases about (exceptional cases are by definition not what users care about).

@smarterclayton
Copy link

In general an empty conditions slice is what I would consider a bug. You should report at least one positive condition that indicates when you are meeting user intent and when not. Certainly in api review I would hold approval on an empty conditions array.

@bowei
Copy link
Contributor

bowei commented Feb 3, 2020

/approve
/lgtm

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bowei, robscott

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 3, 2020
@k8s-ci-robot k8s-ci-robot merged commit 12ca3b8 into kubernetes-sigs:master Feb 3, 2020
@robscott robscott deleted the gateway-status branch January 8, 2022 01:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.