-
Notifications
You must be signed in to change notification settings - Fork 497
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Making Gateway Status more descriptive #47
Conversation
841f1b4
to
2b8bd7e
Compare
2b8bd7e
to
2bbba23
Compare
I agree that moving towards Conditions is an improvement to the API. I'd go further and suggest that the Conditions slice should be a direct field of the gateway status and contain both route and listener conditions: type GatewayCondition struct {
Type ConditionType `json:"type"`
Status core.ConditionStatus `json:"status"`
Message string `json:"message"`
Reason string `json:"reason"`
LastTransitionTime metav1.Time `json:"lastTransitionTime"`
Ref []core.TypedLocalObjectReference `json:"ref"`
}
// GatewayStatus defines the observed state of Gateway
type GatewayStatus struct {
Conditions []GatewayConditions `json:"conditions"`
} I had thought that kapp inspected I think that we can reasonably represent Conditions for both routes and listeners in the same slice, though tools that want to inspect the gateway state in detail would need to switch on the Condition type. Compared to the separate array of Conditions, this suggestion could be more compact (we don't need a slice element for each route and listener), but it may be harder to determine which route or listener is problematic. It is likely that a single GatewayCondition would end up having fields that are only used for certain condition types (e.g. a The See also:
|
@robscott as mentioned during yesterday's call, we need a way to summarize route status for scalability purposes. The docs should also be explicit about mandatory/optional fields. |
@@ -179,8 +178,6 @@ type ListenerTLS struct { | |||
|
|||
// GatewayStatus defines the observed state of Gateway | |||
type GatewayStatus struct { | |||
// TODO overall status |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's important that overall GatewayStatus
is defined in this PR to gain a full understanding of status. I'm using Pod
status as a reference model and it feels like we currently have the scope of status reversed. Instead of surfacing conditions in the subresources, maybe conditions should be inGatewayStatus
. For example:
status:
conditions:
- lastTransitionTime: "2020-01-27T17:13:49Z"
status: "True"
type: Ready
- lastTransitionTime: "2020-01-27T17:13:49Z"
status: "True"
type: ListenersReady
- lastTransitionTime: "2020-01-27T17:13:10Z"
status: "True"
type: GatewayScheduled
listenerStatuses:
- <listener[0] status>
routeStatuses:
- <route[0] status>
Should ListenerStatus
and GatewayRouteStatus
follow a similar approach to ContainerStatus? For example:
// RouteStatus represents the status of a route
type RouteStatus struct {
// Each route in a namespace must have a unique name.
NamespacedName types.NamespacedName
// +optional
// This is where ConditionInvalidRoute and ConditionNoSuchRoute can be used.
State RouteState
// Should a route have some type of readiness check?
Ready bool
// +optional
RouteID string
}
// ListenerStatus represents the status of a listener.
type ListenerStatus struct {
// Address bound on this listener.
Address *ListenerAddress `json:"address"`
// State reflects if the listener is good or bad (bad address, port, protocol, cert ref, etc..)
// This is where ConditionInvalidAddress etc. can be used.
// +optional
State ListenerState
// Should a listener have some type of readiness check?
Ready bool
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the general idea of having a status.Conditions set that indicates if there are problems with listeners and routes, but I think that, following the API conventions, they should probably be ListenersNotReady
, RoutesNotReady
, GatewayClassInvalid
, and GatewayNotScheduled
.
Edit: I used NotReady
on purpose, since that means there is something wrong with at least one listener or route. It may be the case that some are working, but some are not. It's intended to be a note that you should check the specific stanzas for details. Naming could probably use some work though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great feedback, thanks! I think my latest update incorporates most of this, but let me know if I've missed anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that for things that are continuously reconciled, the "well-known" types from Deployments and ReplicaSets (and stateful sets and daemonsets) should be considered - primarily "Progressing" and "Available" (available may be more relevant here).
In general the pattern of kubectl wait --for=available
is specifically designed to deal with the problem of "how do I know when my changes have been made available after an update" in concert with ObservedGeneration
. I would recommend at least having an Available
condition that matches the requested semantics (when an ingress controller has accepted or begun filling traffic) and ObservedGeneration int64
set by an ingress controller to indicate that the requested spec has been observed and that status conditions apply to that generation of the ingress.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of reusing these types, as well as any other relevant ones.
2bbba23
to
6bcd52a
Compare
Thanks to everyone for the great feedback! I've pushed an update that makes some significant changes:
For 1 and 2, I realized that linking by index would be very fragile. Listeners can always be added or removed, and that would make it difficult to keep status meaningful. The best example I can find of this pattern is that of ContainerStatus for Pods which also relies on a name attribute to link the status and the container. I know requiring a name on listeners is unfortunate, but it is important to provide a status for each listener that can be reliably tied together. I can't think of a better way to accomplish this, but I'm open to ideas. For 3, it became clear that keeping status per route would be hard to scale. Unlike listeners, routes are just referring to other API resources, granular status can live on those resources. 4 helps some with 3 by providing a new top level set of conditions, including a way to indicate if there are any listeners or routes that are invalid. It also allows for new condition types that are specific to the Gateway as a whole. Thanks again to everyone who left feedback on the first proposal here, please take another look whenever you get a chance. |
6bcd52a
to
20532fa
Compare
/lgtm Nice work. I agree that forcing listeners to have a name is a little annoying, but as you say, it's similar to the container in Pods, so it's not unknown. And allowing the ListenerStatus to be retrievable via name is excellent. I think those Conditions will be great for tool-based parsing of the status. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the new .status.conditions
is a good improvement.
I did poke a bit at duck-typing the .status.conditions
field so that it could contain a different types of Condition (that all shared the common fields), but whilst this is do-able in raw YAML/JSON, I could not find a way for the Kubernetes code generators to know about anything other than the base Condition type. Possibly the unions KEP could help.
api/v1alpha1/gateway_types.go
Outdated
// listeners os not ready. | ||
ConditionListenersNotReady GatewayConditionType = "ListenersNotReady" | ||
// ConditionInvalidListeners indicates that at least one of the specified | ||
// routes is invalid. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/routes/listeners/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the ConditionInvalidListeners
is raised, then there must also be a corresponding entry in .status.conditions.listeners.conditions
. We should document that here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, thanks! I also added a similar requirement for the ListenersNotReady condition along with a new ListenerNotReady condition that can be set for specific listeners.
ConditionRoutesNotReady GatewayConditionType = "RoutesNotReady" | ||
// ConditionInvalidRoutes indicates that at least one of the specified | ||
// routes is invalid. | ||
ConditionInvalidRoutes GatewayConditionType = "InvalidRoutes" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does the operator/tooling know which route was invalid? There probably should be some way to link this to the broken route that is more deterministic than scanning the Message
. Maybe the same approach as with Listeners (specific route conditions in .status.routes.conditions
) could work.
I think that it is fine to leave a TODO and defer this to a separate issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah routes are tough to get right here. Theoretically all status for an individual route could exist on that API resource. The potential quantity of routes that could be contained within a Gateway makes me hesitant to add status for each one within the Gateway resource. It seems like we'd be at risk of hitting object size limits here. I think how we handle route status here should be dependent on what we determine is a reasonable limit for the number of routes that should be contained/referenced within a Gateway. I do agree that it would be nice to have status for all the things in one place, so if we can accomplish that, it would be good. Like you suggested, this may be better for a TODO and/or separate issue.
20532fa
to
5bc6b0e
Compare
@howardjohn thanks for iterating on the PR. I had a few nits, otherwise it looks good to me. |
5bc6b0e
to
b9a88a3
Compare
docs-src/concepts.md
Outdated
Within GatewayStatus, Listeners will have status entries corresponding to their | ||
name. Both GatewayStatus and ListenerStatus follow the conditions pattern used | ||
elsewhere in Kubernetes. This is a list that includes a type of condition, the | ||
status of that condition, and the last time this condition changed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"GatewayStatus", "Listener", and "ListenerStatus" need mark-up for consistency with the rest of the document: `GatewayStatus`
, `Listener`
, `ListenerStatus`
.
Suggested rewording for this sentence: "Namely, the conditions
field of GatewayStatus
or ListenerStatus
has a list of conditions, where each condition has a unique type, a status, the last time the condition changed, and optionally a reason and message with more details about the condition."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I got all of these, but there were some I wasn't quite sure on. It looks like we have some inconsistency throughout around where we use backticks or not.
(I found this from some discussion elsewhere about conditions.) If you expect end-users to read your When standardizing our Conditions, we deliberately switched to a "True" polarity because users won't read the value when they're upset that something may not be working. Remember, users may be in a charged emotional state and only processing about 1/4 of what they see on the screen, especially during critical times:
If you're curious, we also wrote a duck-typing library for managing Conditions: https://godoc.org/knative.dev/pkg/apis#ConditionManager And documentation is here: https://knative.dev/docs/serving/spec/knative-api-specification-1.0/#error-signalling Another place where we ended up extending the Kubernetes convention is that we added a "Severity" field to allow passing back non-error warning information; this is useful for conditions which may be sub-optimal but not explicitly failed (e.g. a Gateway that uses a DNS name, but the DNS name is pointed to an unexpected IP). |
Yes -- we have found the advice counter intuitive to follow. This should be raised to sig-arch as it seems like many resources are already non-compliant. |
The advice is wrong, we just haven't consolidated the case law down. Use positive conditions that mirror fundamental states of your resource, ideally matching patterns established in workload controllers (which have received the most TLC on this). |
const ( | ||
// ConditionNoSuchGatewayClass indicates that the specified GatewayClass | ||
// does not exist. | ||
ConditionNoSuchGatewayClass GatewayConditionType = "NoSuchGatewayClass" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some of these that apply to Gateway itself and not some subset of related things (e.g. Listener, Route) I'm just looking for some help understanding conventions...
It seems to be enumerating possible states which abstractly represent unavailability. How does this approach of representing them as condition types compare to representing them as reasons associated with a general "Unavailable" condition type? One reason might be because the Gateway is unavailable for multiple reasons.
However, even if using multiple condition types is preferable, does there still need to be a higher level condition that represents, in aggregate, "the gateway is unavailable"? Otherwise, which permutation of condition types on the Gateway represents total/general unavailability and how would users know that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's that by defining types we are making it clear what the list of conditions should be. The types are just a proxy for defining the Reason
field anyway. Having the types means that we are constructing a canonical list of Reasons that the API defines. I don't know if this is the best way to do it, but sounds like this is pretty up in the air (from Clayton's comment earlier), so we will be also defining the best practice to an extent in this repo.
How does that interact with 'Use positive conditions that mirror fundamental states of your resource'? I'm not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Nick. To be clear, it sounds like this doesn't translate into any specific requirement/contract for implementations (i.e. "implementations must publish [condition set]"), but they should implement them consistently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so.
On Jan 31, 2020, at 8:43 AM, Clayton Coleman ***@***.***> wrote:
The advice is wrong, we just haven't consolidated the case law down. Use positive conditions that mirror fundamental states of your resource, ideally matching patterns established in workload controllers (which have received the most TLC on this).
This seems like a serious issue. Evenyone has to agree on how to handle the polarity of Conditions for them to be interoperable. How are projects supposed to know they should do the opposite of the official advice?
I filed kubernetes/community#4471
|
@smarterclayton, it seemed to me like part of the intent with having conditions be negative was that if the conditions slice was empty, then everything was okay. I can see that this is a slightly weird way of doing things, and I understand @evankanderson's point about the UX. Is this something that needs more discussion, a KEP or something, or something decided already? If it's all decided, does the api conventions document need updating? I can have a crack at making it less confusing if need be. That will be useful for this project in that it will prevent arguments about matching the conventions doc before they happen. |
b9a88a3
to
6aec3b1
Compare
6aec3b1
to
e4d3627
Compare
Thanks to everyone for the great feedback here! I think this PR has reached a "good enough for now" state, with the idea that we can make changes in the future as we find the shortcomings of this approach. A change that will likely come in the future is a transition to positive conditions. My personal preference is to wait for sig-arch to formalize their updated opinion on this in the docs before we make that switch here. |
As I said on the call, if positive conditions are the go, I would like to switch as soon as possible. But, I'm in agreement that we should get this merged first, and switch them later. This is all alpha anyway. /lgtm |
In general, if you can have a positive or negative condition, try to focus on positive conditions first, then if you need additional special case stuff negative conditions are ok, but not necessarily encourage. Trying to find the most top level generic "positive condition" is a useful exercise of ensuring you understand what data your user cases about (exceptional cases are by definition not what users care about). |
In general an empty conditions slice is what I would consider a bug. You should report at least one positive condition that indicates when you are meeting user intent and when not. Certainly in api review I would hold approval on an empty conditions array. |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bowei, robscott The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This adds more detail to the Gateway Status implementation. It uses the Conditions pattern as suggested earlier by @youngnick. It could result in some duplication of status since both listener and route resources could potentially have their own status fields. Since those can all be different resources with different status attributes, it seems to make sense to provide a consistent set of status fields at this level.
/cc @bowei @youngnick