Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GEP-1742: HTTPRoute Timeouts API #1997

Merged
merged 28 commits into from
Jun 26, 2023

Conversation

frankbu
Copy link
Contributor

@frankbu frankbu commented May 4, 2023

What type of PR is this?

/kind gep

What this PR does / why we need it:

Adds API proposal to GEP-1742.
Related issues: #1742

Which issue(s) this PR fixes:

Does this PR introduce a user-facing change?:


@k8s-ci-robot k8s-ci-robot added kind/gep PRs related to Gateway Enhancement Proposal(GEP) cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 4, 2023
@k8s-ci-robot k8s-ci-robot requested a review from mikemorris May 4, 2023 19:32
@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 4, 2023
@k8s-ci-robot k8s-ci-robot requested a review from youngnick May 4, 2023 19:32
@k8s-ci-robot
Copy link
Contributor

Hi @frankbu. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

geps/gep-1742.md Outdated
Comment on lines 331 to 332
// One or more conditions can be specified using a ‘,’ delimited list.
RetryOn string `json:"retry_on,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If its a list it should be []string IMO

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

geps/gep-1742.md Outdated
// Specifies the conditions under which retry takes place.
// One or more conditions can be specified using a ‘,’ delimited list.
RetryOn string `json:"retry_on,omitempty"`
// Flag to specify whether the retries should retry to other localities.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems under-specified given there is no mention of "localities" anywhere else in the repo. Can we expand on this a bit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added another section below.

geps/gep-1742.md Outdated
```go
type HTTPRouteRule struct {
// Timeout for HTTP requests, disabled by default.
Timeout *time.Duration `json:"timeout,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we expand the specification for this a bit? Given the GEP already documents probably 20 types of "timeout" in https://gateway-api.sigs.k8s.io/geps/gep-1742/?h=1742#flow-diagrams-with-available-timeouts, having just a single field as "timeout" with no further explanation seems a bit ambiguous.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a paragraph above. PTAL

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of a flat field, should this be tiered, so you can start off with a request timeout and add things like connect timeout in the future ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arkodg Good question. I think it depends on whether implementations support configuring other kinds of timeouts at the HTTPRoute level. Based on the diagrams in the previous sections, I think Envoy does not, but unclear about the others.

@youngnick do you have more insight on this?

Copy link
Contributor

@kflynn kflynn May 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this GEP is actually defining two separate timeouts here... 🙂

  • HTTPRouteRule.timeout is defining an end-to-end timeout (a timeout for a single request as measured by the client), and
  • HTTPRouteRule.retries.perTryTimeout is defining a timeout for a single request as measured by the gateway rather than the client.

Both of these timeouts are useful! though, per the diagrams above, a given gateway may not be able to honor both.

Overall, I would prefer

HTTPRouteRule:
    timeouts:
        request: duration            # end-to-end as seen by the client
        workloadRequest: duration    # single request as seen by the gateway

with no perTryTimeout in the retry configuration. (And no, I'm not wedded to the names in the timeouts dict above.) I'd further suggest that the workloadRequest timeout not be part of the core.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kflynn This suggestion sounds quite good to me. The reason that I dragged retries into this GEP in the first place is because perTryTimeout was so tightly coupled to retries, but moving it out and generalizing it like you suggest allows for better separation of concerns and to only focus on Timeouts in this GEP. I'm happy to make this change but will give others some time to comment first.

geps/gep-1742.md Outdated
// Timeout per attempt for a given request, including the initial call and any retries.
// Default is the value of the HTTPRouteRule request Timeout.
PerTryTimeout *time.Duration `json:"per_try_timeout,omitempty"`
// Specifies the conditions under which retry takes place.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What conditions are valid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I explain below that HTTP error codes are valid and must be supported by implementations. Other implementation-specific identifier are also valid, depending on the implementation.

  1. Does that sound reasonable?
  2. Do you think I need to explain it more clearly in the below text?

geps/gep-1742.md Outdated

### Timeout values

Timeout and PerTryTimeout values are formatted as 1h/1m/1s/1ms and MUST BE >= 1ms.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used anywhere else in gateway-api or core k8s? Is there a more formal spec?

is 1h5m valid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find any other use of duration in gateway-api. I couldn't find it in core k8s either, but maybe not looking in the right place? Otherwise, I think this is just the Go time.Duration string serialization that we'd use?

@shaneutt
Copy link
Member

shaneutt commented May 9, 2023

/cc

Let's hold, to make sure more people get some time to review.

/hold

@k8s-ci-robot k8s-ci-robot requested a review from shaneutt May 9, 2023 14:41
@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 9, 2023
Copy link
Contributor

@kflynn kflynn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd vote for pulling retries into a separate GEP, but for pulling the perTryTimeout up into the main timeouts section. Thanks for diving into this!

geps/gep-1742.md Outdated
```go
type HTTPRouteRule struct {
// Timeout for HTTP requests, disabled by default.
Timeout *time.Duration `json:"timeout,omitempty"`
Copy link
Contributor

@kflynn kflynn May 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this GEP is actually defining two separate timeouts here... 🙂

  • HTTPRouteRule.timeout is defining an end-to-end timeout (a timeout for a single request as measured by the client), and
  • HTTPRouteRule.retries.perTryTimeout is defining a timeout for a single request as measured by the gateway rather than the client.

Both of these timeouts are useful! though, per the diagrams above, a given gateway may not be able to honor both.

Overall, I would prefer

HTTPRouteRule:
    timeouts:
        request: duration            # end-to-end as seen by the client
        workloadRequest: duration    # single request as seen by the gateway

with no perTryTimeout in the retry configuration. (And no, I'm not wedded to the names in the timeouts dict above.) I'd further suggest that the workloadRequest timeout not be part of the core.

geps/gep-1742.md Outdated

type HTTPRetry struct {
// Number of retries to be allowed for a given request.
Attempts int32 `json:"attempts,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have quite a few comments on retries, but I really think that the retry portion here needs to be pulled out into a different GEP. 🙂

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to this

@youngnick
Copy link
Contributor

Just a note to say I have seen this, but need a bit more time to give it a full review. I should hopefully be able to do this later on tomorrow.

@SRodi SRodi mentioned this pull request May 12, 2023
@frankbu
Copy link
Contributor Author

frankbu commented May 15, 2023

@youngnick @howardjohn @kflynn @arkodg and other reviewers:

I have made a number of updates to address reviewer comments and removed retry configuration from this proposal. This PR is now only addressing timeout configuration for GEP issue #1742. Will move discussion of retries to a separate GEP for issue #1731.

geps/gep-1742.md Outdated

There a 2 kinds of timeouts that can be configured in an `HTTPRouteRule`:

1. `timeouts.request` is an end-to-end timeout for a single request as measured by the client.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pedantic: we cannot measure the client end-to-end time, as there is latency between the client and us.

Copy link
Contributor

@kflynn kflynn May 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I originally wrote that description... uh... maybe describe it as "an end-to-end timeout for a single request from the client to the Gateway API implementation", contrasted with "a single backend request from the Gateway API implementation to a workload"?

This would probably benefit from repeating the diagrams above to show where the two timeouts should measure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this is not so pedantic. At a high level, the "useful" timeout that we're trying to support here is basically the time it takes to get a response. In the various diagrams above, it seems like there is a timeout that corresponds to the Finishes Response arrow, which is more-or-less what we want, but the question of when the timeout starts measuring is important. John pointed out to me that in the case of Envoy, the implementation route timeout (R timeout in the above diagram) does not start until the entire downstream request stream has been received. Not sure about Nginx and other impls, but seems we need to be more flexible with this definition, to make it implementable.

Maybe something like this:

timeouts.request is the timeout for the Gateway API implementation to send a response to a client HTTP request. Whether the gateway starts the timeout before or after the entire client request stream has been received, is implementation dependent.

Maybe we could first suggest the timeout begins after the entire client request stream has been received, and see if that is a problem for non-Envoy implementations?

Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the diagrams I did earlier (needed to recover context myself), I think that it's going to be difficult to define this timeout in such a way that all the implementations can implement it.

I was going to say something like "from when the proxy has finished receiving the request from the client to when it starts sending the response back to the client", but some proxies can't do that, I think.

I'd definitely encourage @frankbu to take the initial source diagram and try and mark where the request timeout would cover on a diagram for this section, and for the backendRequest timeout as well.

It may be that we need to be more vague here and say something like "this timeout covers as close to the whole request->response transaction as is possible for the proxy implementation, given its configurable timeouts." and then leave some of it up to the implementations. This would have the downside of being difficult to test though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the common timeout we want to handle is for when it's taking too long to get the backend response. We should be able to test that, even if we make the spec a little vague about whether the timeout value includes other request overhead or not.

I'll update the wording to make it more clear, including adding the diagram with the proposed timeouts marked.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

geps/gep-1742.md Outdated
A client HTTP request to a gateway may result in more than one call to the destination
backend service, for example, if automatic retries are supported. Some implementations
support configuring a timeout for the individual backend requests, seperate from the
overall client request timeout. Adding None-Core support for this would also benefit
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/None-Core/non-Core/ ?

Copy link
Contributor

@kflynn kflynn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@frankbu Thanks much!! I left a couple of small comments about wordsmithing but honestly, I'd be OK merging this. Others may have other thoughts, of course. 🙂

Copy link
Contributor

@youngnick youngnick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, the structure looks okay to me - although I think that as it stands, this proposal will not be able to be combined with a TimeoutPolicy to default or override these fields, as we currently don't allow Policy fields to mess with the values of fields within an object.

That would just mean that this setting would only be configurable on two levels: a default in the implementation somehow, configured via runtime config, environment variables, or similar, and then a HTTPRoute-specific value which must be configured on every HTTPRoute which needs a different timeout to the default. (The additional boilerplate adds up over lots of separate objects over time, in my experience).

@frankbu
Copy link
Contributor Author

frankbu commented May 16, 2023

I think that as it stands, this proposal will not be able to be combined with a TimeoutPolicy to default or override these fields, as we currently don't allow Policy fields to mess with the values of fields within an object.

@youngnick I understand and agree that a TimeoutPolicy shouldn't be able to override the fields in the route rule, but why couldn't we (in the future) allow a TimeoutPolicy to set the default values at the namespace, gateway, etc. level?

@k8s-ci-robot k8s-ci-robot removed the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 16, 2023
@frankbu
Copy link
Contributor Author

frankbu commented Jun 13, 2023

@youngnick

I suggest HTTPRouteRequestTimeout and HTTPRouteBackendTimeout.

SGTM. Should that be included in this document? If so where?

Maybe it should be included in the comments of the Go APIs? #2013

@youngnick
Copy link
Contributor

I think we should add a new standard section to GEPs - "Conformance Details", that's a subsection of the "API" section, that lists the feature name(s) so that they're agreed on as part of this, and then can be used in standard ways as part of the conformance tests and profiles (this is particularly important for Extended features.)

@shaneutt @mlavacca and anyone else who's working on the conformance profiles stuff.

@frankbu
Copy link
Contributor Author

frankbu commented Jun 14, 2023

@youngnick @shaneutt @mlavacca and others - Added Nick's suggested "Conformance Details" section. PTAL.

hawkw added a commit to linkerd/linkerd2-proxy that referenced this pull request Jun 14, 2023
The latest proxy-api release, v0.10.0, adds fields to the
`OutboundPolicies` API for configuring HTTP request timeouts, based on
the proposed changes to HTTPRoute in kubernetes-sigs/gateway-api#1997.

This branch updates the proxy-api dependency to v0.10.0 and adds the new
timeout configuration fields to the proxy's internal client policy
types. In addition, this branch adds a timeout middleware to the HTTP
client policy stack, so that the timeout described by the
`Rule.request_timeout` field is now applied.

Implementing the `RouteBackend.request_timeout` field with semantics as
close as possible to those described in GEP-1742 will be somewhat more
complex, and will be added in a separate PR.
the following feature names:

- `HTTPRouteRequestTimeout`: supports `rules.timeouts.request` in an `HTTPRoute`.
- `HTTPRouteBackendTimeout`: supports `rules.timeouts.backendRequest` in an `HTTPRoute`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we call it HTTPRouteBackendRequestTimeout? What if we add other backend timeouts in the future?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's necessary if we think of request as the "default" backend timeout. If we add other ones in the future, e.g., rules.timeouts.backendIdle, they need to include the kind explicitly: HTTPRouteBackendIdelTimeout.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to let the feature name matches the field name. But it's fine to keep the current name.

hawkw added a commit to linkerd/linkerd2-proxy that referenced this pull request Jun 15, 2023
Depends on #2418 

The latest proxy-api release, v0.10.0, adds fields to the
`OutboundPolicies` API for configuring HTTP request timeouts, based on
the proposed changes to HTTPRoute in kubernetes-sigs/gateway-api#1997.
PR #2418 updates the proxy to depend on the new proxy-api release, and
implements the `Rule.request_timeout` field added to the API. However,
that branch does *not* add a timeout for the
`RouteBackend.request_timeout` field. This branch changes the proxy to
apply the backend request timeout when configured by the policy
controller.

This branch implements `RouteBackend.request_timeout` by adding an
additional timeout layer in the `MatchedBackend` stack. This applies the
per-backend timeout once a backend is selected for a route. I've also
added stack tests for the interaction between the request and backend
request timeouts.

Note that once retries are added to client policy stacks, it may be
necessary to move the backend request timeout to ensure it occurs
"below" retries, depending on where the retry middleware ends up being
located in the proxy stack.
SRodi added a commit to SRodi/gateway-api that referenced this pull request Jun 16, 2023
* update HTTPRoute interfaces according to latest proposal in GEP-1742
* include Request and BackendRequest in HTTPRouteTimeouts
* add validation for HTTPRouteTimeouts
* add unit tests in httproute_test.go
* relates to kubernetes-sigs#1997
SRodi added a commit to SRodi/gateway-api that referenced this pull request Jun 16, 2023
* update HTTPRoute interfaces according to latest proposal in GEP-1742
* include Request and BackendRequest in HTTPRouteTimeouts
* add validation for HTTPRouteTimeouts
* add unit tests in httproute_test.go
* relates to kubernetes-sigs#1997
SRodi added a commit to SRodi/gateway-api that referenced this pull request Jun 16, 2023
* update HTTPRoute interfaces according to latest proposal in GEP-1742
* include Request and BackendRequest in HTTPRouteTimeouts
* add validation for HTTPRouteTimeouts
* add unit tests in httproute_test.go
* relates to kubernetes-sigs#1997
@youngnick
Copy link
Contributor

Thanks for the work @frankbu, this LGTM. I'll leave someone else to unhold though.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 26, 2023
@robscott
Copy link
Member

Thanks @frankbu!

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 26, 2023
@k8s-ci-robot k8s-ci-robot merged commit 592d31a into kubernetes-sigs:main Jun 26, 2023
SRodi added a commit to SRodi/gateway-api that referenced this pull request Aug 25, 2023
* update HTTPRoute interfaces according to latest proposal in GEP-1742
* add Duration type as per GEP-2257
* include Request and BackendRequest in HTTPRouteTimeouts
* add validation for HTTPRouteTimeouts
* add unit tests in httproute_test.go
* relates to kubernetes-sigs#1997
SRodi added a commit to SRodi/istio that referenced this pull request Sep 14, 2023
@frankbu frankbu deleted the request-timeout branch September 25, 2023 14:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/gep PRs related to Gateway Enhancement Proposal(GEP) lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.