
Replace HTTP compression with a more scoped impl, only use on responses > 128KB #77449

Merged: 2 commits into kubernetes:master on Jul 9, 2019

Conversation

@smarterclayton (Contributor) commented May 4, 2019

The previous HTTP compression implementation functioned as an HTTP filter, which required it to deal with a number of special cases that complicated the implementation and prevented it from ever being turned on by default.

Instead, when we write an API object to a response, handle only the one case of a valid Kube object being encoded to the output. This allows a more limited implementation that does not impact other code flows, is easier to reason about, and can be promoted to beta.

Because Golang clients request gzipping by default, and gzip has a significant CPU cost on small requests, ignore requests to compress objects that are smaller than 128KB in size. The goal of this feature is to reduce bandwidth and latency requirements on large lists, even with chunking, and 128KB is smaller than a 500 pod page but larger than almost any single object request.
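A minimal sketch of that shape (hypothetical names; not the actual apiserver code, which plugs into the serializer and response-writer machinery): encode the object first, then gzip only if the client asked for it and the payload clears the threshold.

```go
package example

import (
	"compress/gzip"
	"net/http"
	"strings"
)

// compressionThreshold is the illustrative 128KB cutoff below which gzip
// costs more CPU than the bandwidth it saves.
const compressionThreshold = 128 * 1024

// writeEncodedObject writes an already-encoded API object, compressing it
// only when the client accepts gzip and the payload is large enough to benefit.
func writeEncodedObject(w http.ResponseWriter, req *http.Request, encoded []byte) error {
	acceptsGzip := strings.Contains(req.Header.Get("Accept-Encoding"), "gzip")
	if acceptsGzip && len(encoded) >= compressionThreshold {
		w.Header().Set("Content-Encoding", "gzip")
		gw := gzip.NewWriter(w)
		if _, err := gw.Write(encoded); err != nil {
			return err
		}
		return gw.Close()
	}
	_, err := w.Write(encoded)
	return err
}
```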

Also fixes a bug introduced in #50342: httpResponseWriterWithInit.Write wasn't a pointer receiver, so the init code was called repeatedly:

2019/05/04 19:15:31 http: superfluous response.WriteHeader call from k8s.io/apiserver/pkg/endpoints/handlers/responsewriters.httpResponseWriterWithInit.Write (writers.go:56)
2019/05/04 19:15:31 http: superfluous response.WriteHeader call from k8s.io/apiserver/pkg/endpoints/handlers/responsewriters.httpResponseWriterWithInit.Write (writers.go:56)
2019/05/04 19:15:31 http: superfluous response.WriteHeader call from k8s.io/apiserver/pkg/endpoints/handlers/responsewriters.httpResponseWriterWithInit.Write (writers.go:56)
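For illustration, a stripped-down hypothetical of the receiver bug (not the actual writers.go type): with a value receiver the guard flag is set on a copy, so the init path, including WriteHeader, runs on every Write.

```go
package example

import "net/http"

type responseWriterWithInit struct {
	hasWritten bool
	w          http.ResponseWriter
}

// Broken: value receiver. hasWritten is updated on a copy that is thrown
// away when the method returns, so WriteHeader fires on every call and Go
// logs "superfluous response.WriteHeader call".
func (rw responseWriterWithInit) WriteValueReceiver(b []byte) (int, error) {
	if !rw.hasWritten {
		rw.w.WriteHeader(http.StatusOK)
		rw.hasWritten = true // lost on return
	}
	return rw.w.Write(b)
}

// Fixed: pointer receiver. The flag persists across calls, so the header
// is written exactly once.
func (rw *responseWriterWithInit) Write(b []byte) (int, error) {
	if !rw.hasWritten {
		rw.w.WriteHeader(http.StatusOK)
		rw.hasWritten = true
	}
	return rw.w.Write(b)
}
```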

/kind bug

KEP kubernetes/enhancements#1115

Kubernetes now supports transparent compression of API responses. Clients that send `Accept-Encoding: gzip` will now receive a GZIP compressed response body if the API call was larger than 128KB.  Go clients automatically request gzip-encoding by default and should see reduced transfer times for very large API requests.  Clients in other languages may need to make changes to benefit from compression.
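Go's net/http transport already requests and decompresses gzip transparently; for clients that have to opt in by hand, the moving parts are just the request header and a gzip reader. A hedged Go sketch of the explicit equivalent (hypothetical helper, not client-go code):

```go
package example

import (
	"compress/gzip"
	"io"
	"net/http"
)

// getPossiblyGzipped opts in to compressed responses explicitly and
// decompresses only when the server actually applied gzip (small responses
// come back uncompressed even when the client asked).
func getPossiblyGzipped(url string) (io.ReadCloser, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept-Encoding", "gzip")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	if resp.Header.Get("Content-Encoding") == "gzip" {
		// caller still owns resp.Body and must close it
		return gzip.NewReader(resp.Body)
	}
	return resp.Body, nil
}
```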

@k8s-ci-robot added the release-note, kind/api-change, size/XL, cncf-cla: yes, needs-sig, needs-priority, area/apiserver, and sig/api-machinery labels and removed the needs-sig label May 4, 2019
@k8s-ci-robot requested review from CaoShuFeng and sttts May 4, 2019 22:16
@k8s-ci-robot added the approved label May 4, 2019
@k8s-ci-robot added the kind/bug label May 4, 2019
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@smarterclayton (Contributor, Author)

@liggitt since you reviewed #50342

@fejta-bot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@smarterclayton (Contributor, Author)

/retest

@smarterclayton (Contributor, Author)

# 99th for this PR on cluster lists
$ api-100 https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/77449/pull-kubernetes-e2e-gce-100-performance/1124864914138075139/artifacts/APIResponsiveness_density_2019-05-05T03:01:32Z.json
LIST  cronjobs                cluster  5.486   46
LIST  namespaces              cluster  13.926  16
LIST  jobs                    cluster  14.989  46
LIST  persistentvolumes       cluster  26.484  16
LIST  pods                    cluster  31.564  4
LIST  replicationcontrollers  cluster  33.16   6
LIST  services                cluster  42.151  6
LIST  nodes                   cluster  74.265  117

# 99th for master(ish) on cluster lists
$ api-100 https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/77341/pull-kubernetes-e2e-gce-100-performance/1124998416011628545/artifacts/APIResponsiveness_density_2019-05-05T11:51:20Z.json
LIST  services                cluster  1.796    4
LIST  pods                    cluster  5.508    5
LIST  jobs                    cluster  9.609    36
LIST  replicationcontrollers  cluster  15.28    6
LIST  cronjobs                cluster  33.418   36
LIST  namespaces              cluster  86.998   12
LIST  persistentvolumes       cluster  89.04    12
LIST  nodes                   cluster  194.802  90

Same but namespace scoped

○ api-100 https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/77449/pull-kubernetes-e2e-gce-100-performance/1124864914138075139/artifacts/APIResponsiveness_density_2019-05-05T03:01:32Z.json
LIST  limitranges             namespace  3.034    16
LIST  secrets                 namespace  4.852    159
LIST  replicasets             namespace  10.3     16
LIST  ingresses               namespace  17.373   16
LIST  endpoints               namespace  19.825   16
LIST  statefulsets            namespace  22.191   16
LIST  cronjobs                namespace  25.027   16
LIST  persistentvolumeclaims  namespace  42.206   16
LIST  daemonsets              namespace  46.348   16
LIST  pods                    namespace  48.785   1019
LIST  configmaps              namespace  51.115   119
LIST  resourcequotas          namespace  68.432   17
LIST  jobs                    namespace  82.747   16
LIST  replicationcontrollers  namespace  88.394   16
LIST  deployments             namespace  91.022   16
LIST  services                namespace  128.396  16
○ api-100 https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/77341/pull-kubernetes-e2e-gce-100-performance/1124998416011628545/artifacts/APIResponsiveness_density_2019-05-05T11:51:20Z.json
LIST  persistentvolumeclaims  namespace  6.919   12
LIST  secrets                 namespace  7.289   136
LIST  endpoints               namespace  19.084  12
LIST  configmaps              namespace  22.612  79
LIST  cronjobs                namespace  23.684  12
LIST  limitranges             namespace  24.617  14
LIST  daemonsets              namespace  27.742  12
LIST  replicasets             namespace  29.692  12
LIST  statefulsets            namespace  29.859  12
LIST  pods                    namespace  31.246  1015
LIST  replicationcontrollers  namespace  32.53   12
LIST  deployments             namespace  37.013  12
LIST  resourcequotas          namespace  38.098  14
LIST  jobs                    namespace  38.902  12
LIST  services                namespace  43.577  12
LIST  ingresses               namespace  68.877  12

It looks like this pulls in the p99 tail on large lists significantly, at the tradeoff of slightly higher latency on small lists. We could potentially tune this to a threshold higher than 16KB - for instance 32KB or 128KB - which would potentially reduce tail latency on lists.

@kubernetes/sig-scalability-pr-reviews

@k8s-ci-robot added the sig/scalability label May 5, 2019
@smarterclayton changed the title from "Replace HTTP compression with an inline handler" to "Replace HTTP compression with an inline handler, only use on responses > 16KB" May 5, 2019
@smarterclayton (Contributor, Author)

/retest

@jennybuckley

/cc @wojtek-t

@smarterclayton (Contributor, Author)

/retest

1 similar comment
@smarterclayton (Contributor, Author)

/retest

@smarterclayton (Contributor, Author) commented Jun 26, 2019

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/77449/pull-kubernetes-e2e-gce-100-performance/1143560782466781184/artifacts/APIResponsiveness_density_2019-06-25T17:22:09Z.json

LIST  configmaps              cluster  0.143  0.143   0.143    1
LIST  pods                    cluster  3.036  3.468   3.468    7
LIST  services                cluster  1.373  3.677   3.677    4
LIST  persistentvolumes       cluster  0.973  2.805   4.373    12
LIST  cronjobs                cluster  0.654  3.368   5.06     36
LIST  jobs                    cluster  0.866  3.346   5.763    36
LIST  namespaces              cluster  0.981  3.143   9.611    12
LIST  replicationcontrollers  cluster  0.959  21.873  21.873   6
LIST  nodes                   cluster  1.801  17.66   109.361  90

Nodes in particular is 30% below the p99 of perf-dash right now.

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/77449/pull-kubernetes-e2e-gce-100-performance/1144003791448707072/artifacts/APIResponsiveness_density_2019-06-26T22:33:50Z.json

LIST  services                cluster  1.627  1.714   1.714    4
LIST  pods                    cluster  3.244  5.241   5.241    6
LIST  jobs                    cluster  1.111  5.045   8.967    36
LIST  namespaces              cluster  1.049  5.849   9.068    12
LIST  cronjobs                cluster  0.868  3.973   9.692    36
LIST  replicationcontrollers  cluster  1.39   27.937  27.937   6
LIST  persistentvolumes       cluster  1.263  7.879   49.757   12
LIST  nodes                   cluster  1.97   20.387  205.261  90

Normal

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/77449/pull-kubernetes-e2e-gce-100-performance/1144281746527752192/artifacts/APIResponsiveness_density_2019-06-27T16:57:09Z.json

LIST  configmaps              cluster  0.204  0.204   0.204    1
LIST  namespaces              cluster  1.033  1.391   1.399    12
LIST  services                cluster  1.265  1.444   1.444    4
LIST  persistentvolumes       cluster  1.146  2.591   4.382    12
LIST  pods                    cluster  3.539  5.169   5.169    5
LIST  replicationcontrollers  cluster  1.778  43.263  43.263   6
LIST  cronjobs                cluster  0.818  13.377  53.772   37
LIST  jobs                    cluster  1.531  9.668   117.324  37
LIST  nodes                   cluster  1.892  21.964  119.049  95

Lower

@wojtek-t (Member)

Nodes in particular is 30% below the p99 of perf-dash right now.

There is significant variance there, but I agree the results look very promising.

I would like to also run a test on larger scale, once we get out of the last regressions we have: #79096

@smarterclayton (Contributor, Author)

The variance before was all measured where we had the bucketing problem I believe.

I would trade some P99 on large requests in cluster for a dramatically reduced P99 outside the cluster. We don't have an easy way to measure that unless we simulate constrained bandwidth for a client.

@wojtek-t (Member)

I would trade some P99 on large requests in cluster for a dramatically reduced P99 outside the cluster. We don't have an easy way to measure that unless we simulate constrained bandwidth for a client.

That sounds reasonable to me - I would just like to know how much we are trading (if it's even visible). I'm definitely NOT saying "we can't do this if it grows at all".

@smarterclayton (Contributor, Author)

/test pull-kubernetes-e2e-gce-100-performance

@smarterclayton (Contributor, Author)

Fortunately this is easy to test - we can just gate it on or off. All clients request compression automatically. We can also tune the threshold up.

@smarterclayton (Contributor, Author)

Are we ready to try the larger run test now that the other blocker was resolved?

@wojtek-t (Member) commented Jul 3, 2019

Are we ready to try the larger run test now that the other blocker was resolved?

I asked @krzysied to patch this to the experiments he is running offline, the plan was to run something over night - will get back to you when I know if that happened or not.

@wojtek-t self-assigned this Jul 3, 2019
@wojtek-t (Member) commented Jul 3, 2019

I asked @krzysied to patch this to the experiments he is running offline, the plan was to run something over night - will get back to you when I know if that happened or not.

We don't have full results because we were running some other experiments during that test, so it didn't finish. But what we've seen looked good enough that I'm fine with this PR from a scalability POV.

@smarterclayton - do you want me to review the code too? (I can do that in the next 2 days.)

@smarterclayton (Contributor, Author)

Yes please. Jordan signed off on the KEP, and seeing the variance over time is the biggest factor. This is ready to review.

@wojtek-t (Member) left a comment

Just a couple nits - other than that lgtm.

}

// make a best effort to write the object if a failure is detected
utilruntime.HandleError(fmt.Errorf("apiserver was unable to write a JSON response: %v", err))
wojtek-t (Member):
How do we know that it was JSON?

wojtek-t (Member):
I see it's moved from a different place, but if I'm not missing something it would make sense to fix this comment if you're touching this code.

smarterclayton (Contributor, Author):
If encode fails, the response is always JSON. The assumption is that encode normally won't fail (in which case we exit early above); if it does fail, we just spit out what we have.

wojtek-t (Member):
ok - that makes sense now

if len(encoding) == 0 {
return ""
}
if !utilfeature.DefaultFeatureGate.Enabled(features.APIResponseCompression) {
wojtek-t (Member):
We probably want to check it as the first thing in this method (to avoid unnecessary work otherwise).

smarterclayton (Contributor, Author):
I had it this way because the feature gate check is probably more expensive than the map lookup. But it could be the other way if you think it's easier to read (I think it would perform slightly worse).

wojtek-t (Member):
I don't have strong opinion - we can change that later too if needed.
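For reference, a hedged sketch of the ordering being weighed here (standalone hypothetical; gateEnabled stands in for the feature-gate check, and this is not the actual function):

```go
package example

// negotiateEncoding keeps the cheap header check first, on the theory that
// the feature-gate lookup is the more expensive of the two calls.
func negotiateEncoding(acceptEncoding string, gateEnabled func() bool) string {
	if len(acceptEncoding) == 0 {
		return ""
	}
	if !gateEnabled() {
		return ""
	}
	// ... match acceptEncoding against the supported encodings (gzip) ...
	return "gzip"
}
```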

@@ -0,0 +1,303 @@
/*
Copyright 2016 The Kubernetes Authors.
wojtek-t (Member):
nit: 2019

req: &http.Request{Header: http.Header{}},
wantCode: http.StatusBadRequest,
wantHeaders: http.Header{"Content-Type": []string{"application/json"}},
wantBody: smallPayload,
wojtek-t (Member):
Why do we return an object for BadRequest?

wojtek-t (Member):
I guess I'm missing something, because you do this in a couple of other tests below too.

smarterclayton (Contributor, Author):
In this case it's emulating a custom registry endpoint (an aggregated API) that returns a valid object with a bad-request status, which is technically possible. I wanted a test that captures behavior that is used infrequently (an error code with a valid object) but can show up in some cases - if you had a custom response type that implemented APIStatus, you could return an object plus an error code today.

wojtek-t (Member):
Thanks for explanation - that makes a lot of sense to me now.

statusCode: http.StatusOK,
wantCode: http.StatusNotFound,
wantHeaders: http.Header{"Content-Type": []string{"text/plain"}},
wantBody: []byte("NotFound: \"test\" not found"),
wojtek-t (Member):
Feel free to ignore, but it seems we have some additional whitespace somewhere in the stack...

smarterclayton (Contributor, Author):
Yeah, I was actually testing the exact output. The two spaces separate an empty resource string (same message as before this PR).

},

{
name: "errors are compressed",
wojtek-t (Member):
nice test

@wojtek-t (Member) left a comment

There is only one nit (s/2014/2019) and I don't want to block this PR on it.

/lgtm

@k8s-ci-robot added the lgtm label Jul 9, 2019
@k8s-ci-robot merged commit 7c7d70b into kubernetes:master Jul 9, 2019
@smarterclayton (Contributor, Author)

Encouraging:

[graph]

Significant reduction in p99 on cluster nodes LIST

Looks like CPU is variable, which is interesting:

[graph]

@smarterclayton (Contributor, Author)

Scratch that - it looks like there's overlap with the replica set / deployment change from yesterday, and those runs are newer.

@smarterclayton (Contributor, Author)

Even taking the RS/DS change into account, I'd say we're within bounds on p99 (maybe a bit more variability across runs). I will continue to monitor today and tomorrow, but it looks like a small win in some flows and no huge impact elsewhere.

New: func() interface{} {
gw, err := gzip.NewWriterLevel(nil, defaultGzipContentEncodingLevel)
if err != nil {
panic(err)
@tedyu (Contributor) commented Jul 9, 2019

Should an error be returned here?

In deferredResponseWriter#Write, the error can be returned to the caller.

See #79943
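For context, a hedged sketch of the pool pattern in question (the constant's value is illustrative): gzip.NewWriterLevel only returns an error for an out-of-range level, so with a fixed, valid constant the panic is effectively unreachable, which is presumably why the error isn't plumbed out of the pool's New func.

```go
package example

import (
	"compress/gzip"
	"sync"
)

// Illustrative compression level; any level gzip accepts makes the
// NewWriterLevel error impossible in practice.
const defaultGzipContentEncodingLevel = 4

// gzipPool reuses gzip writers across responses. sync.Pool's New cannot
// return an error, so an invalid level (a programmer error) can only panic.
var gzipPool = &sync.Pool{
	New: func() interface{} {
		gw, err := gzip.NewWriterLevel(nil, defaultGzipContentEncodingLevel)
		if err != nil {
			panic(err)
		}
		return gw
	},
}
```

A caller would Get a writer from the pool, Reset it onto the response writer, write through it, and Put it back after Close; any write or close errors then surface on the request path rather than inside the pool.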

@wojtek-t (Member)

Even taking the RS/DS change into account, I'd say we're within bounds on p99 (maybe a bit more variability across runs). I will continue to monitor today and tomorrow, but it looks like a small win in some flows and no huge impact elsewhere.

I think that at the 100-node scale the change is small enough that it's not easy to say what happened.

When looking at the 5k-node scale, things like node listing regressed visibly (because those calls come from within the cluster, actually even via localhost, where network throughput and latency are not a problem):
http://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=APIServer&metricname=LoadResponsiveness&Resource=nodes&Scope=cluster&Subresource=&Verb=LIST
but for deployments (which are listed externally, from the test framework) a significant drop is visible:
http://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=APIServer&metricname=LoadResponsiveness&Resource=deployments&Scope=cluster&Subresource=&Verb=LIST
We should also merge this somewhat with the following graph, due to the change from RCs to Deployments:
http://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=APIServer&metricname=LoadResponsiveness&Resource=replicationcontrollers&Scope=cluster&Subresource=&Verb=LIST

So I would say that it's a reasonable tradeoff to take.

@smarterclayton (Contributor, Author)

I'm wondering whether there is an additional heuristic we could add on the client side to suppress gzip encoding when making requests to localhost. It's kind of a weak heuristic though.

@smarterclayton (Contributor, Author)

We could also have a way for clients to bypass compression when running in certain modes (e.g. kcm and the scheduler disabling it).

@wojtek-t (Member)

We could also have a way for clients to bypass compression when running in certain modes (e.g. kcm and the scheduler disabling it).

Yeah - that is exactly what I was thinking about: add an option to the client config to disable compression, default it to false (i.e. compression enabled by default), and disable it only in the scheduler and kcm (which I believe would be enough).
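As one hedged illustration of the client-side bypass (using the standard library knob; the client-config option discussed here did not exist yet at this point in the thread):

```go
package example

import "net/http"

// newUncompressedTransport never advertises Accept-Encoding: gzip, so
// in-cluster callers like kcm and the scheduler would skip compression
// entirely and avoid the decompression cost on large localhost lists.
func newUncompressedTransport() *http.Transport {
	return &http.Transport{
		// DisableCompression stops the transport from requesting gzip
		// and from transparently decompressing responses.
		DisableCompression: true,
	}
}
```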

@sftim (Contributor) commented Nov 25, 2021

Also see kubernetes/website#30639
