Upgrade from v1.12.3 to v1.14.4 leads to more CPU usage without traffic #12080
I'm pretty skeptical of these symbols. Can you get a trace with a binary with debug symbols?
I forgot to mention that I also have a custom Envoy (we use some internal plugins). Currently I build Envoy with
If you are stripping the symbols, you have to actually make sure perf, etc. can find them, or just don't strip.
I just reused https://github.com/envoyproxy/envoy/blob/master/ci/Dockerfile-envoy, which uses the stripped dir; I guess I need to use
btw, are there any performance downsides to using stripped vs. non-stripped binaries? E.g. slower startup due to the ~1 GB Envoy binary size, etc.?
No, no downsides other than increased disk cost, transfer, etc.
@fxposter Those are just mangled symbols, regardless of whether you have debug symbols or not. For example
pprof should be able to demangle it with
Hm, while running with the non-stripped binary, in the end I get the same flamegraph/top with the same symbols :(
@lizan yes, I tried c++filt; the flamegraph stack traces are still quite weird though :(
Regarding "--symbols": are you talking about https://github.com/google/pprof? I don't see such an option there.
BTW, the screenshots were taken from pprof run on macOS, while the Envoys were running in Docker on Linux machines.
To this point: when running pprof on macOS (without specifying the binary), it shows mangled names.
I suspect the code that applies new Envoy configurations coming from the control plane now takes much more time than it did before. Flat %:
Cum %:
macOS has different symbol mangling rules, so if the pprof output is from Linux it might not demangle well. Aside from that, looking at the pprof, it seems the VersionConverter is the source of the CPU usage. I believe it's because your control plane is sending Envoy v2 config instead of v3, so Envoy converts it to v3 internally. The v3 API has been in place since 1.13. If your control plane sends v3 protos, I think it will come back down closer to the original CPU usage. cc @htuch
Should it be v3 "all-in"? I.e., if I just start returning v3 resources while keeping all Envoy configs the same (we don't specify a resource version and hence use the default AUTO), will it still be subject to the extra CPU usage? In that case the control plane will receive a request for a v2 resource, respond with a v3 one, and Envoy will parse it as v3 and convert it to v2?
Because just switching to sending v3 resources does not seem to help with CPU usage.
If you send all your resources as v3, there should be no conversion happening in Envoy at all; you should not see
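For illustration, a minimal, hypothetical go-control-plane sketch of what "sending v3 resources" means on the control plane side: the snapshot is built from the envoy/config/*/v3 proto packages, so v3 protos go on the wire and Envoy's internal v2-to-v3 converter has nothing to do. The node ID and resource names are made up, and the exact NewSnapshot/SetSnapshot signatures shown here are as of go-control-plane v0.9.x; newer releases differ.

```go
// Hypothetical sketch: publish v3 resources so Envoy does not need its
// internal v2 -> v3 conversion. Not the reporter's actual control plane.
package main

import (
	"log"

	clusterv3 "github.com/envoyproxy/go-control-plane/envoy/config/cluster/v3"
	routev3 "github.com/envoyproxy/go-control-plane/envoy/config/route/v3"
	cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	"github.com/envoyproxy/go-control-plane/pkg/cache/types"
)

func main() {
	// ADS-enabled snapshot cache keyed by node ID.
	snapshotCache := cachev3.NewSnapshotCache(true, cachev3.IDHash{}, nil)

	// Build resources with the v3 proto packages (envoy/config/*/v3),
	// not the deprecated v2 ones, so v3 protos go on the wire.
	cluster := &clusterv3.Cluster{Name: "example_cluster"}
	routes := &routev3.RouteConfiguration{Name: "example_routes"}

	// Signature as of go-control-plane v0.9.x; later versions changed it.
	snapshot := cachev3.NewSnapshot(
		"v1", // snapshot version
		nil,  // endpoints
		[]types.Resource{cluster},
		[]types.Resource{routes},
		nil, // listeners
		nil, // runtimes
	)
	if err := snapshotCache.SetSnapshot("example-node-id", snapshot); err != nil {
		log.Fatal(err)
	}
}
```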
I mean, I changed my control plane (based on go-control-plane) to still use cache/v2 and server/v2 from the library and to register the API from discovery/v2 (we use ADS). According to the docs I can do that and Envoy will accept v3 resources from v2 APIs. But when I do a config_dump afterwards, I see that under, for example,
Envoy can accept v2 APIs and convert internally to v3. The point above is that if you want performance, you need to put v3 resources on the wire at the control plane, so Envoy doesn't need to do the work of converting.
Yes, I got that. My question is different: if I send v3 resources on the wire (i.e. actual v3 resources are marshaled on the control plane side), but the lds/cds/rds references in Envoy's config are still AUTO (i.e. Envoy asks the control plane for v2 resources but receives v3 resources back on the wire), is that enough to avoid the constant conversion, or not?
I think you probably need to change resource versions to v3, otherwise I don't see how it works given #10776.
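For reference, moving fully to the v3 packages on the go-control-plane side (the counterparts of the cache/v2, server/v2, and discovery/v2 packages mentioned above) looks roughly like the sketch below; the port and the nil callbacks/logger are placeholders, not recommendations. Envoy's bootstrap additionally needs transport_api_version and resource_api_version set to V3 on its xDS config sources so it requests v3 resources instead of AUTO/v2.

```go
// Hypothetical sketch: register a go-control-plane v3 ADS server so that v3
// Discovery{Request,Response} messages are used end to end.
package main

import (
	"context"
	"log"
	"net"

	discoveryv3 "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
	cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	serverv3 "github.com/envoyproxy/go-control-plane/pkg/server/v3"
	"google.golang.org/grpc"
)

func main() {
	snapshotCache := cachev3.NewSnapshotCache(true, cachev3.IDHash{}, nil)
	srv := serverv3.NewServer(context.Background(), snapshotCache, nil)

	grpcServer := grpc.NewServer()
	// v3 aggregated discovery service (ADS) registration.
	discoveryv3.RegisterAggregatedDiscoveryServiceServer(grpcServer, srv)

	lis, err := net.Listen("tcp", ":18000") // placeholder port
	if err != nil {
		log.Fatal(err)
	}
	log.Fatal(grpcServer.Serve(lis))
}
```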
CC @jmarantz for stats overheads. |
Thanks for the heads up. Looks like RDS is re-creating a lot of stats on every update? I suspect this code could be optimized a bit:
in source/common/router/config_impl.cc, in VirtualHostImpl::VirtualHostImpl. I think it's re-creating the exact same block of stats for every virtual cluster in every virtual host. Not sure of the cardinality of this in general, or for @fxposter's config in particular. Does anyone have more context around this code path?
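To make the suggested optimization concrete, here is a small illustrative sketch (in Go, not Envoy's actual C++): instead of re-creating an identical block of stats for every virtual cluster in every virtual host on each RDS update, a cache keyed by stat name constructs each block once and reuses it. All type and function names here are made up.

```go
// Illustrative only: the kind of "construct once, reuse on later updates"
// caching being suggested above, not Envoy's real stats code.
package main

import (
	"fmt"
	"sync"
)

// StatBlock stands in for the group of counters/histograms a virtual cluster
// needs; constructing one is assumed to be relatively expensive (name
// building, tag extraction, allocation).
type StatBlock struct{ name string }

func newStatBlock(name string) *StatBlock {
	// ...expensive construction, tag extraction, etc...
	return &StatBlock{name: name}
}

// StatBlockCache returns the existing block for a name, constructing it only
// the first time that name is seen across config updates.
type StatBlockCache struct {
	mu     sync.Mutex
	blocks map[string]*StatBlock
}

func (c *StatBlockCache) Get(name string) *StatBlock {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.blocks == nil {
		c.blocks = make(map[string]*StatBlock)
	}
	if b, ok := c.blocks[name]; ok {
		return b // reuse instead of re-creating on every update
	}
	b := newStatBlock(name)
	c.blocks[name] = b
	return b
}

func main() {
	c := &StatBlockCache{}
	a := c.Get("vhost.example.vcluster.foo")
	b := c.Get("vhost.example.vcluster.foo")
	fmt.Println(a == b) // true: the block is created once and shared
}
```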
Our config may be quite different from others'. Yes, we have infrequent (once every couple of minutes) updates of clusters, and each cluster has a separate path in the same domain in the RouteConfiguration (or our control plane is a bit broken at this point; this is a separate cluster where I was playing with the new Envoy version, changing Envoy configs/versions/control plane code). But with the same control plane, Envoy 1.12 uses less CPU and is more stable in terms of CPU usage.
Oh, and the last flame graph looks like regexes in particular, but I don't think there's been a change in that area of stats recently, and that may be a side effect of wasteful re-creation of the same stats. I would have thought we'd have skipped tag extraction if a stat of the same name was already known.
If it's needed
Same configs sent to both 1.12 and 1.14?
On Thu, Jul 16, 2020 at 3:15 PM, Pavel wrote:
> This one is from 1.12; the overall structure is the same, but the sample time for 1.14 is 10 seconds, while for 1.12 it is 3 seconds.
> [image: https://user-images.githubusercontent.com/109216/87712617-99b2eb80-c7b1-11ea-9c2e-b1c3e957bc8a.png]
> CPU usage (blue and green are 1.14, red is 1.12)
> [image: https://user-images.githubusercontent.com/109216/87712717-c535d600-c7b1-11ea-9fd1-b833e4bdb042.png]
In this case, no, but sending the same configs results in the same behaviour. In this case I send v3 resources instead of v2 (it doesn't matter much; the situation is the same in both cases), but the content of the resources is the same. I.e., to eliminate the possibility that resource conversion (v2 to v3) in the new Envoy adds this CPU usage, I tried to "improve" the situation with 1.14.
In my recent experience, translating between API versions during xDS updates has been a significant bottleneck. So from a functional perspective you may be right, but I am skeptical that it is a no-op for perf. Having said that, it does look like there are some potential perf improvements, having looked at the code. This will become actionable if we have a test case to repro, though; otherwise we're kind of shooting in the dark.
After re-checking what is actually in RDS/CDS, I would say the pattern is to add and remove 2-5 clusters and their corresponding routes every 2-10 minutes. There are 2 resources in RDS (egress/ingress), plus path- and domain-based mappings for each cluster (i.e. for each cluster there is a separate domain and a separate mapping within a single shared domain).
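A rough, hypothetical sketch of that churn pattern with go-control-plane (this is not the gist linked later in the thread): every couple of minutes a new generation of clusters appears, each with its own domain plus a path mapping, and a new snapshot generation is produced. All names, domains, counts, and the interval are made up.

```go
// Hypothetical repro sketch of the described cluster/route churn.
package main

import (
	"fmt"
	"time"

	clusterv3 "github.com/envoyproxy/go-control-plane/envoy/config/cluster/v3"
	routev3 "github.com/envoyproxy/go-control-plane/envoy/config/route/v3"
	"github.com/envoyproxy/go-control-plane/pkg/cache/types"
)

// buildGeneration returns the cluster and route resources for one generation.
func buildGeneration(gen int) (clusters, routes []types.Resource) {
	rc := &routev3.RouteConfiguration{Name: "egress"}
	for i := 0; i < 3; i++ { // a few clusters per generation
		name := fmt.Sprintf("svc-%d-%d", gen, i)
		clusters = append(clusters, &clusterv3.Cluster{Name: name})
		rc.VirtualHosts = append(rc.VirtualHosts, &routev3.VirtualHost{
			Name:    name,
			Domains: []string{name + ".internal"}, // separate domain per cluster
			Routes: []*routev3.Route{{
				Match: &routev3.RouteMatch{
					PathSpecifier: &routev3.RouteMatch_Prefix{Prefix: "/" + name},
				},
				Action: &routev3.Route_Route{
					Route: &routev3.RouteAction{
						ClusterSpecifier: &routev3.RouteAction_Cluster{Cluster: name},
					},
				},
			}},
		})
	}
	return clusters, []types.Resource{rc}
}

func main() {
	for gen := 0; ; gen++ {
		clusters, routes := buildGeneration(gen)
		_, _ = clusters, routes
		// ...build a cache snapshot with a new version string and call
		// SetSnapshot on the snapshot cache, as in the earlier sketch...
		time.Sleep(2 * time.Minute)
	}
}
```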
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions. |
May we please have the
Regarding the v3 upgrade, it would seem there's not much that can be done other than to upgrade as soon as possible. Was this called out in the v1.14 release notes or documentation? At a quick glance, I don't see mention of it, and this does seem to be a well-known "issue".
Probably not, doc updates appreciated. It is a "known issue" at this point. And yes, I would upgrade to v3 as soon as you are able. I don't think there is anything else actionable on this issue, unfortunately. @htuch has done a bunch of recent changes to improve v2 -> v3 upgrade perf, fwiw.
Tagging for the #10943 postmortem; I don't think we had clear evidence that performance was problematic at the time of the release, and most discussion has taken place since then, but it's worth keeping in mind for the future. Some of this comes down to the fact that OSS Envoy has no control plane performance regression tests. @oschaaf is helping to coordinate some of the work on data plane performance regression testing; Otto, do you know if adding regression tests for the control plane is on anyone's radar?
I will provide a sample control plane this week.
There's a design doc here: https://docs.google.com/document/d/14Iz8j--Mvb06QFB8RurtYlwmy657YbAVfqDr-jKgtaQ/edit#heading=h.grkfe6onmtgv |
https://gist.github.com/fxposter/8887788a575090601ba6b106e80e3230 — this is the go-control-plane sample that shows the effect. When changing only paths under some wildcard domain, it consumes less CPU; when constantly adding/removing domains in routes, all that metric generation starts to consume much more CPU. This was not the case for 1.12.
And looking at the graphs above, it seems that quite a lot of time is spent in Envoy::Stats::TagProducerImpl::produceTags, etc.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions. |
Description:
While trying to upgrade to the latest version of Envoy, we (maybe) ran into a problem: Envoy started to consume about 1.5x more CPU (from 7-9% to 12-20%) when there is no traffic to this particular Envoy instance. Also, the CPU usage graph now has way more spikes than it had before (previously it looked like a flat line; now it is full of spikes).
We have our own control plane and our bootstrap config looks like this:
We have quite a lot of routes/clusters/ClusterLoadAssignments there, which change quite often; in this particular case the number of changes is not that big (a change every 20-300 seconds), but they are still present. A full config dump after initialization is about 6 MB.
Doing CPU profiling gives me something like this:
We don't use Envoy's health checking / rate limiting / etc.; I am not sure whether their appearance in the CPU profile says anything, but it bothers me a bit. Also, "__ZNSt3__1L20__throw_length_errorEPKc" looks strange.
A full 10-minute profile is here: envoy.1.14.prof.zip