High CPU consumption in probes #1454

Closed
2opremio opened this issue May 9, 2016 · 5 comments

2opremio commented May 9, 2016

On the three worker machines of the service, the CPU consumption of the 0.15 candidate is through the roof:

[Screenshot: CPU usage graph, 2016-05-09 15:17]

Also, reports are being dropped, probably for the same reason:

<probe> WARN: 2016/05/09 14:22:16.781555 Docker reporter took longer than 1s
<probe> ERRO: 2016/05/09 14:22:23.174719 Dropping report to 10.0.26.13:4040
<probe> WARN: 2016/05/09 14:22:25.812799 Docker reporter took longer than 1s
<probe> ERRO: 2016/05/09 14:22:27.388793 Dropping report to 10.0.26.13:4040
<probe> ERRO: 2016/05/09 14:22:29.333923 Dropping report to 10.0.26.13:4040
<probe> ERRO: 2016/05/09 14:22:32.135056 Dropping report to 10.0.26.13:4040
<probe> ERRO: 2016/05/09 14:22:38.283660 Dropping report to 10.0.26.13:4040
<probe> ERRO: 2016/05/09 14:22:41.262928 Dropping report to 10.0.26.13:4040
<probe> ERRO: 2016/05/09 14:22:44.862768 Dropping report to 10.0.26.13:4040
<probe> ERRO: 2016/05/09 14:22:50.425897 Dropping report to 10.0.26.13:4040
<probe> ERRO: 2016/05/09 14:22:53.383612 Dropping report to 10.0.26.13:4040
<probe> WARN: 2016/05/09 14:22:55.870504 Docker reporter took longer than 1s
<probe> ERRO: 2016/05/09 14:22:57.934171 Dropping report to 10.0.26.13:4040
<probe> ERRO: 2016/05/09 14:22:59.411752 Dropping report to 10.0.26.13:4040
<probe> WARN: 2016/05/09 14:23:01.011165 Docker reporter took longer than 1s
<probe> WARN: 2016/05/09 14:23:02.742918 Docker reporter took longer than 1s
<probe> ERRO: 2016/05/09 14:23:03.517608 Dropping report to 10.0.26.13:4040
<probe> ERRO: 2016/05/09 14:23:06.190370 Dropping report to 10.0.26.13:4040
<probe> ERRO: 2016/05/09 14:23:08.418448 Dropping report to 10.0.26.13:4040
<probe> ERRO: 2016/05/09 14:23:11.232937 Dropping report to 10.0.26.13:4040
<probe> ERRO: 2016/05/09 14:23:14.856745 Dropping report to 10.0.26.13:4040

Resulting in an incomplete visualization of the service:

[Screenshot: Scope topology view, 2016-05-09 15:44]

Note how an app-mapper, a frontend, and the ui-servers are missing.

Profile:
pprof.localhost:4041.samples.cpu.001.pb.gz
[CPU profile graph: probe_cpu]

2opremio added this to the 0.15.0 milestone May 9, 2016

2opremio commented May 9, 2016

There are ~200 containers per machine, of which fewer than 100 are running.

Reducing that number would require adjusting kubelet's garbage-collection arguments (http://kubernetes.io/docs/admin/garbage-collection/), so it seems we should be able to support that number of containers.
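
For reference, a rough sketch of the kubelet container garbage-collection flags described on that page (the values here are only illustrative, not a recommendation):

```
kubelet ... \
  --minimum-container-ttl-duration=1m \
  --maximum-dead-containers-per-container=2 \
  --maximum-dead-containers=100
```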

2opremio commented May 9, 2016

Here's the output of `go tool pprof -focus 'GetNode' -png pprof.localhost\:4041.samples.cpu.001.pb.gz`:

[pprof call graph focused on GetNode]

2opremio commented May 9, 2016

It seems we rebuild the report nodes on every reporter iteration, which, for containers that didn't change, is wasted CPU.

I will try caching the nodes and only regenerating them when they are affected by a Docker event.
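
A minimal sketch of that caching approach (the type and method names here are hypothetical, not Scope's actual code):

```go
// Hypothetical sketch of the caching idea: build a container's report node
// once and reuse it until a Docker event touches that container.
package docker

import "sync"

// Node stands in for Scope's report node type.
type Node map[string]string

type nodeCache struct {
	mtx   sync.Mutex
	nodes map[string]Node // keyed by container ID
}

func newNodeCache() *nodeCache {
	return &nodeCache{nodes: map[string]Node{}}
}

// Get returns the cached node for a container, building it only on a miss.
func (c *nodeCache) Get(containerID string, build func() Node) Node {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	if n, ok := c.nodes[containerID]; ok {
		return n // unchanged container: skip the rebuild
	}
	n := build()
	c.nodes[containerID] = n
	return n
}

// Invalidate drops a container's cached node when a Docker event (start,
// stop, destroy, ...) affects it, so the next report rebuilds it.
func (c *nodeCache) Invalidate(containerID string) {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	delete(c.nodes, containerID)
}
```

The reporter would call Get per container when building a report, and the Docker event loop would call Invalidate for affected containers.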

tomwilkie commented:

The "Dropping report" is a new logging line added this release; it will be dropping reports due to the app being slow, not the probe.

2opremio commented:

The "Dropping report" is a new logging line added this release; it will be dropping reports due to the app being slow, not the probe.

Good to know; then that's #1457.

This was referenced May 10, 2016
2opremio self-assigned this May 11, 2016
2opremio added the performance label May 13, 2016