High CPU consumption in the app #1457

Open
2opremio opened this issue May 10, 2016 · 26 comments

Labels: chore (Related to fix/refinement/improvement of end user or new/existing developer functionality), performance (Excessive resource usage and latency; usually a bug or chore)

Comments

@2opremio
Contributor

On the service, with 4 nodes (probes), the app is using over 100% CPU.

[screenshot: 2016-05-10 at 07:34:46]

Profile: pprof.localhost:4040.samples.cpu.001.pb.gz

[profile graph: app_cpu]

The profile shows 80% CPU (maybe I was seeing garbage collection spikes?)

Perhaps the cache removed in #1447 would help?
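
(For anyone reproducing this: a minimal sketch of how a Go service typically exposes the endpoint such a profile comes from, assuming the standard net/http/pprof handlers. The port matches the filename above, but the wiring here is illustrative, not Scope's actual code.)

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // blank import registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// With the handlers registered, a 30-second CPU profile can be captured with:
	//   go tool pprof http://localhost:4040/debug/pprof/profile
	log.Fatal(http.ListenAndServe("localhost:4040", nil))
}
```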

@2opremio 2opremio added this to the 0.15.0 milestone May 10, 2016
@2opremio
Contributor Author

2opremio commented May 10, 2016

This is not really a degradation compared to what we had before (see #854), but it is not what I would have expected after #1418.

@2opremio
Contributor Author

Also, the CPU pressure from gzip is pretty high (just like in #1454). We could try a more performant compressor library, or reduce the compression level, until we have a solution for #1201.
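
(As a rough illustration of the "reduce the compression level" option, assuming a plain compress/gzip writer; this is a sketch, not Scope's actual report-publishing code.)

```go
package main

import (
	"bytes"
	"compress/gzip"
	"log"
)

// compressReport gzips a serialized report. gzip.BestSpeed (level 1) trades a
// somewhat larger payload for much less CPU than gzip.BestCompression.
func compressReport(raw []byte) ([]byte, error) {
	var buf bytes.Buffer
	w, err := gzip.NewWriterLevel(&buf, gzip.BestSpeed)
	if err != nil {
		return nil, err
	}
	if _, err := w.Write(raw); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	out, err := compressReport([]byte("example report payload"))
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("compressed to %d bytes", len(out))
}
```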

@tomwilkie
Contributor

@2opremio what version was this? Did it contain #1418?

@2opremio
Contributor Author

2opremio commented May 10, 2016

Master; as I mentioned, both #1418 and #1447 are in.

@2opremio
Contributor Author

Reverting 2dae035 makes it even worse

[screenshot: 2016-05-10 at 09:32:15]

@2opremio
Contributor Author

2opremio commented May 10, 2016

I've tried using https://github.com/klauspost/compress/gzip but it doesn't seem to make a big difference. @davkal, have the requests to /api/topology increased in this release?
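
(For context, klauspost/compress/gzip is advertised as API-compatible with the standard library, so trying it is essentially an import swap; hypothetical snippet, not the actual patch.)

```go
package main

import (
	"bytes"
	"log"

	gzip "github.com/klauspost/compress/gzip" // drop-in replacement for compress/gzip
)

func main() {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf) // same API as the standard library writer
	if _, err := w.Write([]byte("example report payload")); err != nil {
		log.Fatal(err)
	}
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
	log.Printf("compressed to %d bytes", buf.Len())
}
```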

@janwillies

We are seeing over 100% CPU usage in the app, with 5 probes connected.

Performance seems to be ok, though

This is with :latest from yesterday

@2opremio
Contributor Author

Performance seems to be ok, though

You mean that the app is usable?

@davkal
Contributor

davkal commented May 10, 2016

have the requests to /api/topology increased in this release?

Yes they have. Whenever you click on the search field, all topologies are fetched. But only then, no recurring timer.

@2opremio
Contributor Author

@davkal I meant without search (this is master)

@davkal
Contributor

davkal commented May 10, 2016

I meant without search (this is master)

Not from the frontend. Just checked git diff v0.14.0..origin/master -- client/app/scripts/utils/web-api-utils.js. No refresh intervals changed.

@janwillies

Yes, the app is usable. We are using m3.large instances.

@2opremio 2opremio mentioned this issue May 10, 2016
@2opremio
Contributor Author

@tomwilkie Could this be a bug in #1418?

@2opremio
Contributor Author

2opremio commented May 10, 2016

I think I know what's happening. We started including container environment variables, which in our case are huge due to Kubernetes services.

More than 700 env variables per container 😮 which leads to well over 100K env variables in total:

[screenshot: 2016-05-10 at 13:11:53]

[screenshot: 2016-05-10 at 13:13:21]

I will try to truncate them and see if it helps.
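
(A minimal sketch of the kind of truncation being considered, with made-up limits for illustration; this is not the actual Scope change.)

```go
package main

import "fmt"

// truncateEnv caps both the number of environment variables reported per
// container and the length of each entry, so huge Kubernetes-injected
// environments don't blow up report size and app CPU.
func truncateEnv(env []string, maxVars, maxLen int) []string {
	if len(env) > maxVars {
		env = env[:maxVars]
	}
	out := make([]string, 0, len(env))
	for _, e := range env {
		if len(e) > maxLen {
			e = e[:maxLen] + "..."
		}
		out = append(out, e)
	}
	return out
}

func main() {
	env := []string{"FOO=bar", "SERVICE_HOST=10.0.0.1", "VERY_LONG=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"}
	fmt.Println(truncateEnv(env, 2, 20))
}
```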

@2opremio
Contributor Author

@janwillies Can you check the number of env variables in your containers?

@janwillies

170 per container, times ~100


@2opremio
Contributor Author

@janwillies Cool. Then truncating them might help.

This was referenced May 10, 2016
@2opremio 2opremio self-assigned this May 11, 2016
@2opremio
Contributor Author

2opremio commented May 12, 2016

@janwillies reports that the app is still consuming 129% CPU with 5 nodes.

@paulbellamy paulbellamy modified the milestones: Pre-1.0, 0.15.0 May 12, 2016
@2opremio 2opremio added the performance label May 13, 2016
@willejs

willejs commented May 25, 2016

I am also seeing similar issues.

@idcrosby

Attaching pprof data from our cluster on AWS, which is seeing the high-CPU issue:
cc @2opremio
pprof.localhost:4040.samples.cpu.001.pb.gz

@2opremio
Contributor Author

This is the profile from @idcrosby; nothing too different from the previous profiles in this issue, except for the futex contention in notesleep.

[profile graph: app]

@2opremio 2opremio modified the milestones: Pre-1.0, July2016 Jun 30, 2016
@rade rade added the bug label Jul 4, 2016
rade added a commit that referenced this issue Jul 28, 2016
Merge() is always returning a copy, so there is no need to Copy()
struct fields first before calling Merge() on them.

This reduces GC pressure (#1010) and helps overall app performance
(#1457).
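
(Roughly, the pattern that commit removes looks like this; the type below is hypothetical, only to illustrate why the extra Copy() is redundant when Merge() already returns a fresh value.)

```go
package main

import "fmt"

// Sets is a hypothetical type, used only to illustrate the commit above; not Scope's real code.
type Sets map[string][]string

// Copy returns a shallow copy of s.
func (s Sets) Copy() Sets {
	out := make(Sets, len(s))
	for k, v := range s {
		out[k] = v
	}
	return out
}

// Merge returns a new Sets holding the union of s and other, leaving both inputs untouched.
func (s Sets) Merge(other Sets) Sets {
	out := s.Copy()
	for k, v := range other {
		out[k] = v
	}
	return out
}

func main() {
	a := Sets{"ips": {"10.0.0.1"}}
	b := Sets{"ports": {"80"}}
	// Before: a.Copy().Merge(b), where the Copy() only adds GC pressure,
	// because Merge() already returns a copy.
	// After: a.Merge(b)
	fmt.Println(a.Merge(b))
}
```
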
@rade rade modified the milestones: July2016, August2016 Aug 4, 2016
@2opremio
Contributor Author

2opremio commented Aug 16, 2016

@janwillies @willejs @idcrosby Scope 0.17 is out and should improve the app's CPU consumption by at least 50%. Please let us know if the performance is acceptable now.

@rade rade added the chore label and removed the bug label Aug 16, 2016
@rade rade modified the milestones: 0.18/1.0, October2016 Sep 15, 2016
@alban
Contributor

alban commented Apr 11, 2017

Also, the CPU pressure from gzip is pretty high (just like in #1454). We could try a more performant compressor library, or reduce the compression level, until we have a solution for #1201.

@2opremio I tried various compression levels with kinvolk-archives@37cc006 and got the following:

| Compression level | Report size (bytes) | Time elapsed (*10) |
| --- | --- | --- |
| 0 (gzip.NoCompression) | 1568604 | 340ms |
| 1 (gzip.BestSpeed) | 114774 | 234ms |
| 2 | 97085 | 281ms |
| 3 | 94018 | 256ms |
| 4 | 91108 | 319ms |
| 5 | 85224 | 366ms |
| 6 (also using gzip.DefaultCompression == -1) | 80634 | 411ms |
| 7 | 79051 | 410ms |
| 8 | 78691 | 566ms |
| 9 (gzip.BestCompression) | 77781 | 618ms |

So if Scope were using gzip.DefaultCompression instead of gzip.BestCompression, the reports would be 4% bigger but compress 33% faster.

I have not checked the decompression speed in the app.
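
(For anyone wanting to reproduce numbers like these against their own report, a rough sketch of the level sweep; the file name and timing approach are assumptions, not the exact script used.)

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io/ioutil"
	"log"
	"time"
)

func main() {
	// An uncompressed, serialized report; the path is illustrative.
	raw, err := ioutil.ReadFile("report.msgpack")
	if err != nil {
		log.Fatal(err)
	}
	// Sweep gzip levels 0 (NoCompression) through 9 (BestCompression),
	// recording compressed size and time spent compressing.
	for level := gzip.NoCompression; level <= gzip.BestCompression; level++ {
		var buf bytes.Buffer
		start := time.Now()
		w, err := gzip.NewWriterLevel(&buf, level)
		if err != nil {
			log.Fatal(err)
		}
		if _, err := w.Write(raw); err != nil {
			log.Fatal(err)
		}
		if err := w.Close(); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("level %d: %d bytes in %v\n", level, buf.Len(), time.Since(start))
	}
}
```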

@2opremio
Contributor Author

Good job. I thought we were using the default compression. Also, it may be worth checking other compression libraries.

@alban
Contributor

alban commented Apr 12, 2017

The list of compression packages is at https://golang.org/pkg/compress/

bzip2 is not suitable because the package only supports decompression, not compression (see golang/go#4828).

File size: 1662592 bytes (uncompressed msgpack)

| test name | level | compressed size | compression time | decompression time |
| --- | --- | --- | --- | --- |
| gzip | -1 | 143052 | 43761 | 8314 |
| gzip | 0 | 1662745 | 1912 | 2494 |
| gzip | 1 | 184074 | 15015 | 8360 |
| gzip | 2 | 170205 | 15406 | 7611 |
| gzip | 3 | 164116 | 22564 | 7013 |
| gzip | 4 | 160214 | 33903 | 7382 |
| gzip | 5 | 153660 | 35916 | 6992 |
| gzip | 6 | 143052 | 46799 | 7414 |
| gzip | 7 | 141689 | 56130 | 6422 |
| gzip | 8 | 138420 | 107412 | 7927 |
| gzip | 9 | 138127 | 170985 | 6594 |
| zlib | -1 | 143040 | 45800 | 7895 |
| zlib | 0 | 1662733 | 3070 | 2974 |
| zlib | 1 | 184062 | 16524 | 11179 |
| zlib | 2 | 170193 | 21051 | 9190 |
| zlib | 3 | 164104 | 19900 | 8685 |
| zlib | 4 | 160202 | 29980 | 9135 |
| zlib | 5 | 153648 | 38100 | 9263 |
| zlib | 6 | 143040 | 43449 | 8175 |
| zlib | 7 | 141677 | 65277 | 7140 |
| zlib | 8 | 138408 | 109087 | 7451 |
| zlib | 9 | 138115 | 177843 | 7175 |
| flate | -1 | 143034 | 48442 | 7317 |
| flate | 0 | 1662727 | 1493 | 1379 |
| flate | 1 | 184056 | 17335 | 8916 |
| flate | 2 | 170187 | 15605 | 7419 |
| flate | 3 | 164098 | 16493 | 7071 |
| flate | 4 | 160196 | 30576 | 8306 |
| flate | 5 | 153642 | 35537 | 7373 |
| flate | 6 | 143034 | 43456 | 8109 |
| flate | 7 | 141671 | 55900 | 6431 |
| flate | 8 | 138402 | 106750 | 6652 |
| flate | 9 | 138109 | 174859 | 6365 |
| lzw | 0 | 551151 | 32472 | 23047 |

I saved a msgpack file (by using kinvolk-archives@37cc006, then used gunzip on the file from /proc/$(pidof scope-probe)/root/tmp/report_0 and then renamed it to report.msgpack). I used https://gist.github.com/alban/4663d28f4cf69db4bb1d6dc365171e2c to produce this table.

Unfortunately, a lower compression level seems to imply a slower decompression. This is important since this bug is about the app.

In this table, flate with level=7 seems best overall for my msgpack file. But this is variable: when running the test another time, gzip with level=6 was the best overall.
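
(And a rough sketch of timing the decompression side for a couple of the stdlib codecs, since that is what matters for the app; illustrative only, the linked gist is the actual script.)

```go
package main

import (
	"bytes"
	"compress/flate"
	"compress/gzip"
	"fmt"
	"io"
	"io/ioutil"
	"log"
	"time"
)

// timeDecompress reads a compressed stream to the end and reports how long it took.
func timeDecompress(name string, r io.Reader) {
	start := time.Now()
	n, err := io.Copy(ioutil.Discard, r)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s: decompressed %d bytes in %v\n", name, n, time.Since(start))
}

func main() {
	raw, err := ioutil.ReadFile("report.msgpack") // illustrative path
	if err != nil {
		log.Fatal(err)
	}

	// gzip at the default level.
	var gz bytes.Buffer
	gw := gzip.NewWriter(&gz)
	gw.Write(raw)
	gw.Close()
	gr, err := gzip.NewReader(bytes.NewReader(gz.Bytes()))
	if err != nil {
		log.Fatal(err)
	}
	timeDecompress("gzip", gr)

	// raw DEFLATE at the default level.
	var fl bytes.Buffer
	fw, _ := flate.NewWriter(&fl, flate.DefaultCompression)
	fw.Write(raw)
	fw.Close()
	timeDecompress("flate", flate.NewReader(bytes.NewReader(fl.Bytes())))
}
```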

alban added a commit to kinvolk-archives/scope that referenced this issue Apr 12, 2017
We want the middle ground between a small compression size, a fast
compression time and a fast decompression time.

Tests suggest that the default compression level is better than the
maximum compression level: although the reports are 4% bigger and
decompress slower, they compress 33% faster.

See discussion on weaveworks#1457 (comment)