
Introduce IncreasingSleepingWaitStrategy to reduce CPU consumption #446

Merged

Conversation

@felixbarny (Member) commented Jan 24, 2019

fixes #443

@felixbarny (Member, Author)

Build of the changes: elastic-apm-agent-1.3.1-SNAPSHOT.jar.zip

@felixbarny (Member, Author) commented Jan 24, 2019

Quick benchmark results:

PhasedBackoffWaitStrategy.withLiteLock(1, 10, TimeUnit.MILLISECONDS)

                             Mode     Cnt         Score         Error       Units
benchmarkWithApm           sample  227784         1.526 ±       0.018       us/op
benchmarkWithApm·p0.00     sample                 1.108                     us/op
benchmarkWithApm·p0.50     sample                 1.312                     us/op
benchmarkWithApm·p0.90     sample                 1.508                     us/op
benchmarkWithApm·p0.95     sample                 1.702                     us/op
benchmarkWithApm·p0.99     sample                 4.569                     us/op
benchmarkWithApm·p0.999    sample                24.544                     us/op
benchmarkWithApm·p0.9999   sample                62.866                     us/op
benchmarkWithApm·p1.00     sample               676.864                     us/op
·cpu.time.norm             sample      10      2849.960 ±      87.446       ns/op
·gc.alloc.rate             sample      10         0.362 ±       0.057      MB/sec
·gc.alloc.rate.norm        sample      10         0.792 ±       0.123        B/op
·gc.count                  sample      10           ≈ 0                    counts
·reporter.dropped          sample      10   1108769.300 ±   25750.187    events/s
·reporter.reported         sample      10    342745.200 ±  186436.651    events/s
·server.received.bytes     sample      10  12636857.400 ± 6862866.538     bytes/s
·server.received.payloads  sample      10         1.300 ±       1.020  payloads/s

new IncreasingSleepingWaitStrategy(1_000_000, 250_000_000, 2_500_000)

                             Mode     Cnt        Score         Error       Units
benchmarkWithApm           sample  240602        1.445 ±       0.015       us/op
benchmarkWithApm·p0.00     sample                1.024                     us/op
benchmarkWithApm·p0.50     sample                1.238                     us/op
benchmarkWithApm·p0.90     sample                1.436                     us/op
benchmarkWithApm·p0.95     sample                1.602                     us/op
benchmarkWithApm·p0.99     sample                4.000                     us/op
benchmarkWithApm·p0.999    sample               27.680                     us/op
benchmarkWithApm·p0.9999   sample               58.612                     us/op
benchmarkWithApm·p1.00     sample              268.288                     us/op
·cpu.time.norm             sample      10     2163.690 ±     210.907       ns/op
·gc.alloc.rate             sample      10        0.269 ±       0.075      MB/sec
·gc.alloc.rate.norm        sample      10        0.557 ±       0.144        B/op
·gc.count                  sample      10          ≈ 0                    counts
·reporter.dropped          sample      10  1359964.500 ±   76836.502    events/s
·reporter.reported         sample      10   176468.600 ±  230671.900    events/s
·server.received.bytes     sample      10  6458244.100 ± 8439860.685     bytes/s
·server.received.payloads  sample      10        0.700 ±       1.020  payloads/s

LockFreeSingleConsumerBlockingWaitStrategy (census-instrumentation/opencensus-java#1618)

                             Mode     Cnt         Score         Error       Units
benchmarkWithApm           sample  218223         1.580 ±       0.014       us/op
benchmarkWithApm·p0.00     sample                 1.090                     us/op
benchmarkWithApm·p0.50     sample                 1.408                     us/op
benchmarkWithApm·p0.90     sample                 1.608                     us/op
benchmarkWithApm·p0.95     sample                 1.738                     us/op
benchmarkWithApm·p0.99     sample                 4.360                     us/op
benchmarkWithApm·p0.999    sample                22.617                     us/op
benchmarkWithApm·p0.9999   sample                48.599                     us/op
benchmarkWithApm·p1.00     sample               395.776                     us/op
·cpu.time.norm             sample      10      2980.680 ±     125.519       ns/op
·gc.alloc.rate             sample      10         0.365 ±       0.029      MB/sec
·gc.alloc.rate.norm        sample      10         0.830 ±       0.066        B/op
·gc.count                  sample      10           ≈ 0                    counts
·reporter.dropped          sample      10   1097172.700 ±   20729.662    events/s
·reporter.reported         sample      10    309336.200 ±  124615.513    events/s
·server.received.bytes     sample      10  11444661.000 ± 4585421.760     bytes/s
·server.received.payloads  sample      10         1.100 ±       0.478  payloads/s

CPU usage and latency are best with IncreasingSleepingWaitStrategy, but at the expense of a higher number of dropped events.
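For context on the numbers: the three IncreasingSleepingWaitStrategy arguments above are presumably start, maximum, and increment in nanoseconds, i.e. start sleeping 1 ms when idle, grow by 2.5 ms per idle iteration, and cap at 250 ms. A minimal sketch of that assumed backoff arithmetic (LinearBackoffSketch and its methods are hypothetical names, not the agent's code):

```java
// Hypothetical sketch of a linear sleep backoff; the names here are
// illustrative, not the agent's actual IncreasingSleepingWaitStrategy.
public class LinearBackoffSketch {

    // Next sleep interval after one idle iteration: grow linearly, cap at maxNs.
    static long nextSleepNs(long currentNs, long incrementNs, long maxNs) {
        return Math.min(currentNs + incrementNs, maxNs);
    }

    // Number of idle iterations before the sleep interval reaches the cap.
    static long iterationsToCap(long startNs, long incrementNs, long maxNs) {
        long n = 0;
        for (long sleepNs = startNs; sleepNs < maxNs; sleepNs = nextSleepNs(sleepNs, incrementNs, maxNs)) {
            n++;
        }
        return n;
    }
}
```

Under that reading, an idle consumer reaches the 250 ms cap after 100 idle iterations and then wakes only four times per second, which would explain the low CPU usage, and also why a burst arriving during a long sleep is more likely to overflow the queue and be dropped.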

@felixbarny (Member, Author) commented Jan 24, 2019

Here's another one:

new ExponentionallyIncreasingSleepingWaitStrategy(100_000, 100_000_000)

                             Mode     Cnt         Score         Error       Units
benchmarkWithApm           sample  236977         1.452 ±       0.015       us/op
benchmarkWithApm·p0.00     sample                 1.048                     us/op
benchmarkWithApm·p0.50     sample                 1.278                     us/op
benchmarkWithApm·p0.90     sample                 1.470                     us/op
benchmarkWithApm·p0.95     sample                 1.600                     us/op
benchmarkWithApm·p0.99     sample                 3.560                     us/op
benchmarkWithApm·p0.999    sample                23.137                     us/op
benchmarkWithApm·p0.9999   sample                59.539                     us/op
benchmarkWithApm·p1.00     sample               486.912                     us/op
·cpu.time.norm             sample      10      2718.679 ±     166.314       ns/op
·gc.alloc.rate             sample      10         0.344 ±       0.057      MB/sec
·gc.alloc.rate.norm        sample      10         0.721 ±       0.135        B/op
·gc.count                  sample      10           ≈ 0                    counts
·reporter.dropped          sample      10   1196763.500 ±   33600.786    events/s
·reporter.reported         sample      10    315185.800 ±  135705.593    events/s
·server.received.bytes     sample      10  11614995.500 ± 4984982.371     bytes/s
·server.received.payloads  sample      10         1.200 ±       0.637  payloads/s

The CPU savings of IncreasingSleepingWaitStrategy are probably mostly due to the increased number of dropped events. The exponentially increasing strategy, by contrast, takes almost no hit in the maximum number of reported events.
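The exponentially increasing variant can be sketched in the same spirit. This is a self-contained, illustrative sketch only, assuming a (startNs, maxNs) constructor as in the benchmark above; it is not the agent's ExponentionallyIncreasingSleepingWaitStrategy, and the Disruptor wiring (SequenceBarrier, alert handling) is omitted:

```java
import java.util.concurrent.locks.LockSupport;
import java.util.function.BooleanSupplier;

// Illustrative sketch of an exponentially increasing sleeping wait: park for a
// doubling interval until work is available, so an idle consumer quickly backs
// off to one cheap wakeup per maxSleepNs.
public class ExponentialSleepSketch {

    private final long startSleepNs;
    private final long maxSleepNs;

    public ExponentialSleepSketch(long startSleepNs, long maxSleepNs) {
        this.startSleepNs = startSleepNs;
        this.maxSleepNs = maxSleepNs;
    }

    // Doubles the sleep interval on every idle iteration, capped at maxSleepNs.
    long nextSleepNs(long currentSleepNs) {
        return Math.min(currentSleepNs * 2, maxSleepNs);
    }

    // Polls until work is available, parking between polls with exponential backoff.
    public void waitFor(BooleanSupplier workAvailable) {
        long sleepNs = startSleepNs;
        while (!workAvailable.getAsBoolean()) {
            LockSupport.parkNanos(sleepNs);
            sleepNs = nextSleepNs(sleepNs);
        }
    }
}
```

With the benchmarked values (100 µs start, 100 ms cap), the interval doubles on each idle poll, so a consumer backs off from 0.1 ms to the cap within roughly ten idle iterations while still reacting quickly when events trickle in.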

@felixbarny felixbarny self-assigned this Jan 25, 2019
@felixbarny felixbarny mentioned this pull request Jan 28, 2019
This should still be very light on CPU but reduces the risk of overflowing the queue
@codecov-io

Codecov Report

Merging #446 into master will increase coverage by 6.29%.
The diff coverage is 100%.


@@             Coverage Diff              @@
##             master     #446      +/-   ##
============================================
+ Coverage     63.28%   69.57%   +6.29%     
- Complexity     1336     1346      +10     
============================================
  Files           164      152      -12     
  Lines          6286     5433     -853     
  Branches        715      570     -145     
============================================
- Hits           3978     3780     -198     
+ Misses         2063     1416     -647     
+ Partials        245      237       -8
Impacted Files Coverage Δ Complexity Δ
...co/elastic/apm/agent/report/ApmServerReporter.java 51% <ø> (-8.41%) 12 <0> (-1)
...ExponentionallyIncreasingSleepingWaitStrategy.java 100% <100%> (ø) 6 <6> (?)
...apm/agent/impl/transaction/TraceContextHolder.java 47.82% <0%> (-9.08%) 9% <0%> (ø)
...va/co/elastic/apm/agent/impl/ElasticApmTracer.java 78.76% <0%> (-2.95%) 39% <0%> (ø)
...java/co/elastic/apm/agent/bci/ElasticApmAgent.java 72.22% <0%> (-0.7%) 22% <0%> (+1%)
.../apm/agent/report/serialize/DslJsonSerializer.java 82.37% <0%> (-0.17%) 121% <0%> (ø)
...gent/opentracing/impl/ApmScopeInstrumentation.java
...t/v5_6/ElasticsearchRestClientInstrumentation.java
...t/opentracing/impl/SpanContextInstrumentation.java
... and 16 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1e13465...f9b2f26.

@eyalkoren (Contributor) left a comment

Great solution! Dropping events is what we are going to end up with anyway when there is too much to ingest, so why bother with busy waits and notifications?
A 0.1–10 ms sleep window looks more than low enough, so as long as we get satisfactory performance metrics, let's go with it.

@felixbarny felixbarny merged commit 24424d6 into elastic:master Jan 28, 2019
@felixbarny felixbarny deleted the increasing-sleeping-wait-strategy branch January 28, 2019 15:04
Successfully merging this pull request may close these issues.

Consider more CPU friendly disruptor wait strategies
4 participants