modelindexer: Run active in dedicated goroutine #9318
Conversation
Remove the `i.activeMu` from the modelindexer and shared sync primitives from the indexer structure; instead, run the active indexer in its own goroutine, minimizing the amount of shared locking that needs to happen when `ProcessBatch` is called. To allow (potentially) multiple active indexers to run in parallel, a new internal `bulkItems` channel is introduced, with a variable buffer sized according to how powerful the machine running APM Server is: the bigger the machine, the bigger the channel buffer, and vice versa.

Additionally, the default `FlushBytes` is reduced from `5MB` to `2MB`, and to keep the same total "bufferable" size, the default `MaxRequests` is increased to 25 (25 * 2MB = 50MB). The motivation behind this change is twofold: allowing the modelindexer to perform more background flushes when enough events are being processed by the APM Server, and creating slightly smaller bulk requests to Elasticsearch, since, with the default settings (5MB FlushBytes), the number of documents per bulk request is ~`18-24K`.

Initial micro-benchmarks indicate that the performance improvement is ~42% with default settings, and macro-benchmarks ~27%:

```
benchstat old.txt single-lockless.txt
name                  old time/op    new time/op    delta
NoCompression-8       1.11µs ± 1%    1.15µs ± 4%    +4.41%  (p=0.008 n=5+5)
BestSpeed-8           2.03µs ± 3%    1.88µs ± 1%    -7.14%  (p=0.008 n=5+5)
DefaultCompression-8  4.10µs ± 2%    2.36µs ± 2%   -42.32%  (p=0.008 n=5+5)
BestCompression-8     6.95µs ± 1%    4.21µs ± 1%   -39.48%  (p=0.008 n=5+5)

name                  old alloc/op   new alloc/op   delta
NoCompression-8       2.07kB ± 0%    2.14kB ± 0%    +3.44%  (p=0.016 n=4+5)
BestSpeed-8           2.57kB ± 0%    2.56kB ± 0%    -0.31%  (p=0.008 n=5+5)
DefaultCompression-8  2.55kB ± 0%    2.52kB ± 0%    -0.87%  (p=0.008 n=5+5)
BestCompression-8     2.58kB ± 0%    2.57kB ± 0%    -0.52%  (p=0.008 n=5+5)
```

```
name          old events/sec  new events/sec  delta
AgentAll-512  26.8k ± 0%      34.1k ±11%      +27.11%  (p=0.100 n=3+3)
```

Signed-off-by: Marc Lopez Rubio <[email protected]>
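To make the described structure concrete, here is a minimal, self-contained Go sketch of the pattern — not the actual apm-server implementation; the names (`item`, `bulkItems`, `runActiveIndexer`) and the constants are illustrative assumptions. `ProcessBatch` only sends encoded documents into a buffered channel, and a dedicated goroutine owns the active bulk buffer, flushing when it reaches the size threshold or when the flush interval elapses:

```go
// Minimal sketch of "ProcessBatch sends into a channel, a dedicated goroutine
// owns the active bulk buffer". All names and values here are illustrative.
package main

import (
	"bytes"
	"context"
	"fmt"
	"time"
)

const (
	flushBytes    = 2 << 20     // hypothetical 2MB flush threshold, mirroring the new default
	flushInterval = time.Second // hypothetical idle flush interval
	channelBuffer = 100         // in the real design this would scale with the machine
)

type item []byte // stand-in for an encoded bulk document

type indexer struct {
	bulkItems chan item
}

func newIndexer() *indexer {
	i := &indexer{bulkItems: make(chan item, channelBuffer)}
	go i.runActiveIndexer() // the active indexer lives in its own goroutine
	return i
}

// ProcessBatch never touches the bulk buffer; it only performs channel sends,
// so producers share no mutex with the active indexer.
func (i *indexer) ProcessBatch(ctx context.Context, docs ...item) error {
	for _, d := range docs {
		select {
		case i.bulkItems <- d:
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}

// runActiveIndexer accumulates documents and flushes on size or on idle timeout.
func (i *indexer) runActiveIndexer() {
	var buf bytes.Buffer
	timer := time.NewTimer(flushInterval)
	defer timer.Stop()
	flush := func() {
		if buf.Len() == 0 {
			return
		}
		fmt.Printf("flushing %d bytes\n", buf.Len()) // a real indexer would issue a _bulk request here
		buf.Reset()
	}
	for {
		select {
		case doc, ok := <-i.bulkItems:
			if !ok {
				flush()
				return
			}
			buf.Write(doc)
			if buf.Len() >= flushBytes {
				flush()
			}
		case <-timer.C:
			flush()
			timer.Reset(flushInterval)
		}
	}
}

func main() {
	idx := newIndexer()
	_ = idx.ProcessBatch(context.Background(),
		item(`{"index":{}}`+"\n"), item(`{"foo":"bar"}`+"\n"))
	time.Sleep(2 * flushInterval) // let the idle flush fire in this demo
}
```

The benefit of the pattern is that callers only pay for a channel send; the trade-off, discussed further down in this conversation, is that a larger channel buffer holds more in-flight events and therefore more memory.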
I've included two commits of the implementation. The first one uses a local mutex:

```
$ benchstat old.txt single.txt
name old time/op new time/op delta
ModelIndexer/NoCompression-8 1.11µs ± 1% 1.19µs ±15% +7.61% (p=0.008 n=5+5)
ModelIndexer/BestSpeed-8 2.03µs ± 3% 1.89µs ± 2% -6.91% (p=0.008 n=5+5)
ModelIndexer/DefaultCompression-8 4.10µs ± 2% 2.34µs ± 1% -42.98% (p=0.008 n=5+5)
ModelIndexer/BestCompression-8 6.95µs ± 1% 4.20µs ± 1% -39.58% (p=0.008 n=5+5)
name old alloc/op new alloc/op delta
ModelIndexer/NoCompression-8 2.07kB ± 0% 2.14kB ± 0% +3.30% (p=0.016 n=4+5)
ModelIndexer/BestSpeed-8 2.57kB ± 0% 2.56kB ± 0% -0.35% (p=0.008 n=5+5)
ModelIndexer/DefaultCompression-8 2.55kB ± 0% 2.52kB ± 0% -0.83% (p=0.008 n=5+5)
ModelIndexer/BestCompression-8 2.58kB ± 0% 2.56kB ± 0% -0.62% (p=0.008 n=5+5)
name old allocs/op new allocs/op delta
ModelIndexer/NoCompression-8 20.0 ± 0% 20.0 ± 0% ~ (all equal)
ModelIndexer/BestSpeed-8 24.0 ± 0% 24.0 ± 0% ~ (all equal)
ModelIndexer/DefaultCompression-8 24.0 ± 0% 24.0 ± 0% ~ (all equal)
ModelIndexer/BestCompression-8 24.0 ± 0% 24.0 ± 0% ~ (all equal)
```

The second one (ef72d77) implements the same "algorithm", but uses channels instead of a mutex. It seems to have comparable performance:

```
$ benchstat old.txt single-lockless.txt
name old time/op new time/op delta
ModelIndexer/NoCompression-8 1.11µs ± 1% 1.15µs ± 4% +4.41% (p=0.008 n=5+5)
ModelIndexer/BestSpeed-8 2.03µs ± 3% 1.88µs ± 1% -7.14% (p=0.008 n=5+5)
ModelIndexer/DefaultCompression-8 4.10µs ± 2% 2.36µs ± 2% -42.32% (p=0.008 n=5+5)
ModelIndexer/BestCompression-8 6.95µs ± 1% 4.21µs ± 1% -39.48% (p=0.008 n=5+5)
name old alloc/op new alloc/op delta
ModelIndexer/NoCompression-8 2.07kB ± 0% 2.14kB ± 0% +3.44% (p=0.016 n=4+5)
ModelIndexer/BestSpeed-8 2.57kB ± 0% 2.56kB ± 0% -0.31% (p=0.008 n=5+5)
ModelIndexer/DefaultCompression-8 2.55kB ± 0% 2.52kB ± 0% -0.87% (p=0.008 n=5+5)
ModelIndexer/BestCompression-8 2.58kB ± 0% 2.57kB ± 0% -0.52% (p=0.008 n=5+5)
name old allocs/op new allocs/op delta
ModelIndexer/NoCompression-8 20.0 ± 0% 20.0 ± 0% ~ (all equal)
ModelIndexer/BestSpeed-8 24.0 ± 0% 24.0 ± 0% ~ (all equal)
ModelIndexer/DefaultCompression-8 24.0 ± 0% 24.0 ± 0% ~ (all equal)
ModelIndexer/BestCompression-8 24.0 ± 0% 24.0 ± 0% ~ (all equal)
```

Comparing the two implementations directly:

```
$ benchstat single.txt single-lockless.txt
name old time/op new time/op delta
ModelIndexer/NoCompression-8 1.19µs ±15% 1.15µs ± 4% ~ (p=0.841 n=5+5)
ModelIndexer/BestSpeed-8 1.89µs ± 2% 1.88µs ± 1% ~ (p=0.889 n=5+5)
ModelIndexer/DefaultCompression-8 2.34µs ± 1% 2.36µs ± 2% ~ (p=0.548 n=5+5)
ModelIndexer/BestCompression-8 4.20µs ± 1% 4.21µs ± 1% ~ (p=0.690 n=5+5)
name old alloc/op new alloc/op delta
ModelIndexer/NoCompression-8 2.14kB ± 0% 2.14kB ± 0% ~ (p=0.579 n=5+5)
ModelIndexer/BestSpeed-8 2.56kB ± 0% 2.56kB ± 0% ~ (p=0.667 n=5+5)
ModelIndexer/DefaultCompression-8 2.52kB ± 0% 2.52kB ± 0% ~ (p=0.373 n=5+5)
ModelIndexer/BestCompression-8 2.56kB ± 0% 2.57kB ± 0% ~ (p=0.905 n=5+5)
name old allocs/op new allocs/op delta
ModelIndexer/NoCompression-8 20.0 ± 0% 20.0 ± 0% ~ (all equal)
ModelIndexer/BestSpeed-8 24.0 ± 0% 24.0 ± 0% ~ (all equal)
ModelIndexer/DefaultCompression-8 24.0 ± 0% 24.0 ± 0% ~ (all equal)
ModelIndexer/BestCompression-8 24.0 ± 0% 24.0 ± 0% ~ (all equal)
```
📚 Go benchmark report — diff with the report generated with https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
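For context, diffs like the one above are produced by running the benchmarks several times on each branch and comparing the two output files with benchstat. The benchmark pattern and package path below are assumptions for illustration only:

```
# on the base branch
go test -run=NONE -bench=ModelIndexer -benchmem -count=5 ./model/modelindexer/ > old.txt
# on the PR branch
go test -run=NONE -bench=ModelIndexer -benchmem -count=5 ./model/modelindexer/ > new.txt
benchstat old.txt new.txt
```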
This reverts commit ef72d77.
I think the channel lockless approach has a race condition which causes the active indexer to never be flushed on the idle timer. I haven't managed to track down where/how it happens, so it may be best to leave that approach as a future optimization and update this PR to use the local mutex, which seems to be equal performance-wise.
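For illustration only, a minimal sketch of the local-mutex fallback with a ticker-driven idle flush; none of this is the PR's actual code, and the names and interval are hypothetical:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
	"time"
)

// activeIndexer guards its bulk buffer with a local mutex; a background
// ticker flushes whatever has accumulated while the indexer is idle.
type activeIndexer struct {
	mu  sync.Mutex
	buf bytes.Buffer
}

func (a *activeIndexer) add(doc []byte) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.buf.Write(doc)
}

func (a *activeIndexer) flushLoop(interval time.Duration, done <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			a.mu.Lock()
			if a.buf.Len() > 0 {
				fmt.Printf("idle flush of %d bytes\n", a.buf.Len()) // a real flush would send a _bulk request
				a.buf.Reset()
			}
			a.mu.Unlock()
		case <-done:
			return
		}
	}
}

func main() {
	a := &activeIndexer{}
	done := make(chan struct{})
	go a.flushLoop(100*time.Millisecond, done)
	a.add([]byte(`{"foo":"bar"}` + "\n"))
	time.Sleep(300 * time.Millisecond)
	close(done)
}
```

The ticker keeps the idle flush independent of any timer resets inside a select loop, at the cost of taking the mutex on every add — the contention the channel-based variant was trying to avoid.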
Looking good! This is starting to look a bit like the APM Go Agent's tracer loop :)
LGTM! Might be worth running the benchmarks one more time, after the changes?
@axw benchmarks look good, however I do notice an increase in overall RSS, which could be problematic (and is) for 1GB APM Servers and cause them to run out of memory. Bear in mind that I have done this testing with a slightly smaller ES deployment (6 nodes instead of 12, totaling …).

2GB

```
name old time/op new time/op delta
AgentAll-128 2.18s ± 0% 2.13s ± 0% -2.20% (p=0.100 n=3+3)
name old events/sec new events/sec delta
AgentAll-128 8.28k ± 0% 8.46k ± 0% +2.21% (p=0.100 n=3+3)
name old max_goroutines new max_goroutines delta
AgentAll-128 146 ± 8% 198 ±21% +35.70% (p=0.100 n=3+3)
name old max_heap_alloc new max_heap_alloc delta
AgentAll-128 896M ± 1% 1059M ± 6% +18.24% (p=0.100 n=3+3)
name old max_heap_objects new max_heap_objects delta
AgentAll-128 8.24M ± 1% 9.49M ± 4% +15.17% (p=0.100 n=3+3)
name old max_rss new max_rss delta
AgentAll-128 965M ± 1% 1167M ± 5% +20.98% (p=0.100 n=3+3)
name old alloc/op new alloc/op delta
AgentAll-128 558MB ± 0% 560MB ± 0% +0.37% (p=0.100 n=3+3)
name old allocs/op new allocs/op delta
AgentAll-128 7.93M ± 0% 7.94M ± 0% +0.10% (p=0.100 n=3+3)
```

4GB

```
name old time/op new time/op delta
AgentAll-256 1.34s ± 1% 1.12s ± 0% -16.57% (p=0.100 n=3+3)
name old events/sec new events/sec delta
AgentAll-256 13.4k ± 1% 16.1k ± 0% +19.87% (p=0.100 n=3+3)
name old max_goroutines new max_goroutines delta
AgentAll-256 176 ± 8% 250 ± 4% +41.97% (p=0.100 n=3+3)
name old max_heap_alloc new max_heap_alloc delta
AgentAll-256 914M ± 2% 1096M ± 0% +19.95% (p=0.100 n=3+3)
name old max_heap_objects new max_heap_objects delta
AgentAll-256 8.14M ± 1% 9.81M ± 1% +20.46% (p=0.100 n=3+3)
name old max_rss new max_rss delta
AgentAll-256 987M ± 1% 1195M ± 1% +21.10% (p=0.100 n=3+3)
name old mean_available_indexers new mean_available_indexers delta
AgentAll-256 5.78 ± 1% 19.82 ± 0% +242.65% (p=0.100 n=3+3)
name old alloc/op new alloc/op delta
AgentAll-256 566MB ± 0% 560MB ± 0% -1.06% (p=0.100 n=3+3)
name old allocs/op new allocs/op delta
AgentAll-256 8.02M ± 0% 7.94M ± 0% -0.91% (p=0.100 n=3+3)
```

8GB

```
name old time/op new time/op delta
AgentAll-512 673ms ± 0% 566ms ± 8% -15.94% (p=0.100 n=3+3)
name old events/sec new events/sec delta
AgentAll-512 26.8k ± 0% 32.0k ± 8% +19.26% (p=0.100 n=3+3)
name old max_goroutines new max_goroutines delta
AgentAll-512 284 ± 1% 434 ± 6% +52.93% (p=0.100 n=3+3)
name old max_heap_alloc new max_heap_alloc delta
AgentAll-512 936M ± 1% 1190M ± 2% +27.12% (p=0.100 n=3+3)
name old max_heap_objects new max_heap_objects delta
AgentAll-512 7.86M ± 2% 10.50M ± 2% +33.57% (p=0.100 n=3+3)
name old max_rss new max_rss delta
AgentAll-512 1.04G ± 2% 1.31G ± 2% +25.60% (p=0.100 n=3+3)
name old mean_available_indexers new mean_available_indexers delta
AgentAll-512 2.40 ± 1% 9.59 ±14% +299.17% (p=0.100 n=3+3)
name old alloc/op new alloc/op delta
AgentAll-512 562MB ± 0% 574MB ± 0% +2.25% (p=0.100 n=3+3)
name old allocs/op new allocs/op delta
AgentAll-512 7.98M ± 0% 8.06M ± 0% +0.94% (p=0.100 n=3+3)
```

In terms of CPU usage, the old indexer only used about 50% CPU on an 8GB APM Server, while the new one goes as high as 67%. I think we may need to revisit the semaphore size, particularly for smaller sizes, and perhaps size it according to the amount of memory available to APM Server.
Changes generally look good. The higher memory consumption is concerning; will you address this in more detail in a follow-up, @marclop?
I suspect that the culprit may be the internal model indexer channel size: I think it may be better to reduce that channel size to something much smaller for now and look into dynamic sizing based on memory.
After lowering the channel size, these are the results:

```
$ benchstat -alpha 0.11 sizes/1g-12s-12n-main.txt sizes/1g-6s-6n-final-100bc.txt
name old time/op new time/op delta
AgentAll-64 2.18s ± 0% 2.17s ± 0% -0.76% (p=0.100 n=3+3)
name old events/sec new events/sec delta
AgentAll-64 8.26k ± 0% 8.32k ± 0% +0.73% (p=0.100 n=3+3)
name old max_goroutines new max_goroutines delta
AgentAll-64 155 ± 6% 174 ± 2% +12.72% (p=0.100 n=3+3)
name old max_heap_alloc new max_heap_alloc delta
AgentAll-64 860M ± 0% 918M ± 1% +6.77% (p=0.100 n=3+3)
name old max_heap_objects new max_heap_objects delta
AgentAll-64 7.73M ± 2% 8.45M ± 1% +9.29% (p=0.100 n=3+3)
name old max_rss new max_rss delta
AgentAll-64 928M ± 1% 1011M ± 1% +8.96% (p=0.100 n=3+3)
name old mean_available_indexers new mean_available_indexers delta
AgentAll-64 6.98 ± 1% 22.11 ± 1% +216.76% (p=0.100 n=3+3)
name old alloc/op new alloc/op delta
AgentAll-64 559MB ± 0% 559MB ± 0% ~ (p=0.700 n=3+3)
name old allocs/op new allocs/op delta
AgentAll-64 7.93M ± 0% 7.93M ± 0% -0.03% (p=0.100 n=3+3)
```

I think we should merge this PR, and I'll work on a follow-up PR to resize both the channel and the semaphore size based on the memory available to APM Server.
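Purely as an illustration of that follow-up idea (not its implementation), sizing could scale the channel buffer and semaphore with the memory available to APM Server; the scaling factors and bounds below are invented for the example:

```go
package main

import "fmt"

// scaledSizes returns hypothetical channel-buffer and semaphore sizes that grow
// with the memory (in GB) available to APM Server, clamped to arbitrary bounds.
func scaledSizes(memGB float64) (chanBuffer, maxRequests int) {
	chanBuffer = int(memGB * 100) // e.g. 100 buffered bulk items per GB
	if chanBuffer < 100 {
		chanBuffer = 100
	}
	if chanBuffer > 2000 {
		chanBuffer = 2000
	}
	maxRequests = int(memGB * 12) // e.g. 12 concurrent bulk requests per GB
	if maxRequests < 10 {
		maxRequests = 10
	}
	if maxRequests > 100 {
		maxRequests = 100
	}
	return chanBuffer, maxRequests
}

func main() {
	for _, gb := range []float64{1, 2, 4, 8} {
		c, r := scaledSizes(gb)
		fmt.Printf("%gGB: channel buffer=%d, max requests=%d\n", gb, c, r)
	}
}
```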
This should be tested as part of #9182.
Motivation/summary
Remove the `i.activeMu` from the modelindexer and shared sync primitives from the indexer structure; instead, run the active indexer in its own goroutine, minimizing the amount of shared locking that needs to happen when `ProcessBatch` is called. To allow (potentially) multiple active indexers to run in parallel, a new internal `bulkItems` channel is introduced, with a variable buffer sized according to how powerful the machine running APM Server is: the bigger the machine, the bigger the channel buffer, and vice versa.
Additionally, the default `FlushBytes` is reduced from `5MB` to `2MB`, and to keep the same total "bufferable" size, the default `MaxRequests` is increased to 25 (25 * 2MB = 50MB). The motivation behind this change is twofold: allowing the modelindexer to perform more background flushes when enough events are being processed by the APM Server, and creating slightly smaller bulk requests to Elasticsearch, since, with the default settings (5MB FlushBytes), the number of documents per bulk request is ~`18-24K`.
Initial micro-benchmarks indicate that the performance improvement is ~42-50% with default settings, and macro-benchmarks ~27%.
Checklist
- [ ] Update package changelog.yml (only if changes to `apmpackage` have been made)
- [ ] Documentation has been updated

How to test these changes
TODO
Related issues
Closes #9180