modelindexer: Run active in dedicated goroutine #9318

Merged: 13 commits into elastic:main from f/lockless-modelindexer-process-batch, Oct 12, 2022

Conversation

@marclop (Contributor) commented Oct 10, 2022

Motivation/summary

Remove `i.activeMu` and the shared sync primitives from the modelindexer
structure; instead, run the active indexer in its own goroutine,
minimizing the shared locking that needs to happen when `ProcessBatch`
is called. To allow (potentially) multiple active indexers to run in
parallel, a new internal `bulkItems` channel is introduced, with a
buffer sized according to how powerful the machine running APM Server
is: the bigger the machine, the bigger the channel buffer, and vice
versa.
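For illustration, here is a minimal sketch of the pattern described above. It is not the actual apm-server code: the `bulkItems` name comes from this PR, but the item type, buffer handling, and function names are invented for the example.

```go
package sketch

import (
	"bytes"
	"time"
)

// bulkItem stands in for elasticsearch.BulkIndexerItem.
type bulkItem struct{ doc []byte }

type indexer struct {
	bulkItems  chan bulkItem // buffer sized by machine capacity
	flushBytes int
}

// ProcessBatch no longer takes a shared mutex: it only performs
// channel sends, which block when the buffer is full.
func (i *indexer) ProcessBatch(items []bulkItem) {
	for _, it := range items {
		i.bulkItems <- it
	}
}

// runActiveIndexer is the dedicated goroutine. It is the only writer
// to buf, so no locking is required: it flushes when the buffer
// reaches flushBytes, and periodically when idle traffic has left
// data sitting in the buffer.
func (i *indexer) runActiveIndexer(interval time.Duration, flush func([]byte)) {
	var buf bytes.Buffer
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case it, ok := <-i.bulkItems:
			if !ok { // channel closed: final flush
				flush(buf.Bytes())
				return
			}
			buf.Write(it.doc)
			if buf.Len() >= i.flushBytes {
				flush(buf.Bytes())
				buf.Reset()
			}
		case <-ticker.C:
			if buf.Len() > 0 {
				flush(buf.Bytes())
				buf.Reset()
			}
		}
	}
}
```

Multiple active indexers can then run in parallel simply by starting several of these goroutines against the same channel.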

Additionally, the default `FlushBytes` is reduced from 5MB to 2MB, and
to keep the same total "bufferable" size, the default `MaxRequests` is
increased to 25 (25 * 2MB = 50MB). The motivation behind this change is
twofold: allowing the modelindexer to perform more background flushes
if enough events are being processed by the APM Server, and creating
slightly smaller bulk requests to Elasticsearch, since with the
previous default settings (5MB `FlushBytes`) the number of documents
per bulk request is ~18-24K.
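As a back-of-the-envelope check of the sizing above (constant names are illustrative, not the real config fields; the old `MaxRequests` of 10 is implied by the unchanged 50MB total):

```go
package sketch

const (
	defaultFlushBytes  = 2 * 1024 * 1024 // 2MB per bulk request, down from 5MB
	defaultMaxRequests = 25              // up from 10 (10 * 5MB = 50MB)
)

// The total "bufferable" payload is unchanged:
// 25 * 2MB = 50MB of in-flight bulk request data.
const totalBufferable = defaultMaxRequests * defaultFlushBytes
```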

Initial micro-benchmarks indicate that the performance improvement is
~42-50% with default settings, and macro-benchmarks ~27%:

```
$ benchstat old.txt single-pr.txt
name                               old time/op    new time/op    delta
ModelIndexer/NoCompression-8         1.11µs ± 1%    1.14µs ± 2%   +3.53%  (p=0.008 n=5+5)
ModelIndexer/BestSpeed-8             2.03µs ± 3%    1.70µs ± 1%  -15.90%  (p=0.008 n=5+5)
ModelIndexer/DefaultCompression-8    4.10µs ± 2%    2.36µs ± 1%  -42.53%  (p=0.008 n=5+5)
ModelIndexer/BestCompression-8       6.95µs ± 1%    4.26µs ± 1%  -38.71%  (p=0.008 n=5+5)

name                               old alloc/op   new alloc/op   delta
ModelIndexer/NoCompression-8         2.07kB ± 0%    2.14kB ± 0%   +3.54%  (p=0.016 n=4+5)
ModelIndexer/BestSpeed-8             2.57kB ± 0%    2.56kB ± 0%   -0.33%  (p=0.008 n=5+5)
ModelIndexer/DefaultCompression-8    2.55kB ± 0%    2.53kB ± 0%   -0.79%  (p=0.008 n=5+5)
ModelIndexer/BestCompression-8       2.58kB ± 0%    2.55kB ± 0%   -0.96%  (p=0.008 n=5+5)

name                               old allocs/op  new allocs/op  delta
ModelIndexer/NoCompression-8           20.0 ± 0%      20.0 ± 0%     ~     (all equal)
ModelIndexer/BestSpeed-8               24.0 ± 0%      24.0 ± 0%     ~     (all equal)
ModelIndexer/DefaultCompression-8      24.0 ± 0%      24.0 ± 0%     ~     (all equal)
ModelIndexer/BestCompression-8         24.0 ± 0%      24.0 ± 0%     ~     (all equal)
```

```
name          old events/sec  new events/sec  delta
AgentAll-512  26.8k ± 0%      34.1k ±11%      +27.11%  (p=0.100 n=3+3)
```

Checklist

How to test these changes

TODO

Related issues

Closes #9180

@marclop added the enhancement, backport-skip, and v8.6.0 labels on Oct 10, 2022
@marclop (Contributor, Author) commented Oct 10, 2022

I've included two commits of the implementation. The first one uses a local sync.Mutex in the active goroutine, not dissimilar to how the model indexer used locking previously (451e7f2):

```
$ benchstat old.txt single.txt
name                               old time/op    new time/op    delta
ModelIndexer/NoCompression-8         1.11µs ± 1%    1.19µs ±15%   +7.61%  (p=0.008 n=5+5)
ModelIndexer/BestSpeed-8             2.03µs ± 3%    1.89µs ± 2%   -6.91%  (p=0.008 n=5+5)
ModelIndexer/DefaultCompression-8    4.10µs ± 2%    2.34µs ± 1%  -42.98%  (p=0.008 n=5+5)
ModelIndexer/BestCompression-8       6.95µs ± 1%    4.20µs ± 1%  -39.58%  (p=0.008 n=5+5)

name                               old alloc/op   new alloc/op   delta
ModelIndexer/NoCompression-8         2.07kB ± 0%    2.14kB ± 0%   +3.30%  (p=0.016 n=4+5)
ModelIndexer/BestSpeed-8             2.57kB ± 0%    2.56kB ± 0%   -0.35%  (p=0.008 n=5+5)
ModelIndexer/DefaultCompression-8    2.55kB ± 0%    2.52kB ± 0%   -0.83%  (p=0.008 n=5+5)
ModelIndexer/BestCompression-8       2.58kB ± 0%    2.56kB ± 0%   -0.62%  (p=0.008 n=5+5)

name                               old allocs/op  new allocs/op  delta
ModelIndexer/NoCompression-8           20.0 ± 0%      20.0 ± 0%     ~     (all equal)
ModelIndexer/BestSpeed-8               24.0 ± 0%      24.0 ± 0%     ~     (all equal)
ModelIndexer/DefaultCompression-8      24.0 ± 0%      24.0 ± 0%     ~     (all equal)
ModelIndexer/BestCompression-8         24.0 ± 0%      24.0 ± 0%     ~     (all equal)
```

The second one implements the same "algorithm" but uses channels instead of a mutex. It seems to have comparable performance (ef72d77):

```
$ benchstat old.txt single-lockless.txt
name                               old time/op    new time/op    delta
ModelIndexer/NoCompression-8         1.11µs ± 1%    1.15µs ± 4%   +4.41%  (p=0.008 n=5+5)
ModelIndexer/BestSpeed-8             2.03µs ± 3%    1.88µs ± 1%   -7.14%  (p=0.008 n=5+5)
ModelIndexer/DefaultCompression-8    4.10µs ± 2%    2.36µs ± 2%  -42.32%  (p=0.008 n=5+5)
ModelIndexer/BestCompression-8       6.95µs ± 1%    4.21µs ± 1%  -39.48%  (p=0.008 n=5+5)

name                               old alloc/op   new alloc/op   delta
ModelIndexer/NoCompression-8         2.07kB ± 0%    2.14kB ± 0%   +3.44%  (p=0.016 n=4+5)
ModelIndexer/BestSpeed-8             2.57kB ± 0%    2.56kB ± 0%   -0.31%  (p=0.008 n=5+5)
ModelIndexer/DefaultCompression-8    2.55kB ± 0%    2.52kB ± 0%   -0.87%  (p=0.008 n=5+5)
ModelIndexer/BestCompression-8       2.58kB ± 0%    2.57kB ± 0%   -0.52%  (p=0.008 n=5+5)

name                               old allocs/op  new allocs/op  delta
ModelIndexer/NoCompression-8           20.0 ± 0%      20.0 ± 0%     ~     (all equal)
ModelIndexer/BestSpeed-8               24.0 ± 0%      24.0 ± 0%     ~     (all equal)
ModelIndexer/DefaultCompression-8      24.0 ± 0%      24.0 ± 0%     ~     (all equal)
ModelIndexer/BestCompression-8         24.0 ± 0%      24.0 ± 0%     ~     (all equal)
```

```
$ benchstat single.txt single-lockless.txt
name                               old time/op    new time/op    delta
ModelIndexer/NoCompression-8         1.19µs ±15%    1.15µs ± 4%   ~     (p=0.841 n=5+5)
ModelIndexer/BestSpeed-8             1.89µs ± 2%    1.88µs ± 1%   ~     (p=0.889 n=5+5)
ModelIndexer/DefaultCompression-8    2.34µs ± 1%    2.36µs ± 2%   ~     (p=0.548 n=5+5)
ModelIndexer/BestCompression-8       4.20µs ± 1%    4.21µs ± 1%   ~     (p=0.690 n=5+5)

name                               old alloc/op   new alloc/op   delta
ModelIndexer/NoCompression-8         2.14kB ± 0%    2.14kB ± 0%   ~     (p=0.579 n=5+5)
ModelIndexer/BestSpeed-8             2.56kB ± 0%    2.56kB ± 0%   ~     (p=0.667 n=5+5)
ModelIndexer/DefaultCompression-8    2.52kB ± 0%    2.52kB ± 0%   ~     (p=0.373 n=5+5)
ModelIndexer/BestCompression-8       2.56kB ± 0%    2.57kB ± 0%   ~     (p=0.905 n=5+5)

name                               old allocs/op  new allocs/op  delta
ModelIndexer/NoCompression-8           20.0 ± 0%      20.0 ± 0%   ~     (all equal)
ModelIndexer/BestSpeed-8               24.0 ± 0%      24.0 ± 0%   ~     (all equal)
ModelIndexer/DefaultCompression-8      24.0 ± 0%      24.0 ± 0%   ~     (all equal)
ModelIndexer/BestCompression-8         24.0 ± 0%      24.0 ± 0%   ~     (all equal)
```

@apmmachine (Contributor) commented Oct 10, 2022

💚 Build Succeeded

Build stats

  • Start Time: 2022-10-12T01:42:04.547+0000

  • Duration: 29 min 48 sec

Test stats 🧪

  • Failed: 0

  • Passed: 150

  • Skipped: 0

  • Total: 150

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate and publish the docker images.

  • /test windows : Build & tests on Windows.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@apmmachine (Contributor) commented Oct 10, 2022

📚 Go benchmark report

Diff with the main branch

```
name                                                                                              old time/op    new time/op    delta
pkg:github.com/elastic/apm-server/internal/agentcfg goos:linux goarch:amd64
FetchAndAdd/FetchFromCache-12                                                                       41.4ns ± 1%    46.2ns ± 1%   +11.68%  (p=0.008 n=5+5)
FetchAndAdd/FetchAndAddToCache-12                                                                   96.8ns ± 2%   107.5ns ± 2%   +11.12%  (p=0.008 n=5+5)
pkg:github.com/elastic/apm-server/internal/beater/request goos:linux goarch:amd64
ContextReset/Forwarded_ipv4-12                                                                      1.07µs ±19%    0.82µs ±25%   -23.49%  (p=0.032 n=5+5)
ContextResetContentEncoding/empty-12                                                                 118ns ± 1%     131ns ± 0%   +11.31%  (p=0.008 n=5+5)
ContextResetContentEncoding/uncompressed-12                                                          150ns ± 1%     167ns ± 1%   +11.34%  (p=0.008 n=5+5)
pkg:github.com/elastic/apm-server/internal/model/modelindexer goos:linux goarch:amd64
ModelIndexer/NoCompression-12                                                                       2.01µs ±25%    1.14µs ±23%   -43.30%  (p=0.008 n=5+5)
ModelIndexer/DefaultCompression-12                                                                  6.48µs ±23%    3.14µs ± 4%   -51.61%  (p=0.008 n=5+5)
ModelIndexer/BestCompression-12                                                                     10.9µs ±42%     5.6µs ± 5%   -48.38%  (p=0.008 n=5+5)
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
BackendProcessor/invalid-metadata-2.ndjson-12                                                       8.04µs ± 4%    6.62µs ±10%   -17.71%  (p=0.008 n=5+5)
BackendProcessor/invalid-metadata.ndjson-12                                                         7.87µs ± 6%    6.76µs ±31%   -14.08%  (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/invalid-json-event.ndjson-12            3.82µs ± 3%    4.12µs ± 5%    +7.92%  (p=0.016 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/invalid-metadata-2.ndjson-12            1.61µs ± 3%    1.69µs ± 3%    +4.71%  (p=0.048 n=5+4)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/metadata-null-values.ndjson-12          2.65µs ± 2%    2.75µs ± 3%    +3.79%  (p=0.016 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/metadata.ndjson-12                      4.47µs ± 3%    4.91µs ± 7%    +9.73%  (p=0.024 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/minimal.ndjson-12                       4.09µs ± 9%    4.49µs ± 5%    +9.67%  (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/unknown-span-type.ndjson-12             15.0µs ± 6%    13.7µs ± 2%    -8.79%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/errors_rum.ndjson-12                    2.39µs ± 1%    2.42µs ± 1%    +1.23%  (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions_spans.ndjson-12            10.7µs ± 1%    11.1µs ± 2%    +3.46%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions_spans_rum.ndjson-12        1.97µs ± 1%    2.02µs ± 1%    +2.56%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/unknown-span-type.ndjson-12             7.20µs ± 1%    7.30µs ± 1%    +1.37%  (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/errors.ndjson-12                      6.87µs ± 1%    7.38µs ± 1%    +7.44%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/events.ndjson-12                      13.0µs ± 1%    13.1µs ± 0%    +1.17%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-json-metadata.ndjson-12       1.86µs ± 1%    1.88µs ± 1%    +1.40%  (p=0.024 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-metadata-2.ndjson-12           468ns ± 2%     489ns ± 1%    +4.36%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/transactions_spans_rum.ndjson-12      1.57µs ± 1%    1.59µs ± 1%    +1.15%  (p=0.016 n=5+5)
ReadBatch/metadata.ndjson-12                                                                        19.9µs ± 9%    14.9µs ±27%   -25.21%  (p=0.032 n=5+5)
pkg:github.com/elastic/apm-server/internal/publish goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics goos:linux goarch:amd64
AggregateSpan-12                                                                                    1.07µs ±25%    1.38µs ± 6%   +28.45%  (p=0.016 n=5+5)
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage goos:linux goarch:amd64
WriteTransaction/json_codec-12                                                                      4.16µs ±19%   10.45µs ±59%  +150.99%  (p=0.008 n=5+5)
WriteTransaction/json_codec_big_tx-12                                                               4.83µs ± 3%   11.29µs ±76%  +134.08%  (p=0.008 n=5+5)
ReadEvents/json_codec/0_events-12                                                                    314ns ± 8%     354ns ± 3%   +12.67%  (p=0.008 n=5+5)
ReadEvents/json_codec_big_tx/0_events-12                                                             307ns ± 4%     337ns ± 5%   +10.03%  (p=0.008 n=5+5)
ReadEvents/nop_codec/0_events-12                                                                     314ns ± 2%     348ns ±10%   +10.95%  (p=0.008 n=5+5)
ReadEvents/nop_codec_big_tx/0_events-12                                                              307ns ± 6%     337ns ± 3%    +9.63%  (p=0.016 n=5+5)
IsTraceSampled/sampled-12                                                                           66.9ns ± 2%    79.8ns ± 5%   +19.30%  (p=0.008 n=5+5)
IsTraceSampled/unsampled-12                                                                         69.2ns ± 1%    76.5ns ± 2%   +10.66%  (p=0.008 n=5+5)
IsTraceSampled/unknown-12                                                                            391ns ± 3%     427ns ± 3%    +9.06%  (p=0.008 n=5+5)

name                                                                                              old alloc/op   new alloc/op   delta
pkg:github.com/elastic/apm-server/internal/agentcfg goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/beater/request goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/model/modelindexer goos:linux goarch:amd64
ModelIndexer/NoCompression-12                                                                       3.31kB ± 2%    2.98kB ± 0%    -9.79%  (p=0.008 n=5+5)
ModelIndexer/DefaultCompression-12                                                                  2.65kB ± 2%    2.59kB ± 1%    -2.26%  (p=0.048 n=5+5)
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
BackendProcessor/errors_rum.ndjson-12                                                               7.90kB ± 2%    7.72kB ± 1%    -2.24%  (p=0.032 n=5+5)
BackendProcessor/transactions.ndjson-12                                                             26.6kB ± 1%    26.2kB ± 1%    -1.28%  (p=0.008 n=5+5)
BackendProcessor/transactions_spans_rum_2.ndjson-12                                                 5.55kB ± 1%    5.45kB ± 2%    -1.78%  (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-metadata.ndjson-12            3.13kB ± 2%    3.09kB ± 1%    -1.40%  (p=0.024 n=5+5)
ReadBatch/errors.ndjson-12                                                                          20.6kB ± 0%    20.6kB ± 0%    -0.08%  (p=0.040 n=5+5)
ReadBatch/invalid-event-type.ndjson-12                                                              2.64kB ± 0%    2.65kB ± 0%    +0.04%  (p=0.016 n=4+5)
ReadBatch/otel-bridge.ndjson-12                                                                     10.0kB ± 0%    10.0kB ± 0%    +0.08%  (p=0.024 n=5+5)
pkg:github.com/elastic/apm-server/internal/publish goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage goos:linux goarch:amd64
WriteTransaction/json_codec_big_tx-12                                                               3.90kB ± 0%    3.90kB ± 0%    +0.03%  (p=0.029 n=4+4)
ReadEvents/json_codec/199_events-12                                                                 1.10MB ± 0%    1.10MB ± 0%    -0.03%  (p=0.040 n=5+5)
ReadEvents/json_codec_big_tx/1000_events-12                                                         4.88MB ± 0%    4.87MB ± 0%    -0.26%  (p=0.032 n=5+5)

name                                                                                              old allocs/op  new allocs/op  delta
pkg:github.com/elastic/apm-server/internal/agentcfg goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/beater/request goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/model/modelindexer goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
BackendProcessor/heavy.ndjson-12                                                                     22.3k ± 0%     22.3k ± 0%    -0.01%  (p=0.032 n=5+5)
pkg:github.com/elastic/apm-server/internal/publish goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage goos:linux goarch:amd64

name                                                                                              old speed      new speed      delta
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
BackendProcessor/invalid-metadata-2.ndjson-12                                                     54.2MB/s ± 4%  66.2MB/s ±10%   +22.03%  (p=0.008 n=5+5)
BackendProcessor/invalid-metadata.ndjson-12                                                       56.8MB/s ± 5%  68.3MB/s ±41%   +20.37%  (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/invalid-json-event.ndjson-12           154MB/s ± 3%   143MB/s ± 6%    -7.25%  (p=0.016 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/metadata-null-values.ndjson-12         198MB/s ± 2%   191MB/s ± 3%    -3.64%  (p=0.016 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/metadata.ndjson-12                     278MB/s ± 3%   254MB/s ± 7%    -8.62%  (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/minimal.ndjson-12                      253MB/s ± 9%   230MB/s ± 5%    -8.97%  (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/unknown-span-type.ndjson-12            221MB/s ± 7%   242MB/s ± 2%    +9.47%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/errors_rum.ndjson-12                   795MB/s ± 1%   785MB/s ± 1%    -1.21%  (p=0.016 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions_spans.ndjson-12           545MB/s ± 1%   526MB/s ± 2%    -3.34%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions_spans_rum.ndjson-12       588MB/s ± 1%   573MB/s ± 1%    -2.51%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/unknown-span-type.ndjson-12            459MB/s ± 1%   453MB/s ± 1%    -1.34%  (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/errors.ndjson-12                     924MB/s ± 1%   860MB/s ± 1%    -6.92%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/events.ndjson-12                     574MB/s ± 1%   567MB/s ± 0%    -1.16%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-json-metadata.ndjson-12      240MB/s ± 1%   237MB/s ± 1%    -1.36%  (p=0.024 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-metadata-2.ndjson-12         932MB/s ± 2%   893MB/s ± 1%    -4.19%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/transactions_spans_rum.ndjson-12     735MB/s ± 1%   727MB/s ± 1%    -1.12%  (p=0.016 n=5+5)
ReadBatch/metadata.ndjson-12                                                                      62.6MB/s ±10%  85.0MB/s ±23%   +35.75%  (p=0.032 n=5+5)
```

report generated with https://pkg.go.dev/golang.org/x/perf/cmd/benchstat

@marclop (Contributor, Author) commented Oct 10, 2022

I think the lockless channel approach has a race condition which causes the active indexer to never be flushed on the idle timer. I haven't managed to track down where or how it happens, so it may be best to leave that approach as a future optimization and update this PR to use the local mutex, which seems to be equal performance-wise.
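The actual race was never pinned down, but for context, a well-known time.Timer misuse produces exactly this symptom: an idle flush that never fires because the flush goroutine wedges. The sketch below is purely illustrative and not claimed to be the bug in this PR.

```go
package sketch

import "time"

// flushLoop resets an idle timer whenever an item arrives, using the
// classic Stop-and-drain idiom. The trap: the timer.C case below
// forgets to Reset the timer, so on the next item timer.Stop()
// returns false with an already-drained channel, and the explicit
// drain blocks forever, deadlocking the loop.
func flushLoop(bulkItems <-chan []byte, interval time.Duration, flush func()) {
	timer := time.NewTimer(interval)
	defer timer.Stop()
	for {
		select {
		case <-bulkItems:
			if !timer.Stop() {
				<-timer.C // blocks forever if C was already drained
			}
			timer.Reset(interval)
		case <-timer.C:
			flush()
			// BUG: missing timer.Reset(interval) here; after this
			// branch, the Stop/drain above deadlocks on the next item.
		}
	}
}
```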

@axw (Member) left a comment

Looking good! This is starting to look a bit like the APM Go Agent's tracer loop :)

@marclop marclop requested a review from axw October 11, 2022 04:11
@axw (Member) left a comment

LGTM! Might be worth running the benchmarks one more time, after the changes?

@marclop (Contributor, Author) commented Oct 11, 2022

@axw benchmarks look good; however, I do notice an increase in overall RSS, which is problematic for 1GB APM Servers and can cause them to run out of memory.

Bear in mind that I have done this testing with a slightly smaller ES deployment (6 nodes instead of 12, totaling 116GB rather than 232GB). Even then, we've got a ~20% performance increase in all sizes except 1GB and 2GB, which is impressive.

2GB

```
name          old time/op                  new time/op                  delta
AgentAll-128                   2.18s ± 0%                   2.13s ± 0%    -2.20%  (p=0.100 n=3+3)

name          old events/sec               new events/sec               delta
AgentAll-128                   8.28k ± 0%                   8.46k ± 0%    +2.21%  (p=0.100 n=3+3)

name          old max_goroutines           new max_goroutines           delta
AgentAll-128                     146 ± 8%                     198 ±21%   +35.70%  (p=0.100 n=3+3)

name          old max_heap_alloc           new max_heap_alloc           delta
AgentAll-128                    896M ± 1%                   1059M ± 6%   +18.24%  (p=0.100 n=3+3)

name          old max_heap_objects         new max_heap_objects         delta
AgentAll-128                   8.24M ± 1%                   9.49M ± 4%   +15.17%  (p=0.100 n=3+3)

name          old max_rss                  new max_rss                  delta
AgentAll-128                    965M ± 1%                   1167M ± 5%   +20.98%  (p=0.100 n=3+3)

name          old alloc/op                 new alloc/op                 delta
AgentAll-128                   558MB ± 0%                   560MB ± 0%    +0.37%  (p=0.100 n=3+3)

name          old allocs/op                new allocs/op                delta
AgentAll-128                   7.93M ± 0%                   7.94M ± 0%    +0.10%  (p=0.100 n=3+3)
```

4GB

```
name          old time/op                  new time/op                  delta
AgentAll-256                   1.34s ± 1%                   1.12s ± 0%   -16.57%  (p=0.100 n=3+3)

name          old events/sec               new events/sec               delta
AgentAll-256                   13.4k ± 1%                   16.1k ± 0%   +19.87%  (p=0.100 n=3+3)

name          old max_goroutines           new max_goroutines           delta
AgentAll-256                     176 ± 8%                     250 ± 4%   +41.97%  (p=0.100 n=3+3)

name          old max_heap_alloc           new max_heap_alloc           delta
AgentAll-256                    914M ± 2%                   1096M ± 0%   +19.95%  (p=0.100 n=3+3)

name          old max_heap_objects         new max_heap_objects         delta
AgentAll-256                   8.14M ± 1%                   9.81M ± 1%   +20.46%  (p=0.100 n=3+3)

name          old max_rss                  new max_rss                  delta
AgentAll-256                    987M ± 1%                   1195M ± 1%   +21.10%  (p=0.100 n=3+3)

name          old mean_available_indexers  new mean_available_indexers  delta
AgentAll-256                    5.78 ± 1%                   19.82 ± 0%  +242.65%  (p=0.100 n=3+3)

name          old alloc/op                 new alloc/op                 delta
AgentAll-256                   566MB ± 0%                   560MB ± 0%    -1.06%  (p=0.100 n=3+3)

name          old allocs/op                new allocs/op                delta
AgentAll-256                   8.02M ± 0%                   7.94M ± 0%    -0.91%  (p=0.100 n=3+3)
```

8GB

```
name          old time/op                  new time/op                  delta
AgentAll-512                   673ms ± 0%                   566ms ± 8%   -15.94%  (p=0.100 n=3+3)

name          old events/sec               new events/sec               delta
AgentAll-512                   26.8k ± 0%                   32.0k ± 8%   +19.26%  (p=0.100 n=3+3)

name          old max_goroutines           new max_goroutines           delta
AgentAll-512                     284 ± 1%                     434 ± 6%   +52.93%  (p=0.100 n=3+3)

name          old max_heap_alloc           new max_heap_alloc           delta
AgentAll-512                    936M ± 1%                   1190M ± 2%   +27.12%  (p=0.100 n=3+3)

name          old max_heap_objects         new max_heap_objects         delta
AgentAll-512                   7.86M ± 2%                  10.50M ± 2%   +33.57%  (p=0.100 n=3+3)

name          old max_rss                  new max_rss                  delta
AgentAll-512                   1.04G ± 2%                   1.31G ± 2%   +25.60%  (p=0.100 n=3+3)

name          old mean_available_indexers  new mean_available_indexers  delta
AgentAll-512                    2.40 ± 1%                    9.59 ±14%  +299.17%  (p=0.100 n=3+3)

name          old alloc/op                 new alloc/op                 delta
AgentAll-512                   562MB ± 0%                   574MB ± 0%    +2.25%  (p=0.100 n=3+3)

name          old allocs/op                new allocs/op                delta
AgentAll-512                   7.98M ± 0%                   8.06M ± 0%    +0.94%  (p=0.100 n=3+3)
```

In terms of CPU usage, the old indexer only used about 50% CPU on an 8GB APM Server, while the new one goes as high as 67%.

I think we may need to revisit the semaphore size, particularly for smaller sizes and perhaps size it according to the amount of memory available to APM Server.

@simitt (Contributor) left a comment

Changes generally look good. The higher memory consumption is concerning; will you address this in more detail in a follow-up, @marclop?

@marclop (Contributor, Author) commented Oct 12, 2022

I suspect that the culprit may be the internal model indexer channel size: `bulkItems: make(chan elasticsearch.BulkIndexerItem, 1000)`. In smaller instances that is quite a lot of events to buffer; it would probably be better to dynamically size the channel based on memory rather than always having a buffered channel with capacity for 1000 events, since it may now hold up to 1000 buffered `model.APMEvent`s.

I think it may be better to reduce that channel size to something much smaller for now and look into dynamic sizing based on memory.
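A sketch of the kind of memory-based sizing being floated here; the policy numbers are invented for illustration, and the real heuristic was deferred to a follow-up:

```go
package sketch

// channelSize scales the bulkItems buffer with available memory,
// clamped to a sane range. Hypothetical policy: ~100 buffered
// events per GiB of memory, between 100 and 1000.
func channelSize(memLimitBytes uint64) int {
	n := int(memLimitBytes/(1<<30)) * 100
	if n < 100 {
		n = 100
	}
	if n > 1000 {
		n = 1000
	}
	return n
}
```

The channel would then be created as `bulkItems: make(chan elasticsearch.BulkIndexerItem, channelSize(limit))`.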

@marclop (Contributor, Author) commented Oct 12, 2022

After lowering the channel size to 100, the total memory increase is more contained: ~8% for 1GB instances, and ~16% for the rest.

```
$ benchstat -alpha 0.11 sizes/1g-12s-12n-main.txt sizes/1g-6s-6n-final-100bc.txt
name         old time/op                  new time/op                  delta
AgentAll-64                   2.18s ± 0%                   2.17s ± 0%    -0.76%  (p=0.100 n=3+3)

name         old events/sec               new events/sec               delta
AgentAll-64                   8.26k ± 0%                   8.32k ± 0%    +0.73%  (p=0.100 n=3+3)

name         old max_goroutines           new max_goroutines           delta
AgentAll-64                     155 ± 6%                     174 ± 2%   +12.72%  (p=0.100 n=3+3)

name         old max_heap_alloc           new max_heap_alloc           delta
AgentAll-64                    860M ± 0%                    918M ± 1%    +6.77%  (p=0.100 n=3+3)

name         old max_heap_objects         new max_heap_objects         delta
AgentAll-64                   7.73M ± 2%                   8.45M ± 1%    +9.29%  (p=0.100 n=3+3)

name         old max_rss                  new max_rss                  delta
AgentAll-64                    928M ± 1%                   1011M ± 1%    +8.96%  (p=0.100 n=3+3)

name         old mean_available_indexers  new mean_available_indexers  delta
AgentAll-64                    6.98 ± 1%                   22.11 ± 1%  +216.76%  (p=0.100 n=3+3)

name         old alloc/op                 new alloc/op                 delta
AgentAll-64                   559MB ± 0%                   559MB ± 0%      ~     (p=0.700 n=3+3)

name         old allocs/op                new allocs/op                delta
AgentAll-64                   7.93M ± 0%                   7.93M ± 0%    -0.03%  (p=0.100 n=3+3)
```

I think we should merge this PR, and I'll work on a PR to resize both the channel and the semaphore size based on cgroup (or in its absence, machine) memory limit.
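For reference, reading the cgroup memory limit on Linux looks roughly like the sketch below, covering cgroup v2 with a v1 fallback; the eventual apm-server implementation may differ.

```go
package sketch

import (
	"os"
	"strconv"
	"strings"
)

// cgroupMemLimit returns the container memory limit in bytes, if one
// is set. It checks the cgroup v2 path first, then falls back to v1.
func cgroupMemLimit() (uint64, bool) {
	paths := []string{
		"/sys/fs/cgroup/memory.max",                   // cgroup v2
		"/sys/fs/cgroup/memory/memory.limit_in_bytes", // cgroup v1
	}
	for _, p := range paths {
		b, err := os.ReadFile(p)
		if err != nil {
			continue
		}
		s := strings.TrimSpace(string(b))
		if s == "max" { // v2: no limit configured
			return 0, false
		}
		if n, err := strconv.ParseUint(s, 10, 64); err == nil {
			return n, true
		}
	}
	return 0, false
}
```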

@marclop marclop merged commit 28f259d into elastic:main Oct 12, 2022
@marclop marclop deleted the f/lockless-modelindexer-process-batch branch October 12, 2022 02:16
@axw (Member) commented Nov 24, 2022

This should be tested as part of #9182

Labels: backport-skip, enhancement, v8.6.0

Linked issue: Move event compression outside of the modelindexer activeMu lock (#9180)