modelindexer: Run active in dedicated goroutine #9318
Conversation
Remove the `i.activeMu` from the modelindexer and shared sync primitives from the indexer structure; instead, run the active indexer in its own goroutine, minimizing the amount of shared locking that needs to happen when `ProcessBatch` is called. To allow (potentially) multiple active indexers to run in parallel, a new internal `bulkItems` channel is introduced, with a variable buffer sized according to how powerful the machine running APM Server is: the bigger the machine, the bigger the channel buffer, and vice versa.

Additionally, the default `FlushBytes` is reduced from `5MB` to `2MB`, and to keep the same total "bufferable" size, the default `MaxRequests` is increased to 25 (25 * 2MB = 50MB). The motivation behind this change is twofold: allowing the modelindexer to perform more background flushes when enough events are being processed by the APM Server, and creating slightly smaller bulk requests to Elasticsearch, since, with the default settings (5MB FlushBytes), the number of documents per bulk request is ~`18-24K`.

Initial micro-benchmarks indicate that the performance improvement is ~42% with default settings, and macro-benchmarks ~27%:

```
benchstat old.txt single-lockless.txt
name                  old time/op    new time/op    delta
NoCompression-8       1.11µs ± 1%    1.15µs ± 4%    +4.41%  (p=0.008 n=5+5)
BestSpeed-8           2.03µs ± 3%    1.88µs ± 1%    -7.14%  (p=0.008 n=5+5)
DefaultCompression-8  4.10µs ± 2%    2.36µs ± 2%   -42.32%  (p=0.008 n=5+5)
BestCompression-8     6.95µs ± 1%    4.21µs ± 1%   -39.48%  (p=0.008 n=5+5)

name                  old alloc/op   new alloc/op   delta
NoCompression-8       2.07kB ± 0%    2.14kB ± 0%    +3.44%  (p=0.016 n=4+5)
BestSpeed-8           2.57kB ± 0%    2.56kB ± 0%    -0.31%  (p=0.008 n=5+5)
DefaultCompression-8  2.55kB ± 0%    2.52kB ± 0%    -0.87%  (p=0.008 n=5+5)
BestCompression-8     2.58kB ± 0%    2.57kB ± 0%    -0.52%  (p=0.008 n=5+5)
```

```
name          old events/sec  new events/sec  delta
AgentAll-512  26.8k ± 0%      34.1k ±11%      +27.11%  (p=0.100 n=3+3)
```

Signed-off-by: Marc Lopez Rubio <[email protected]>
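To make the described structure concrete, here is a minimal, self-contained Go sketch of the pattern — not the actual apm-server implementation; the names (`item`, `bulkItems`, `runActiveIndexer`) and the constants are illustrative assumptions. `ProcessBatch` only sends encoded documents into a buffered channel, and a dedicated goroutine owns the active bulk buffer, flushing when it reaches the size threshold or when the flush interval elapses:

```go
// Minimal sketch of "ProcessBatch sends into a channel, a dedicated goroutine
// owns the active bulk buffer". All names and values here are illustrative.
package main

import (
	"bytes"
	"context"
	"fmt"
	"time"
)

const (
	flushBytes    = 2 << 20     // hypothetical 2MB flush threshold, mirroring the new default
	flushInterval = time.Second // hypothetical idle flush interval
	channelBuffer = 100         // in the real design this would scale with the machine
)

type item []byte // stand-in for an encoded bulk document

type indexer struct {
	bulkItems chan item
}

func newIndexer() *indexer {
	i := &indexer{bulkItems: make(chan item, channelBuffer)}
	go i.runActiveIndexer() // the active indexer lives in its own goroutine
	return i
}

// ProcessBatch never touches the bulk buffer; it only performs channel sends,
// so producers share no mutex with the active indexer.
func (i *indexer) ProcessBatch(ctx context.Context, docs ...item) error {
	for _, d := range docs {
		select {
		case i.bulkItems <- d:
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}

// runActiveIndexer accumulates documents and flushes on size or on idle timeout.
func (i *indexer) runActiveIndexer() {
	var buf bytes.Buffer
	timer := time.NewTimer(flushInterval)
	defer timer.Stop()
	flush := func() {
		if buf.Len() == 0 {
			return
		}
		fmt.Printf("flushing %d bytes\n", buf.Len()) // a real indexer would issue a _bulk request here
		buf.Reset()
	}
	for {
		select {
		case doc, ok := <-i.bulkItems:
			if !ok {
				flush()
				return
			}
			buf.Write(doc)
			if buf.Len() >= flushBytes {
				flush()
			}
		case <-timer.C:
			flush()
			timer.Reset(flushInterval)
		}
	}
}

func main() {
	idx := newIndexer()
	_ = idx.ProcessBatch(context.Background(),
		item(`{"index":{}}`+"\n"), item(`{"foo":"bar"}`+"\n"))
	time.Sleep(2 * flushInterval) // let the idle flush fire in this demo
}
```

The benefit of the pattern is that callers only pay for a channel send; the trade-off, discussed further down in this conversation, is that a larger channel buffer holds more in-flight events and therefore more memory.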
I've included two commits of the implementation. The first one uses a local mutex:

```
$ benchstat old.txt single.txt
name old time/op new time/op delta
ModelIndexer/NoCompression-8 1.11µs ± 1% 1.19µs ±15% +7.61% (p=0.008 n=5+5)
ModelIndexer/BestSpeed-8 2.03µs ± 3% 1.89µs ± 2% -6.91% (p=0.008 n=5+5)
ModelIndexer/DefaultCompression-8 4.10µs ± 2% 2.34µs ± 1% -42.98% (p=0.008 n=5+5)
ModelIndexer/BestCompression-8 6.95µs ± 1% 4.20µs ± 1% -39.58% (p=0.008 n=5+5)
name old alloc/op new alloc/op delta
ModelIndexer/NoCompression-8 2.07kB ± 0% 2.14kB ± 0% +3.30% (p=0.016 n=4+5)
ModelIndexer/BestSpeed-8 2.57kB ± 0% 2.56kB ± 0% -0.35% (p=0.008 n=5+5)
ModelIndexer/DefaultCompression-8 2.55kB ± 0% 2.52kB ± 0% -0.83% (p=0.008 n=5+5)
ModelIndexer/BestCompression-8 2.58kB ± 0% 2.56kB ± 0% -0.62% (p=0.008 n=5+5)
name old allocs/op new allocs/op delta
ModelIndexer/NoCompression-8 20.0 ± 0% 20.0 ± 0% ~ (all equal)
ModelIndexer/BestSpeed-8 24.0 ± 0% 24.0 ± 0% ~ (all equal)
ModelIndexer/DefaultCompression-8 24.0 ± 0% 24.0 ± 0% ~ (all equal)
ModelIndexer/BestCompression-8 24.0 ± 0% 24.0 ± 0% ~ (all equal)
```

The second one (ef72d77) implements the same "algorithm", but uses channels instead of a mutex. It seems to have comparable performance:

```
$ benchstat old.txt single-lockless.txt
name old time/op new time/op delta
ModelIndexer/NoCompression-8 1.11µs ± 1% 1.15µs ± 4% +4.41% (p=0.008 n=5+5)
ModelIndexer/BestSpeed-8 2.03µs ± 3% 1.88µs ± 1% -7.14% (p=0.008 n=5+5)
ModelIndexer/DefaultCompression-8 4.10µs ± 2% 2.36µs ± 2% -42.32% (p=0.008 n=5+5)
ModelIndexer/BestCompression-8 6.95µs ± 1% 4.21µs ± 1% -39.48% (p=0.008 n=5+5)
name old alloc/op new alloc/op delta
ModelIndexer/NoCompression-8 2.07kB ± 0% 2.14kB ± 0% +3.44% (p=0.016 n=4+5)
ModelIndexer/BestSpeed-8 2.57kB ± 0% 2.56kB ± 0% -0.31% (p=0.008 n=5+5)
ModelIndexer/DefaultCompression-8 2.55kB ± 0% 2.52kB ± 0% -0.87% (p=0.008 n=5+5)
ModelIndexer/BestCompression-8 2.58kB ± 0% 2.57kB ± 0% -0.52% (p=0.008 n=5+5)
name old allocs/op new allocs/op delta
ModelIndexer/NoCompression-8 20.0 ± 0% 20.0 ± 0% ~ (all equal)
ModelIndexer/BestSpeed-8 24.0 ± 0% 24.0 ± 0% ~ (all equal)
ModelIndexer/DefaultCompression-8 24.0 ± 0% 24.0 ± 0% ~ (all equal)
ModelIndexer/BestCompression-8 24.0 ± 0% 24.0 ± 0% ~ (all equal)
```

Comparing the two implementations directly:

```
$ benchstat single.txt single-lockless.txt
name old time/op new time/op delta
ModelIndexer/NoCompression-8 1.19µs ±15% 1.15µs ± 4% ~ (p=0.841 n=5+5)
ModelIndexer/BestSpeed-8 1.89µs ± 2% 1.88µs ± 1% ~ (p=0.889 n=5+5)
ModelIndexer/DefaultCompression-8 2.34µs ± 1% 2.36µs ± 2% ~ (p=0.548 n=5+5)
ModelIndexer/BestCompression-8 4.20µs ± 1% 4.21µs ± 1% ~ (p=0.690 n=5+5)
name old alloc/op new alloc/op delta
ModelIndexer/NoCompression-8 2.14kB ± 0% 2.14kB ± 0% ~ (p=0.579 n=5+5)
ModelIndexer/BestSpeed-8 2.56kB ± 0% 2.56kB ± 0% ~ (p=0.667 n=5+5)
ModelIndexer/DefaultCompression-8 2.52kB ± 0% 2.52kB ± 0% ~ (p=0.373 n=5+5)
ModelIndexer/BestCompression-8 2.56kB ± 0% 2.57kB ± 0% ~ (p=0.905 n=5+5)
name old allocs/op new allocs/op delta
ModelIndexer/NoCompression-8 20.0 ± 0% 20.0 ± 0% ~ (all equal)
ModelIndexer/BestSpeed-8 24.0 ± 0% 24.0 ± 0% ~ (all equal)
ModelIndexer/DefaultCompression-8 24.0 ± 0% 24.0 ± 0% ~ (all equal)
ModelIndexer/BestCompression-8 24.0 ± 0% 24.0 ± 0% ~ (all equal)
```
📚 Go benchmark report — diff with the report generated with https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
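For context, diffs like the one above are produced by running the benchmarks several times on each branch and comparing the two output files with benchstat. The benchmark pattern and package path below are assumptions for illustration only:

```
# on the base branch
go test -run=NONE -bench=ModelIndexer -benchmem -count=5 ./model/modelindexer/ > old.txt
# on the PR branch
go test -run=NONE -bench=ModelIndexer -benchmem -count=5 ./model/modelindexer/ > new.txt
benchstat old.txt new.txt
```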
This reverts commit ef72d77.
I think the channel lockless approach has a race condition which causes the active indexer to never be flushed on the idle timer. I haven't managed to track down where/how it happens, so it may be best to leave that approach as a future optimization and update this PR to use the local mutex, which seems to be equal performance-wise.
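For illustration only, a minimal sketch of the local-mutex fallback with a ticker-driven idle flush; none of this is the PR's actual code, and the names and interval are hypothetical:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
	"time"
)

// activeIndexer guards its bulk buffer with a local mutex; a background
// ticker flushes whatever has accumulated while the indexer is idle.
type activeIndexer struct {
	mu  sync.Mutex
	buf bytes.Buffer
}

func (a *activeIndexer) add(doc []byte) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.buf.Write(doc)
}

func (a *activeIndexer) flushLoop(interval time.Duration, done <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			a.mu.Lock()
			if a.buf.Len() > 0 {
				fmt.Printf("idle flush of %d bytes\n", a.buf.Len()) // a real flush would send a _bulk request
				a.buf.Reset()
			}
			a.mu.Unlock()
		case <-done:
			return
		}
	}
}

func main() {
	a := &activeIndexer{}
	done := make(chan struct{})
	go a.flushLoop(100*time.Millisecond, done)
	a.add([]byte(`{"foo":"bar"}` + "\n"))
	time.Sleep(300 * time.Millisecond)
	close(done)
}
```

The ticker keeps the idle flush independent of any timer resets inside a select loop, at the cost of taking the mutex on every add — the contention the channel-based variant was trying to avoid.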
Looking good! This is starting to look a bit like the APM Go Agent's tracer loop :)
LGTM! Might be worth running the benchmarks one more time, after the changes?
@axw benchmarks look good, however I do notice an increase in overall RSS, which could be problematic (and is) for 1GB APM Servers and cause them to run out of memory. Bear in mind that I have done this testing with a slightly smaller ES deployment (6 nodes instead of 12, totaling …).

2GB

```
name old time/op new time/op delta
AgentAll-128 2.18s ± 0% 2.13s ± 0% -2.20% (p=0.100 n=3+3)
name old events/sec new events/sec delta
AgentAll-128 8.28k ± 0% 8.46k ± 0% +2.21% (p=0.100 n=3+3)
name old max_goroutines new max_goroutines delta
AgentAll-128 146 ± 8% 198 ±21% +35.70% (p=0.100 n=3+3)
name old max_heap_alloc new max_heap_alloc delta
AgentAll-128 896M ± 1% 1059M ± 6% +18.24% (p=0.100 n=3+3)
name old max_heap_objects new max_heap_objects delta
AgentAll-128 8.24M ± 1% 9.49M ± 4% +15.17% (p=0.100 n=3+3)
name old max_rss new max_rss delta
AgentAll-128 965M ± 1% 1167M ± 5% +20.98% (p=0.100 n=3+3)
name old alloc/op new alloc/op delta
AgentAll-128 558MB ± 0% 560MB ± 0% +0.37% (p=0.100 n=3+3)
name old allocs/op new allocs/op delta
AgentAll-128 7.93M ± 0% 7.94M ± 0% +0.10% (p=0.100 n=3+3)
```

4GB

```
name old time/op new time/op delta
AgentAll-256 1.34s ± 1% 1.12s ± 0% -16.57% (p=0.100 n=3+3)
name old events/sec new events/sec delta
AgentAll-256 13.4k ± 1% 16.1k ± 0% +19.87% (p=0.100 n=3+3)
name old max_goroutines new max_goroutines delta
AgentAll-256 176 ± 8% 250 ± 4% +41.97% (p=0.100 n=3+3)
name old max_heap_alloc new max_heap_alloc delta
AgentAll-256 914M ± 2% 1096M ± 0% +19.95% (p=0.100 n=3+3)
name old max_heap_objects new max_heap_objects delta
AgentAll-256 8.14M ± 1% 9.81M ± 1% +20.46% (p=0.100 n=3+3)
name old max_rss new max_rss delta
AgentAll-256 987M ± 1% 1195M ± 1% +21.10% (p=0.100 n=3+3)
name old mean_available_indexers new mean_available_indexers delta
AgentAll-256 5.78 ± 1% 19.82 ± 0% +242.65% (p=0.100 n=3+3)
name old alloc/op new alloc/op delta
AgentAll-256 566MB ± 0% 560MB ± 0% -1.06% (p=0.100 n=3+3)
name old allocs/op new allocs/op delta
AgentAll-256 8.02M ± 0% 7.94M ± 0% -0.91% (p=0.100 n=3+3)
```

8GB

```
name old time/op new time/op delta
AgentAll-512 673ms ± 0% 566ms ± 8% -15.94% (p=0.100 n=3+3)
name old events/sec new events/sec delta
AgentAll-512 26.8k ± 0% 32.0k ± 8% +19.26% (p=0.100 n=3+3)
name old max_goroutines new max_goroutines delta
AgentAll-512 284 ± 1% 434 ± 6% +52.93% (p=0.100 n=3+3)
name old max_heap_alloc new max_heap_alloc delta
AgentAll-512 936M ± 1% 1190M ± 2% +27.12% (p=0.100 n=3+3)
name old max_heap_objects new max_heap_objects delta
AgentAll-512 7.86M ± 2% 10.50M ± 2% +33.57% (p=0.100 n=3+3)
name old max_rss new max_rss delta
AgentAll-512 1.04G ± 2% 1.31G ± 2% +25.60% (p=0.100 n=3+3)
name old mean_available_indexers new mean_available_indexers delta
AgentAll-512 2.40 ± 1% 9.59 ±14% +299.17% (p=0.100 n=3+3)
name old alloc/op new alloc/op delta
AgentAll-512 562MB ± 0% 574MB ± 0% +2.25% (p=0.100 n=3+3)
name old allocs/op new allocs/op delta
AgentAll-512 7.98M ± 0% 8.06M ± 0% +0.94% (p=0.100 n=3+3)
```

In terms of CPU usage, the old indexer only used about 50% CPU on an 8GB APM Server, while the new one goes as high as 67%. I think we may need to revisit the semaphore size, particularly for smaller sizes, and perhaps size it according to the amount of memory available to APM Server.
Changes generally look good. The higher memory consumption is concerning; will you address this in more detail in a follow-up, @marclop?
I suspect that the culprit may be the internal model indexer channel size: I think it may be better to reduce that channel size to something much smaller for now and look into dynamic sizing based on memory.
After lowering the channel size, these are the results:

```
$ benchstat -alpha 0.11 sizes/1g-12s-12n-main.txt sizes/1g-6s-6n-final-100bc.txt
name old time/op new time/op delta
AgentAll-64 2.18s ± 0% 2.17s ± 0% -0.76% (p=0.100 n=3+3)
name old events/sec new events/sec delta
AgentAll-64 8.26k ± 0% 8.32k ± 0% +0.73% (p=0.100 n=3+3)
name old max_goroutines new max_goroutines delta
AgentAll-64 155 ± 6% 174 ± 2% +12.72% (p=0.100 n=3+3)
name old max_heap_alloc new max_heap_alloc delta
AgentAll-64 860M ± 0% 918M ± 1% +6.77% (p=0.100 n=3+3)
name old max_heap_objects new max_heap_objects delta
AgentAll-64 7.73M ± 2% 8.45M ± 1% +9.29% (p=0.100 n=3+3)
name old max_rss new max_rss delta
AgentAll-64 928M ± 1% 1011M ± 1% +8.96% (p=0.100 n=3+3)
name old mean_available_indexers new mean_available_indexers delta
AgentAll-64 6.98 ± 1% 22.11 ± 1% +216.76% (p=0.100 n=3+3)
name old alloc/op new alloc/op delta
AgentAll-64 559MB ± 0% 559MB ± 0% ~ (p=0.700 n=3+3)
name old allocs/op new allocs/op delta
AgentAll-64 7.93M ± 0% 7.93M ± 0% -0.03% (p=0.100 n=3+3)
```

I think we should merge this PR, and I'll work on a follow-up PR to resize both the channel and the semaphore size based on the memory available to APM Server.
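Purely as an illustration of that follow-up idea (not its implementation), sizing could scale the channel buffer and semaphore with the memory available to APM Server; the scaling factors and bounds below are invented for the example:

```go
package main

import "fmt"

// scaledSizes returns hypothetical channel-buffer and semaphore sizes that grow
// with the memory (in GB) available to APM Server, clamped to arbitrary bounds.
func scaledSizes(memGB float64) (chanBuffer, maxRequests int) {
	chanBuffer = int(memGB * 100) // e.g. 100 buffered bulk items per GB
	if chanBuffer < 100 {
		chanBuffer = 100
	}
	if chanBuffer > 2000 {
		chanBuffer = 2000
	}
	maxRequests = int(memGB * 12) // e.g. 12 concurrent bulk requests per GB
	if maxRequests < 10 {
		maxRequests = 10
	}
	if maxRequests > 100 {
		maxRequests = 100
	}
	return chanBuffer, maxRequests
}

func main() {
	for _, gb := range []float64{1, 2, 4, 8} {
		c, r := scaledSizes(gb)
		fmt.Printf("%gGB: channel buffer=%d, max requests=%d\n", gb, c, r)
	}
}
```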
This should be tested as part of #9182.
Motivation/summary
Remove the `i.activeMu` from the modelindexer and shared sync primitives from the indexer structure; instead, run the active indexer in its own goroutine, minimizing the amount of shared locking that needs to happen when `ProcessBatch` is called. To allow (potentially) multiple active indexers to run in parallel, a new internal `bulkItems` channel is introduced, with a variable buffer sized according to how powerful the machine running APM Server is: the bigger the machine, the bigger the channel buffer, and vice versa.
Additionally, the default `FlushBytes` is reduced from `5MB` to `2MB`, and to keep the same total "bufferable" size, the default `MaxRequests` is increased to 25 (25 * 2MB = 50MB). The motivation behind this change is twofold: allowing the modelindexer to perform more background flushes when enough events are being processed by the APM Server, and creating slightly smaller bulk requests to Elasticsearch, since, with the default settings (5MB FlushBytes), the number of documents per bulk request is ~`18-24K`.
Initial micro-benchmarks indicate that the performance improvement is ~42-50% with default settings, and macro-benchmarks ~27%.
Checklist
- [ ] Update package changelog.yml (only if changes to `apmpackage` have been made)
- [ ] Documentation has been updated

How to test these changes
TODO
Related issues
Closes #9180