Autoscale number of modelindexers to increase throughput and ensure full resource usage #9181

Closed
Tracked by #9182
simitt opened this issue Sep 23, 2022 · 2 comments · Fixed by #9393

Comments

@simitt
Contributor

simitt commented Sep 23, 2022

From @marclop's findings:

We should come up with a design that allows high throughput while still communicating payload problems back to the producing agents. Currently, we would still respond with an error if a bulk indexer failed to compress an agent's event, yet it is highly unlikely that the agent or customer is at fault for that. A better strategy would be to log those errors and decouple the time-intensive operations from agent requests, since not doing so slows down the entire pipeline. A PoC with autoscaling of active indexers can be found in: https://github.com/marclop/apm-server/tree/vertical-scaling.

Autoscale the number of modelindexers up and down depending on Elasticsearch and APM agent load.
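
For illustration, here is a minimal Go sketch of the decoupling idea (not taken from apm-server; all names are hypothetical): the agent request path only enqueues events into a buffered channel, while a background active indexer performs the expensive compression and logs failures instead of returning them to the agent.

```go
// Minimal sketch, assuming a buffered channel between the request path and an
// "active indexer" goroutine. Names and structure are illustrative only.
package main

import (
	"bytes"
	"compress/gzip"
	"log"
	"sync"
)

type event struct{ body []byte }

// enqueue is called from the agent request path; it never blocks on
// compression or bulk indexing. A false return signals backpressure.
func enqueue(ch chan<- event, e event) bool {
	select {
	case ch <- e:
		return true
	default:
		return false
	}
}

// activeIndexer drains the channel, compressing events for a bulk request.
// Compression errors are logged rather than propagated back to the producer.
func activeIndexer(ch <-chan event, wg *sync.WaitGroup) {
	defer wg.Done()
	for e := range ch {
		var buf bytes.Buffer
		w := gzip.NewWriter(&buf)
		if _, err := w.Write(e.body); err != nil {
			log.Printf("failed to compress event: %v", err)
			continue
		}
		if err := w.Close(); err != nil {
			log.Printf("failed to compress event: %v", err)
			continue
		}
		// buf.Bytes() would be appended to the in-flight bulk request here.
	}
}

func main() {
	ch := make(chan event, 1024)
	var wg sync.WaitGroup
	wg.Add(1)
	go activeIndexer(ch, &wg)

	enqueue(ch, event{body: []byte(`{"msg":"hello"}`)})
	close(ch)
	wg.Wait()
}
```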

@marclop
Contributor

marclop commented Oct 13, 2022

After merging #9318, we have made a considerable improvement (~+20% more events), but it seems that we could be processing more events if we had more active indexers pulling from the internal model indexer queue. Based on CPU utilization metrics for a 12-hot-node, 58GB RAM Elasticsearch cluster, with the APM Server indices configured with 12 shards, we aren't pushing the underlying Elasticsearch hard enough:

The different distributions are for APM Servers with 1, 2, 4, 8, 15, and 30 gigabytes of RAM, in that order.

[Screenshots: Elasticsearch hot node CPU utilization for each APM Server size]

Looking at the APM Server CPU usage metrics, it also appears that while we use more CPU when it is available (after the change to a dedicated-goroutine active indexer), we still aren't taking advantage of bigger instances with more CPUs:

[Screenshot: APM Server CPU usage]

Looking at these metrics, it may be that scaling the active indexers up to GOMAXPROCS / 3 could increase our event processing rate. Autoscaling could be performed using a mix of these signals:

  • When a certain number of consecutive full flushes occurs, scale up an active indexer (respecting the scale-up cooldown).
  • When a timed flush takes place, scale down an active indexer (respecting the scale-down cooldown).

I think this is a good place to start for autoscaling while keeping it simple.
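
To make those two rules concrete, here is a rough, self-contained Go sketch of such an autoscaler (not the actual apm-server code); the counter threshold and cooldown durations are placeholders:

```go
// Sketch of the scaling rules: scale up after N consecutive full flushes,
// scale down on a timed flush, each gated by its own cooldown, with the
// number of active indexers capped at GOMAXPROCS / 3.
package main

import (
	"fmt"
	"runtime"
	"time"
)

type autoscaler struct {
	active            int // current number of active indexers
	fullFlushes       int // consecutive full-flush counter
	fullFlushesNeeded int // e.g. 5 consecutive full flushes trigger a scale up
	upCooldown        time.Duration
	downCooldown      time.Duration
	lastScaleUp       time.Time
	lastScaleDown     time.Time
}

func (a *autoscaler) maxIndexers() int {
	if m := runtime.GOMAXPROCS(0) / 3; m > 1 {
		return m
	}
	return 1
}

// onFullFlush is called when an indexer flushed because its buffer filled up.
func (a *autoscaler) onFullFlush(now time.Time) {
	a.fullFlushes++
	if a.fullFlushes < a.fullFlushesNeeded {
		return
	}
	if a.active < a.maxIndexers() && now.Sub(a.lastScaleUp) >= a.upCooldown {
		a.active++ // start another active indexer goroutine here
		a.lastScaleUp = now
		a.fullFlushes = 0
	}
}

// onTimedFlush is called when an indexer flushed because the flush interval
// expired, i.e. there wasn't enough load to fill the buffer.
func (a *autoscaler) onTimedFlush(now time.Time) {
	a.fullFlushes = 0
	if a.active > 1 && now.Sub(a.lastScaleDown) >= a.downCooldown {
		a.active-- // stop one active indexer goroutine here
		a.lastScaleDown = now
	}
}

func main() {
	a := &autoscaler{active: 1, fullFlushesNeeded: 5, upCooldown: time.Minute, downCooldown: time.Minute}
	for i := 0; i < 5; i++ {
		a.onFullFlush(time.Now())
	}
	fmt.Println("active indexers:", a.active) // 2, assuming GOMAXPROCS >= 6
}
```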

Afterwards, we could use other metrics to fine-tune how autoscaling behaves:

  • How full the model indexer channel is (len(chan) / cap(chan) = fractional utilization), and look into using that as a pressure indicator.
  • If Elasticsearch has responded to bulk requests with 429s or 409s within a recent time window, do not scale up, and perhaps consider scaling down if the scale-down cooldown permits it.
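
As an illustration of those two signals, the following hypothetical helpers compute channel utilization and gate scale-up decisions on recent 429/409 bulk responses; the names and the 0.9 threshold are assumptions, not taken from the code base:

```go
// Hypothetical helpers: queue pressure plus Elasticsearch pushback as inputs
// to the scale-up decision.
package main

import (
	"fmt"
	"time"
)

// queueUtilization returns len(ch)/cap(ch) as a value in [0, 1].
func queueUtilization(ch chan []byte) float64 {
	if cap(ch) == 0 {
		return 0
	}
	return float64(len(ch)) / float64(cap(ch))
}

// shouldScaleUp combines queue pressure with the time of the last 429/409
// returned by Elasticsearch for a bulk request.
func shouldScaleUp(ch chan []byte, last429or409 time.Time, backoff time.Duration, now time.Time) bool {
	if now.Sub(last429or409) < backoff {
		return false // ES is pushing back; don't add more indexers
	}
	return queueUtilization(ch) > 0.9 // sustained high pressure (illustrative threshold)
}

func main() {
	ch := make(chan []byte, 100)
	ch <- []byte("event") // simulate a queued event
	fmt.Printf("utilization: %.2f\n", queueUtilization(ch))
	fmt.Println("scale up?", shouldScaleUp(ch, time.Time{}, time.Minute, time.Now()))
}
```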

@axw
Member

axw commented Dec 6, 2022

To be tested as part of #9182
