Autoscale number of modelindexers to increase throughput and ensure full resource usage #9181

Closed
Tracked by #9182
simitt opened this issue Sep 23, 2022 · 2 comments · Fixed by #9393

Comments

@simitt
Contributor

simitt commented Sep 23, 2022

From @marclop's findings:

We should come up with a design that allows high throughput while still communicating payload problems back to the producing agents. Currently, we would still respond with an error if a bulk indexer failed to compress an agent's event, yet it is highly unlikely that the agent or customer is at fault for that. A better strategy would be to log those errors and decouple the time-intensive operations from agent requests, since not doing so slows down the entire pipeline. A PoC with autoscaling of active indexers can be found in: https://github.com/marclop/apm-server/tree/vertical-scaling.

Autoscale the number of modelindexers up and down depending on Elasticsearch and APM agent load.
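
For illustration, here is a minimal Go sketch of the decoupling idea (not taken from apm-server; all names are hypothetical): the agent request path only enqueues events into a buffered channel, while a background active indexer performs the expensive compression and logs failures instead of returning them to the agent.

```go
// Minimal sketch, assuming a buffered channel between the request path and an
// "active indexer" goroutine. Names and structure are illustrative only.
package main

import (
	"bytes"
	"compress/gzip"
	"log"
	"sync"
)

type event struct{ body []byte }

// enqueue is called from the agent request path; it never blocks on
// compression or bulk indexing. A false return signals backpressure.
func enqueue(ch chan<- event, e event) bool {
	select {
	case ch <- e:
		return true
	default:
		return false
	}
}

// activeIndexer drains the channel, compressing events for a bulk request.
// Compression errors are logged rather than propagated back to the producer.
func activeIndexer(ch <-chan event, wg *sync.WaitGroup) {
	defer wg.Done()
	for e := range ch {
		var buf bytes.Buffer
		w := gzip.NewWriter(&buf)
		if _, err := w.Write(e.body); err != nil {
			log.Printf("failed to compress event: %v", err)
			continue
		}
		if err := w.Close(); err != nil {
			log.Printf("failed to compress event: %v", err)
			continue
		}
		// buf.Bytes() would be appended to the in-flight bulk request here.
	}
}

func main() {
	ch := make(chan event, 1024)
	var wg sync.WaitGroup
	wg.Add(1)
	go activeIndexer(ch, &wg)

	enqueue(ch, event{body: []byte(`{"msg":"hello"}`)})
	close(ch)
	wg.Wait()
}
```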

@marclop
Contributor

marclop commented Oct 13, 2022

After merging #9318, we have made a considerable improvement (~+20% more events), but it seems that we could be processing more events if we had more active indexers pulling from the internal model indexer queue. Based on CPU utilization metrics for a 12-hot-node, 58GB RAM Elasticsearch cluster, with the APM Server indices configured with 12 shards, we aren't pushing the underlying Elasticsearch hard enough:

The different distributions are for APM Servers with 1, 2, 4, 8, 15, and 30 gigabytes of RAM, in that order.

[Screenshots: Elasticsearch hot node CPU utilization for each APM Server size]

Looking at the APM Server CPU usage metrics, it also appears that while we use more CPU when it is available (after the change to a dedicated-goroutine active indexer), we still aren't taking advantage of bigger instances with more CPUs:

[Screenshot: APM Server CPU usage]

Looking at these metrics, it may be that scaling the active indexers up to GOMAXPROCS / 3 could increase our event processing rate. Autoscaling could be performed using a mix of these signals:

  • When a certain number of consecutive full flushes occurs, scale up an active indexer (respecting the scale-up cooldown).
  • When a timed flush takes place, scale down an active indexer (respecting the scale-down cooldown).

I think this is a good place to start for autoscaling while keeping it simple.
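
To make those two rules concrete, here is a rough, self-contained Go sketch of such an autoscaler (not the actual apm-server code); the counter threshold and cooldown durations are placeholders:

```go
// Sketch of the scaling rules: scale up after N consecutive full flushes,
// scale down on a timed flush, each gated by its own cooldown, with the
// number of active indexers capped at GOMAXPROCS / 3.
package main

import (
	"fmt"
	"runtime"
	"time"
)

type autoscaler struct {
	active            int // current number of active indexers
	fullFlushes       int // consecutive full-flush counter
	fullFlushesNeeded int // e.g. 5 consecutive full flushes trigger a scale up
	upCooldown        time.Duration
	downCooldown      time.Duration
	lastScaleUp       time.Time
	lastScaleDown     time.Time
}

func (a *autoscaler) maxIndexers() int {
	if m := runtime.GOMAXPROCS(0) / 3; m > 1 {
		return m
	}
	return 1
}

// onFullFlush is called when an indexer flushed because its buffer filled up.
func (a *autoscaler) onFullFlush(now time.Time) {
	a.fullFlushes++
	if a.fullFlushes < a.fullFlushesNeeded {
		return
	}
	if a.active < a.maxIndexers() && now.Sub(a.lastScaleUp) >= a.upCooldown {
		a.active++ // start another active indexer goroutine here
		a.lastScaleUp = now
		a.fullFlushes = 0
	}
}

// onTimedFlush is called when an indexer flushed because the flush interval
// expired, i.e. there wasn't enough load to fill the buffer.
func (a *autoscaler) onTimedFlush(now time.Time) {
	a.fullFlushes = 0
	if a.active > 1 && now.Sub(a.lastScaleDown) >= a.downCooldown {
		a.active-- // stop one active indexer goroutine here
		a.lastScaleDown = now
	}
}

func main() {
	a := &autoscaler{active: 1, fullFlushesNeeded: 5, upCooldown: time.Minute, downCooldown: time.Minute}
	for i := 0; i < 5; i++ {
		a.onFullFlush(time.Now())
	}
	fmt.Println("active indexers:", a.active) // 2, assuming GOMAXPROCS >= 6
}
```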

Afterwards, we could use other metrics to fine-tune how autoscaling behaves:

  • How full the model indexer channel is (len(chan) / cap(chan) = fractional utilization), and look into using that as a pressure indicator.
  • If Elasticsearch has responded to bulk requests with 429s or 409s within a recent time window, do not scale up, and perhaps consider scaling down if the scale-down cooldown permits it.
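
As an illustration of those two signals, the following hypothetical helpers compute channel utilization and gate scale-up decisions on recent 429/409 bulk responses; the names and the 0.9 threshold are assumptions, not taken from the code base:

```go
// Hypothetical helpers: queue pressure plus Elasticsearch pushback as inputs
// to the scale-up decision.
package main

import (
	"fmt"
	"time"
)

// queueUtilization returns len(ch)/cap(ch) as a value in [0, 1].
func queueUtilization(ch chan []byte) float64 {
	if cap(ch) == 0 {
		return 0
	}
	return float64(len(ch)) / float64(cap(ch))
}

// shouldScaleUp combines queue pressure with the time of the last 429/409
// returned by Elasticsearch for a bulk request.
func shouldScaleUp(ch chan []byte, last429or409 time.Time, backoff time.Duration, now time.Time) bool {
	if now.Sub(last429or409) < backoff {
		return false // ES is pushing back; don't add more indexers
	}
	return queueUtilization(ch) > 0.9 // sustained high pressure (illustrative threshold)
}

func main() {
	ch := make(chan []byte, 100)
	ch <- []byte("event") // simulate a queued event
	fmt.Printf("utilization: %.2f\n", queueUtilization(ch))
	fmt.Println("scale up?", shouldScaleUp(ch, time.Time{}, time.Minute, time.Now()))
}
```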

@axw
Member

axw commented Dec 6, 2022

To be tested as part of #9182
