modelindexer: disable scale up when 429 > 1% #9463

marclop · 2022-10-28T15:49:08Z

Motivation/summary

Disables scale-up actions when the rate of 429 responses exceeds 1% of total responses. Additionally, when that threshold is breached, scales down while respecting the scale-down parameters.
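The rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `scaleAction`, `decide`, and its parameters are hypothetical names, and the existing scale-up condition and scale-down parameters (for example a cooldown) are reduced to two booleans. Regular idle scale-down is left out.

```go
package main

// scaleAction is a hypothetical type naming the outcome of one evaluation.
type scaleAction int

const (
	noAction scaleAction = iota
	scaleUp
	scaleDown
)

// decide is an illustrative helper, not the modelindexer API.
//
//   - total:       all Elasticsearch responses seen in the evaluation window
//   - tooMany:     how many of those were HTTP 429
//   - wantUp:      the existing scale-up condition is met (assumed input)
//   - downAllowed: the scale-down parameters permit a scale down right now
//     (assumed input)
func decide(total, tooMany int64, wantUp, downAllowed bool) scaleAction {
	// 429 rate > 1% of total, using integer math to avoid float rounding.
	breached := total > 0 && tooMany*100 > total
	if breached {
		// Never scale up while the threshold is breached; scale down
		// only if the scale-down parameters allow it.
		if downAllowed {
			return scaleDown
		}
		return noAction
	}
	if wantUp {
		return scaleUp
	}
	return noAction
}
```

With 1000 responses, 11 of them 429s, `decide` refuses to scale up and scales down when allowed; with exactly 10 (1%, not above it), scale-up proceeds as before.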

Checklist

Update CHANGELOG.asciidoc
~~- [ ] Update package changelog.yml (only if changes to apmpackage have been made)~~
~~- [ ] Documentation has been updated~~

How to test these changes

For example, run benchmarks with an APM Server of 8GB or more against a small Elasticsearch deployment (8GB).

Related issues

Part of #9181

apmmachine · 2022-10-28T16:17:06Z

💚 Build Succeeded

Build stats

Start Time: 2022-11-02T07:45:58.388+0000
Duration: 27 min 28 sec

Test stats 🧪

| Test | Results |
|---|---|
| Failed | 0 |
| Passed | 153 |
| Skipped | 0 |
| Total | 153 |

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
/package : Generate and publish the docker images.
/test windows : Build & tests on Windows.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

apmmachine · 2022-10-28T16:17:10Z

📚 Go benchmark report

Diff with the main branch

name                                                                                              old time/op    new time/op    delta
pkg:github.com/elastic/apm-server/internal/agentcfg goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/beater/request goos:linux goarch:amd64
ContextReset/Remote_Addr_ipv6-12                                                                     771ns ±17%     882ns ±10%  +14.49%  (p=0.032 n=5+5)
ContextReset/Forwarded_ipv4-12                                                                       695ns ±49%     945ns ±16%  +35.90%  (p=0.032 n=5+5)
pkg:github.com/elastic/apm-server/internal/model/modelindexer goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
RUMV3Processor/rum_errors.ndjson-12                                                                 8.00µs ±36%    9.60µs ±10%  +20.02%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/unknown-span-type.ndjson-12             20.8µs ±18%    25.3µs ±19%  +21.62%  (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/optional-timestamps.ndjson-12           3.14µs ± 4%    3.40µs ± 6%   +8.36%  (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/ratelimit.ndjson-12                     10.6µs ± 3%    11.6µs ± 6%   +9.55%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/spans.ndjson-12                         10.8µs ± 1%    10.9µs ± 1%   +0.85%  (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/errors_2.ndjson-12                    6.50µs ± 4%    6.78µs ± 5%   +4.22%  (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-metadata-2.ndjson-12           463ns ± 1%     482ns ± 1%   +4.10%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-metadata.ndjson-12             468ns ± 2%     490ns ± 2%   +4.63%  (p=0.008 n=5+5)
ReadBatch/errors_rum.ndjson-12                                                                      23.4µs ±36%    33.3µs ± 8%  +42.46%  (p=0.008 n=5+5)
ReadBatch/heavy.ndjson-12                                                                           3.66ms ±18%    4.12ms ± 4%  +12.35%  (p=0.032 n=5+4)
ReadBatch/invalid-event.ndjson-12                                                                   34.7µs ±10%    26.4µs ±30%  -24.05%  (p=0.008 n=5+5)
pkg:github.com/elastic/apm-server/internal/publish goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling goos:linux goarch:amd64
TraceGroups-12                                                                                       122ns ± 2%     144ns ± 0%  +18.15%  (p=0.029 n=4+4)
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage goos:linux goarch:amd64

name                                                                                              old alloc/op   new alloc/op   delta
pkg:github.com/elastic/apm-server/internal/agentcfg goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/beater/request goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/model/modelindexer goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/invalid-event-type.ndjson-12            4.14kB ± 1%    4.22kB ± 1%   +1.86%  (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/invalid-metadata-2.ndjson-12            3.19kB ± 2%    3.13kB ± 1%   -2.02%  (p=0.008 n=5+5)
ReadBatch/invalid-event.ndjson-12                                                                   6.71kB ± 1%    6.68kB ± 0%   -0.40%  (p=0.040 n=5+5)
ReadBatch/unknown-span-type.ndjson-12                                                               16.8kB ± 0%    16.8kB ± 0%   +0.08%  (p=0.008 n=5+5)
pkg:github.com/elastic/apm-server/internal/publish goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage goos:linux goarch:amd64

name                                                                                              old allocs/op  new allocs/op  delta
pkg:github.com/elastic/apm-server/internal/agentcfg goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/beater/request goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/model/modelindexer goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
BackendProcessor/heavy.ndjson-12                                                                     22.3k ± 0%     22.3k ± 0%   +0.00%  (p=0.029 n=4+4)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/heavy.ndjson-12                        22.3k ± 0%     22.3k ± 0%   +0.01%  (p=0.029 n=4+4)
pkg:github.com/elastic/apm-server/internal/publish goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage goos:linux goarch:amd64

name                                                                                              old speed      new speed      delta
pkg:github.com/elastic/apm-server/internal/model/modelindexer goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
RUMV3Processor/rum_errors.ndjson-12                                                                125MB/s ±49%   100MB/s ± 9%  -20.13%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/unknown-span-type.ndjson-12            160MB/s ±16%   132MB/s ±17%  -17.63%  (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/optional-timestamps.ndjson-12          327MB/s ± 4%   302MB/s ± 7%   -7.67%  (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/ratelimit.ndjson-12                    399MB/s ± 3%   365MB/s ± 6%   -8.58%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/spans.ndjson-12                        742MB/s ± 1%   736MB/s ± 1%   -0.84%  (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/errors_2.ndjson-12                   725MB/s ± 4%   696MB/s ± 5%   -4.04%  (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-metadata-2.ndjson-12         941MB/s ± 1%   904MB/s ± 1%   -3.94%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-metadata.ndjson-12           953MB/s ± 2%   911MB/s ± 2%   -4.43%  (p=0.008 n=5+5)
ReadBatch/errors_rum.ndjson-12                                                                    87.3MB/s ±46%  57.1MB/s ± 9%  -34.53%  (p=0.008 n=5+5)
ReadBatch/heavy.ndjson-12                                                                          110MB/s ±21%    97MB/s ± 4%  -11.89%  (p=0.032 n=5+4)
ReadBatch/invalid-event.ndjson-12                                                                 22.2MB/s ± 9%  30.1MB/s ±38%  +35.64%  (p=0.008 n=5+5)
ReadBatch/transactions_spans_rum.ndjson-12                                                        67.0MB/s ±13%  50.7MB/s ± 4%  -24.33%  (p=0.016 n=5+4)

report generated with https://pkg.go.dev/golang.org/x/perf/cmd/benchstat

axw

LGTM!

Did you consider scaling down based on any failures, rather than just 429s?

marclop · 2022-11-02T09:26:46Z

@axw I did. However, "other" failures could be due to malformed documents (bad mappings, values, etc.), so I decided not to for the time being.

Other status codes we may consider looking for are 499 (client timeouts?), 502, and 503. However, we can do that in a follow-up PR, since I haven't tested it and we're not collecting those yet.
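As a rough sketch of what that follow-up could look like, the check below treats only 429 as back-pressure by default and optionally adds the codes mentioned above. Everything here is hypothetical: `countsAsBackpressure` and the `includeExtra` flag are illustrative names, and this PR itself only considers 429.

```go
package main

import "net/http"

// statusClientClosedRequest is the non-standard 499 code; net/http has no
// constant for it, so it is defined here for illustration.
const statusClientClosedRequest = 499

// countsAsBackpressure is a hypothetical helper, not part of apm-server.
// It reports whether a response status should count toward the rate that
// pauses scale-up. With includeExtra=false only 429 counts, matching the
// behavior described in this PR.
func countsAsBackpressure(status int, includeExtra bool) bool {
	if status == http.StatusTooManyRequests { // 429
		return true
	}
	if !includeExtra {
		return false
	}
	switch status {
	case statusClientClosedRequest, // 499
		http.StatusBadGateway,         // 502
		http.StatusServiceUnavailable: // 503
		return true
	}
	// Other failures (e.g. 400 from malformed documents) are ignored,
	// since they say nothing about Elasticsearch capacity.
	return false
}
```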

axw · 2022-12-05T14:59:52Z

Verified with 8.6.0-BC5, running on a GCP VM. I pointed it at an ESS cluster's Elasticsearch; scaled up ES and waited for APM Server to scale up the indexers; then scaled down ES and observed APM Server scale down too.

marclop added the enhancement, backport-skip, and v8.6.0 labels Oct 28, 2022

modelindexer: disable scale up when 429 > 1%

abf0a77

Disables the scale up actions when the 429 response rate exceeds 1% of the total response rate. Additionally, scale down respecting the scale down parameters when the rate is breached.

Signed-off-by: Marc Lopez Rubio <[email protected]>

marclop force-pushed the f/do-not-scale-up-if-tooMany-request-rate-over-percentage branch from b6e9004 to abf0a77 Compare October 31, 2022 12:59

marclop marked this pull request as ready for review November 2, 2022 07:45

Merge branch 'main' into f/do-not-scale-up-if-tooMany-request-rate-over-percentage

26b34ec

marclop requested a review from a team November 2, 2022 07:46

axw approved these changes Nov 2, 2022

marclop merged commit 89c17ff into elastic:main Nov 2, 2022

marclop deleted the f/do-not-scale-up-if-tooMany-request-rate-over-percentage branch November 2, 2022 09:27

marclop mentioned this pull request Nov 4, 2022

Explore using more Elasticsearch response codes to pause or stop model indexer autoscaling #9511

Open

marclop added the test-plan label Nov 14, 2022

axw self-assigned this Dec 2, 2022

axw added the test-plan-ok label Dec 5, 2022
