
Load limit on fleet server also limits its own Elastic Agent #1859

Closed
jlind23 opened this issue Sep 13, 2022 · 8 comments · Fixed by #1904
Labels: Project:FleetScaling, Team:Fleet

Comments

@jlind23 (Contributor) commented Sep 13, 2022

Apparently, the connection limit is also applied to Fleet Server's own Elastic Agent. This means that under heavy load, the Fleet Server's Elastic Agent can be prevented from checking in, and as a result Fleet Server is shown as unhealthy.

This issue will act as a placeholder for any further discoveries regarding Fleet Server config.

@jlind23 added the Team:Fleet and Project:FleetScaling labels on Sep 13, 2022
@scunningham

This was noted a while ago: #524

The Fleet Server should be communicating with the Elastic Agent on a dedicated loopback port. Related code here. Is this not happening?
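
For context, a minimal, hypothetical Go sketch of that layout (the addresses, ports, and handler are invented for illustration and are not the fleet-server code): the same API handler is served both on the external bind address and on a dedicated loopback listener that only the co-located Elastic Agent uses.

    package main

    import (
        "log"
        "net"
        "net/http"
    )

    // serveOn starts serving the given handler on addr in the background.
    func serveOn(addr string, h http.Handler) error {
        ln, err := net.Listen("tcp", addr)
        if err != nil {
            return err
        }
        go func() { log.Println(http.Serve(ln, h)) }()
        return nil
    }

    func main() {
        api := http.NewServeMux() // stands in for the Fleet Server API router

        // External listener: remote Elastic Agents connect here and are subject
        // to connection limits sized for the whole fleet.
        if err := serveOn("0.0.0.0:8220", api); err != nil {
            log.Fatal(err)
        }

        // Dedicated loopback listener: only the Fleet Server's own Elastic Agent
        // connects here, so it does not compete with external agents for connection slots.
        if err := serveOn("127.0.0.1:8221", api); err != nil {
            log.Fatal(err)
        }

        select {} // block forever
    }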

@jlind23 (Contributor, Author) commented Sep 13, 2022

Based on recent tests done by @ablnk and an analysis done by @joshdover, the Fleet Server's Elastic Agent is logging 429 errors because Fleet Server limits the number of connections.

@scunningham

I would be curious to know if the Fleet Agent is using the loopback.

@jlind23 (Contributor, Author) commented Sep 13, 2022

Unfortunately I do not think that we log it.

@jlind23 (Contributor, Author) commented Sep 13, 2022

@michel-laterman - @michalpristas is already looking at it. We'll ping you as soon as we progress on this

@michalpristas (Contributor) commented Sep 13, 2022

It looks like the failure is not coming from the max_connection limiter but rather from the checkin limit limiter. The max-connection limiters are correctly separated and created per listener, whereas limits such as the checkin limit are enforced in the handler, and the router is shared between servers (external for normal traffic, internal for the local agent).

In main.go:

    ct := api.NewCheckinT(f.verCon, &cfg.Inputs[0].Server, f.cache, bc, pm, am, ad, tr, bulker)
    router := api.NewRouter(ctx, bulker, ct, et, at, ack, st, sm, tracer, f.bi)

    // A single router (and therefore a single CheckinT with its checkin limiter)
    // is handed to api.Run, which serves it on every configured listener.
    g.Go(loggedRunFunc(ctx, "Http server", func(ctx context.Context) error {
        return api.Run(ctx, router, &cfg.Inputs[0].Server)
    }))

Inside the handler (internal/pkg/api/handleCheckin.go) we have:

    ct := &CheckinT{
        verCon: verCon,
        ...,
        // The checkin limiter lives on the handler, not on the listener.
        limit:  limit.NewLimiter(&cfg.Limits.CheckinLimit),
        bulker: bulker,
    }

Then this is used in internal/pkg/api/server.go:

    for _, addr := range listeners {
        server := http.Server{
            Addr:              addr,
            ...
            Handler:           router, // <<< the same router (and checkin limiter) for every listener
            ...,
        }
    }
We need to find a solution to separate these; some options are cleaner (a separate router/limiter per listener) and some are faster to implement (apply the limit only to external requests).
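
To make the first option concrete, here is a hypothetical sketch using plain net/http and golang.org/x/time/rate in place of the fleet-server types (the path, ports, and limit values are invented for illustration): each listener gets its own handler with its own checkin limiter, so throttling external agents cannot produce 429s for the local agent.

    package main

    import (
        "log"
        "net/http"

        "golang.org/x/time/rate"
    )

    // newCheckinHandler builds a checkin endpoint guarded by its own limiter.
    func newCheckinHandler(lim *rate.Limiter) http.Handler {
        mux := http.NewServeMux()
        mux.HandleFunc("/api/fleet/agents/checkin", func(w http.ResponseWriter, r *http.Request) {
            if !lim.Allow() {
                // This is the kind of 429 the Fleet Server's own agent was seeing.
                http.Error(w, "checkin limit exceeded", http.StatusTooManyRequests)
                return
            }
            w.WriteHeader(http.StatusOK)
        })
        return mux
    }

    func main() {
        // External listener: limiter sized for the whole fleet of remote agents.
        external := newCheckinHandler(rate.NewLimiter(1000, 2000))
        go func() { log.Fatal(http.ListenAndServe("0.0.0.0:8220", external)) }()

        // Internal loopback listener: an effectively unlimited limiter, so the
        // local Elastic Agent is never throttled.
        internal := newCheckinHandler(rate.NewLimiter(rate.Inf, 0))
        log.Fatal(http.ListenAndServe("127.0.0.1:8221", internal))
    }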

@scunningham

Interesting. We are probably not hitting the max-checkin case, but we can certainly hit the throttle that prevents check-ins from coming in too fast.

The cleanest way would be to create a separate router for each interface. Agreed. Adding a limit per external request is going to be messy and hard to reason about.

@michel-laterman (Contributor)

Creating a separate router per listener would require creating separate endpoint handlers (CheckinT, AckT, etc.) just so that a different limiter instance is associated with each route. Instead, I think we should refactor the limiter so it can be used as a route-aware middleware layer on the HTTP server. That way we can simply create separate limiters for each listener, and it would also cleanly split checking a limit from the request handling logic.
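
A hypothetical sketch of that middleware idea, again with plain net/http standing in for the fleet-server router and with invented paths, ports, and limit values: the router stays shared, but each listener wraps it in its own route-aware limiting middleware, so separate limiter instances exist per listener and the limit check sits outside the handler logic.

    package main

    import (
        "log"
        "net/http"
        "strings"

        "golang.org/x/time/rate"
    )

    // limitMiddleware applies per-route-prefix limiters in front of an existing
    // handler; routes with no configured limiter pass through untouched.
    func limitMiddleware(next http.Handler, limits map[string]*rate.Limiter) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            for prefix, lim := range limits {
                if strings.HasPrefix(r.URL.Path, prefix) && !lim.Allow() {
                    http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
                    return
                }
            }
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        // One shared router, as in fleet-server today.
        router := http.NewServeMux()
        router.HandleFunc("/api/fleet/agents/", func(w http.ResponseWriter, r *http.Request) {
            w.WriteHeader(http.StatusOK)
        })

        // External listener: checkin-style routes are throttled.
        externalLimits := map[string]*rate.Limiter{
            "/api/fleet/agents/": rate.NewLimiter(1000, 2000),
        }
        go func() {
            log.Fatal(http.ListenAndServe("0.0.0.0:8220", limitMiddleware(router, externalLimits)))
        }()

        // Internal loopback listener: no limiting middleware, so the local agent is never throttled.
        log.Fatal(http.ListenAndServe("127.0.0.1:8221", router))
    }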
