
Load limit on fleet server also limits its own Elastic Agent #1859

Closed
jlind23 opened this issue Sep 13, 2022 · 8 comments · Fixed by #1904
Labels: Project:FleetScaling, Team:Fleet

Comments

@jlind23 (Contributor) commented Sep 13, 2022

Apparently, the connection limit is also applied to Fleet Server's own Elastic Agent. This means that under heavy load, the Fleet Server's Elastic Agent can be prevented from checking in, and as a result Fleet Server is shown as unhealthy.

This issue will act as a placeholder for any further discoveries regarding Fleet Server config.

@jlind23 added the Team:Fleet and Project:FleetScaling labels on Sep 13, 2022
@scunningham

This was noted a while ago: #524

The Fleet Server should be communicating with the Elastic Agent on a dedicated loopback port. Related code here. Is this not happening?
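
For context, a minimal, hypothetical Go sketch of that layout (the addresses, ports, and handler are invented for illustration and are not the fleet-server code): the same API handler is served both on the external bind address and on a dedicated loopback listener that only the co-located Elastic Agent uses.

    package main

    import (
        "log"
        "net"
        "net/http"
    )

    // serveOn starts serving the given handler on addr in the background.
    func serveOn(addr string, h http.Handler) error {
        ln, err := net.Listen("tcp", addr)
        if err != nil {
            return err
        }
        go func() { log.Println(http.Serve(ln, h)) }()
        return nil
    }

    func main() {
        api := http.NewServeMux() // stands in for the Fleet Server API router

        // External listener: remote Elastic Agents connect here and are subject
        // to connection limits sized for the whole fleet.
        if err := serveOn("0.0.0.0:8220", api); err != nil {
            log.Fatal(err)
        }

        // Dedicated loopback listener: only the Fleet Server's own Elastic Agent
        // connects here, so it does not compete with external agents for connection slots.
        if err := serveOn("127.0.0.1:8221", api); err != nil {
            log.Fatal(err)
        }

        select {} // block forever
    }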

@jlind23 (Contributor, Author) commented Sep 13, 2022

Based on recent tests done by @ablnk and an analysis done by @joshdover, the Fleet Server's Elastic Agent is logging 429 errors because Fleet Server limits the number of connections.

@scunningham

I would be curious to know if the Fleet Agent is using the loopback.

@jlind23 (Contributor, Author) commented Sep 13, 2022

Unfortunately I do not think that we log it.

@jlind23 (Contributor, Author) commented Sep 13, 2022

@michel-laterman - @michalpristas is already looking at it. We'll ping you as soon as we progress on this

@michalpristas (Contributor) commented Sep 13, 2022

It looks like the failure is not coming from the max_connection limiter but rather from the checkin limit limiter. The max-connection limiters are correctly separated and created per listener, whereas limits such as the checkin limit are enforced in the handler, and the router is shared between servers (external for normal traffic, internal for the local agent).

In main.go:

    ct := api.NewCheckinT(f.verCon, &cfg.Inputs[0].Server, f.cache, bc, pm, am, ad, tr, bulker)
    router := api.NewRouter(ctx, bulker, ct, et, at, ack, st, sm, tracer, f.bi)

    // A single router (and therefore a single CheckinT with its checkin limiter)
    // is handed to api.Run, which serves it on every configured listener.
    g.Go(loggedRunFunc(ctx, "Http server", func(ctx context.Context) error {
        return api.Run(ctx, router, &cfg.Inputs[0].Server)
    }))

Inside the handler (internal/pkg/api/handleCheckin.go) we have:

    ct := &CheckinT{
        verCon: verCon,
        ...,
        // The checkin limiter lives on the handler, not on the listener.
        limit:  limit.NewLimiter(&cfg.Limits.CheckinLimit),
        bulker: bulker,
    }

Then this is used in internal/pkg/api/server.go:

    for _, addr := range listeners {
        server := http.Server{
            Addr:              addr,
            ...
            Handler:           router, // <<< the same router (and checkin limiter) for every listener
            ...,
        }
    }
We need to find a solution to separate these; some options are cleaner (a separate router/limiter per listener) and some are faster to implement (apply the limit only to external requests).
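
To make the first option concrete, here is a hypothetical sketch using plain net/http and golang.org/x/time/rate in place of the fleet-server types (the path, ports, and limit values are invented for illustration): each listener gets its own handler with its own checkin limiter, so throttling external agents cannot produce 429s for the local agent.

    package main

    import (
        "log"
        "net/http"

        "golang.org/x/time/rate"
    )

    // newCheckinHandler builds a checkin endpoint guarded by its own limiter.
    func newCheckinHandler(lim *rate.Limiter) http.Handler {
        mux := http.NewServeMux()
        mux.HandleFunc("/api/fleet/agents/checkin", func(w http.ResponseWriter, r *http.Request) {
            if !lim.Allow() {
                // This is the kind of 429 the Fleet Server's own agent was seeing.
                http.Error(w, "checkin limit exceeded", http.StatusTooManyRequests)
                return
            }
            w.WriteHeader(http.StatusOK)
        })
        return mux
    }

    func main() {
        // External listener: limiter sized for the whole fleet of remote agents.
        external := newCheckinHandler(rate.NewLimiter(1000, 2000))
        go func() { log.Fatal(http.ListenAndServe("0.0.0.0:8220", external)) }()

        // Internal loopback listener: an effectively unlimited limiter, so the
        // local Elastic Agent is never throttled.
        internal := newCheckinHandler(rate.NewLimiter(rate.Inf, 0))
        log.Fatal(http.ListenAndServe("127.0.0.1:8221", internal))
    }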

@scunningham

Interesting. We are probably not hitting the max-checkin case, but we can certainly hit the throttle that prevents check-ins from coming in too fast.

The cleanest way would be to create a separate router for each interface. Agreed. Adding a limit per external request is going to be messy and hard to reason about.

@michel-laterman (Contributor)

Creating a separate router per listener would require creating separate endpoint handlers (CheckinT, AckT, etc.) just so that a different limiter instance is associated with each route. Instead, I think we should refactor the limiter so it can be used as a route-aware middleware layer on the HTTP server. That way we can simply create separate limiters for each listener, and it would also cleanly split checking a limit from the request handling logic.
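
A hypothetical sketch of that middleware idea, again with plain net/http standing in for the fleet-server router and with invented paths, ports, and limit values: the router stays shared, but each listener wraps it in its own route-aware limiting middleware, so separate limiter instances exist per listener and the limit check sits outside the handler logic.

    package main

    import (
        "log"
        "net/http"
        "strings"

        "golang.org/x/time/rate"
    )

    // limitMiddleware applies per-route-prefix limiters in front of an existing
    // handler; routes with no configured limiter pass through untouched.
    func limitMiddleware(next http.Handler, limits map[string]*rate.Limiter) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            for prefix, lim := range limits {
                if strings.HasPrefix(r.URL.Path, prefix) && !lim.Allow() {
                    http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
                    return
                }
            }
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        // One shared router, as in fleet-server today.
        router := http.NewServeMux()
        router.HandleFunc("/api/fleet/agents/", func(w http.ResponseWriter, r *http.Request) {
            w.WriteHeader(http.StatusOK)
        })

        // External listener: checkin-style routes are throttled.
        externalLimits := map[string]*rate.Limiter{
            "/api/fleet/agents/": rate.NewLimiter(1000, 2000),
        }
        go func() {
            log.Fatal(http.ListenAndServe("0.0.0.0:8220", limitMiddleware(router, externalLimits)))
        }()

        // Internal loopback listener: no limiting middleware, so the local agent is never throttled.
        log.Fatal(http.ListenAndServe("127.0.0.1:8221", router))
    }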
