report output health #3127

Merged: 13 commits merged into elastic:main on Dec 6, 2023

Conversation

juliaElastic
Contributor

@juliaElastic juliaElastic commented Nov 29, 2023

What is the problem this PR solves?

Fleet Server does not yet report the state of remote Elasticsearch outputs.

How does this PR solve the problem?

Report the HEALTHY/DEGRADED state of remote Elasticsearch outputs to the `logs-fleet_server.output_health-default` data stream.

  • report the state when an API key is generated
  • report the state regularly from the self monitor by pinging the remote Elasticsearch hosts

How to test this PR locally

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

Resolves #3116

@@ -253,6 +255,32 @@ func (m *selfMonitorT) updateState(ctx context.Context) (client.UnitState, error
	return state, nil
}

func reportOutputHealth(ctx context.Context, bulker bulk.Bulk, logger zerolog.Logger) {
	// pinging logic
	bulkerMap := bulker.GetBulkerMap()
Contributor Author

As mentioned on the previous PR, regular health reporting stops when fleet-server is restarted and does not resume until an agent tries to create an API key again (e.g. due to a change in the output config). This happens because the bulkerMap is stored in memory, and output bulkers are created only when there is a config change or when an agent uses a new output for the first time.

@juliaElastic juliaElastic marked this pull request as ready for review November 30, 2023 10:36
@juliaElastic juliaElastic requested a review from a team as a code owner November 30, 2023 10:36
@@ -218,6 +218,8 @@ func (m *selfMonitorT) updateState(ctx context.Context) (client.UnitState, error
		return client.UnitStateStarting, nil
	}

	reportOutputHealth(ctx, m.bulker, m.log)
Contributor Author

@juliaElastic juliaElastic Nov 30, 2023

Currently this pings the remote outputs every 5s (the default monitor interval) and writes a doc to the output health data stream on every ping.
We could change this to write a doc only when the state changes.
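
The "write only on state change" idea could be sketched roughly as follows. This is an illustration under assumptions, not what the PR implements: `changeReporter`, `newChangeReporter`, and `Report` are hypothetical names, and the `write` callback stands in for whatever indexes the health doc.

```go
package main

import "fmt"

// changeReporter remembers the last state written for each output.
// Hypothetical helper for illustration; it is not safe for concurrent use.
type changeReporter struct {
	last  map[string]string
	write func(output, state string)
}

func newChangeReporter(write func(output, state string)) *changeReporter {
	return &changeReporter{last: make(map[string]string), write: write}
}

// Report calls write only when the state differs from the last one recorded
// for that output. It returns true if a write happened.
func (r *changeReporter) Report(output, state string) bool {
	if prev, ok := r.last[output]; ok && prev == state {
		return false
	}
	r.last[output] = state
	r.write(output, state)
	return true
}

func main() {
	var written []string
	r := newChangeReporter(func(output, state string) {
		written = append(written, output+"="+state)
	})
	// Five monitor ticks, three distinct transitions.
	for _, s := range []string{"HEALTHY", "HEALTHY", "DEGRADED", "DEGRADED", "HEALTHY"} {
		r.Report("remote-a", s)
	}
	fmt.Println(written) // prints "[remote-a=HEALTHY remote-a=DEGRADED remote-a=HEALTHY]"
}
```

One trade-off of this design: the last-seen state lives in memory, so each fleet-server instance writes its own first observation, and a restart causes one extra write per output.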

Comment on lines 22 to 28
type OutputHealth struct {
	Output     string     `json:"output,omitempty"`
	State      string     `json:"state,omitempty"`
	Message    string     `json:"message,omitempty"`
	Timestamp  string     `json:"@timestamp,omitempty"`
	DataStream DataStream `json:"data_stream,omitempty"`
}
Contributor

If this is something that is being written to ES, does it make more sense to define it in model/schema.json instead?

Contributor Author

moved to schema.json

Comment on lines +39 to +41
	Dataset:   "fleet_server.output_health",
	Type:      "logs",
	Namespace: "default",
Contributor

Should these be constants? Can Namespace ever be something else?

Contributor Author

I think it will always be `default`.
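
If the values were pulled into constants as the reviewer suggests, it might look like the sketch below. The constant names and the `dataStreamName` helper are hypothetical; the values and the resulting name `logs-fleet_server.output_health-default` come from this PR, and the `type-dataset-namespace` layout follows the Elastic data stream naming scheme.

```go
package main

import "fmt"

// Hypothetical constant names; the values are the ones shown in the diff.
const (
	outputHealthType      = "logs"
	outputHealthDataset   = "fleet_server.output_health"
	outputHealthNamespace = "default"
)

// dataStreamName joins the three parts as <type>-<dataset>-<namespace>.
func dataStreamName(typ, dataset, namespace string) string {
	return typ + "-" + dataset + "-" + namespace
}

func main() {
	name := dataStreamName(outputHealthType, outputHealthDataset, outputHealthNamespace)
	fmt.Println(name) // prints "logs-fleet_server.output_health-default"
}
```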

internal/pkg/policy/policy_output.go (outdated; thread resolved)
@juliaElastic juliaElastic requested a review from a team December 4, 2023 07:44
juliaElastic added a commit to elastic/kibana that referenced this pull request Dec 5, 2023
## Summary

Closes #104986

Enable feature flags for `remoteESOutput` and `outputSecretsStorage`.

The feature is ready when #172181 and elastic/fleet-server#3127 are merged.

Output secret storage [issues](#157458) are closed, so I think the feature flag for that should be enabled too. cc @jillguyonnet
Contributor

@michel-laterman michel-laterman left a comment

lgtm

Quality Gate passed

The SonarQube Quality Gate passed, but some issues were introduced.

1 New issue
0 Security Hotspots
60.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube

@juliaElastic juliaElastic merged commit c232532 into elastic:main Dec 6, 2023
9 checks passed
Development

Successfully merging this pull request may close these issues.

Report remote output health
2 participants