Improve Monitoring Of Filebeat With New Metrics #33250

Open
aveuiller opened this issue Oct 4, 2022 · 4 comments
Labels
needs_team (Indicates that the issue/PR needs a Team:* label), Stalled

Comments

@aveuiller

Hello,

Describe the enhancement

We currently rely on custom tooling to collect several metrics that are important for assessing the stability of Filebeat, as I mentioned in #33206.

We would like to see those metrics integrated natively. This would greatly simplify our workflow and standardize data collection for Filebeat instances on both bare metal and Kubernetes pods.

The proposed enhancement consists of three features that improve visibility into the state of Filebeat. The main goal is to be able to tell whether Filebeat is working as expected.

Describe a specific use case for the enhancement or feature:

In this section I will describe each metric and the integration we are aiming for. The end goal is to feed these new metrics into our alerting systems so that we can react quickly to any bad state.

New Feature: Heartbeat

First of all, we currently have a cron job that writes a message to a log file every x minutes. This log file is tailed by Filebeat and the resulting events are sent to our infrastructure.
This gives us a good overview of the log collection status by ensuring that logs flow continuously. However, it currently requires external components.

We would love to see that directly handled by Filebeat, activated through the configuration for instance.
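For illustration, an external heartbeat of the kind described above could be as small as the sketch below. The function name `write_heartbeat`, the JSON line format, and the log path are assumptions made for this example; they are not an existing Filebeat feature, nor necessarily the exact script in use.

```python
import json
import time
from pathlib import Path


def write_heartbeat(log_path, now=None):
    """Append one JSON heartbeat line to log_path and return the line written."""
    # `now` can be injected for testing; otherwise use the current time.
    timestamp = int(time.time() if now is None else now)
    line = json.dumps({"type": "heartbeat", "timestamp": timestamp})
    path = Path(log_path)
    # Create the parent directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(line + "\n")
    return line
```

A cron entry would call this every x minutes against a file that a Filebeat input tails; the receiving side can then alert when heartbeat events stop arriving.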

New Metric: Last Registry Update Time

Following an incident with a stalled Filebeat that was still attempting to send data, a non-updated registry seems to be a good indicator of a bad state that should be investigated ASAP.

We currently retrieve the last update time with the command `stat -c %Z /var/lib/filebeat/registry/filebeat/log.json` (which prints the file's last status change time in seconds since the epoch), and export it, once again, with custom tools.
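A rough Python equivalent of that `stat -c %Z` check is sketched below. On Linux, `os.stat(...).st_ctime` is the same status change time that `%Z` reports. The function names and the staleness threshold are assumptions for the example.

```python
import os
import time


def registry_age_seconds(registry_path, now=None):
    """Return the seconds elapsed since the file's last status change (ctime)."""
    changed_at = os.stat(registry_path).st_ctime
    current = time.time() if now is None else now
    return current - changed_at


def is_registry_stale(registry_path, max_age_seconds, now=None):
    """Return True when the registry file has not changed within max_age_seconds."""
    return registry_age_seconds(registry_path, now) > max_age_seconds
```

For example, `is_registry_stale("/var/lib/filebeat/registry/filebeat/log.json", 300)` would flag a registry that has not been touched for five minutes. A suitable threshold depends on how often the instance is expected to ship events.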

Once again, having this data available directly in Filebeat would be great. For instance, integrated into the /stats results, it could look like the following:

{
  "beat": {
    "info": {
      "ephemeral_id": "62e0e489-14c5-4cbd-a87a-f2ebf4643a7a",
      "name": "filebeat",
      "uptime": {
        "ms": 205465136
      },
      "version": "8.3.3",
      "registry_update": {
        "timestamp": 1664896065
      }
    }
  }
}
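If the field were exposed this way, an alerting check could reduce to something like the sketch below. Note that `registry_update` is only the field proposed in this issue, not something Filebeat is known to return, so the helper returns `None` when it is missing. The helper name and the trimmed sample payload are assumptions for the example.

```python
import json
import time


def registry_update_age(stats, now=None):
    """Return seconds since the proposed registry_update timestamp, or None if absent."""
    info = stats.get("beat", {}).get("info", {})
    timestamp = info.get("registry_update", {}).get("timestamp")
    if timestamp is None:
        return None  # the proposed field is not present in this payload
    current = time.time() if now is None else now
    return current - timestamp


# Trimmed-down version of the payload proposed above.
sample = json.loads("""
{"beat": {"info": {"name": "filebeat", "version": "8.3.3",
                   "registry_update": {"timestamp": 1664896065}}}}
""")
```

With this sample, `registry_update_age(sample, now=1664896125)` evaluates to `60`, which an alert rule could compare against a maximum allowed age.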

New Metric: Kafka Connectivity Status

In the same vein, we monitor the connectivity state by parsing the output of `filebeat -e -c /etc/filebeat/filebeat.yml test output` to ensure that all Kafka brokers can be reached.

It would help tremendously to either have this kind of recurring check built into Filebeat, or simply to have Filebeat keep track of the number of brokers in each state, independently of the configuration.
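As a point of comparison, a recurring check that does not depend on Filebeat at all can be sketched with plain TCP connections. This is a swapped-in technique, not what `filebeat test output` does: it only confirms that a port accepts connections and performs no Kafka handshake, TLS, or authentication. The function name and the `(host, port)` input format are assumptions for the example.

```python
import socket


def probe_brokers(brokers, timeout=2.0):
    """Try a TCP connection to each (host, port) pair and count the outcomes."""
    counts = {"connected": 0, "failed": 0}
    for host, port in brokers:
        try:
            # The connection is closed again as soon as it succeeds.
            with socket.create_connection((host, port), timeout=timeout):
                counts["connected"] += 1
        except OSError:
            # Covers refused connections, timeouts, and DNS failures.
            counts["failed"] += 1
    return counts
```

The broker list still has to be supplied by the caller, which is the main reason a native counter inside Filebeat would be preferable.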

As before, integrated in the /stats results, this could look like the following:

{
  "libbeat": {
    "output": {
      "events": {},
      "read": {},
      "type": "kafka",
      "write": {},
      "brokers": {
        "pending": 1,
        "failed": 0,
        "connected": 2
      }
    }
  }
}
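On the alerting side, counters like these could be mapped to a severity with a few lines of code. The sketch below assumes the `brokers` object proposed above, which Filebeat is not known to expose; the severity names and rules are illustrative assumptions only.

```python
def broker_alert_level(stats):
    """Map the proposed brokers counters to 'ok', 'warning', 'critical' or 'unknown'."""
    brokers = stats.get("libbeat", {}).get("output", {}).get("brokers")
    if not brokers:
        return "unknown"  # the proposed counters are not present
    connected = brokers.get("connected", 0)
    failed = brokers.get("failed", 0)
    pending = brokers.get("pending", 0)
    if connected == 0:
        return "critical"  # no broker is reachable at all
    if failed > 0 or pending > 0:
        return "warning"  # degraded: some brokers are not connected
    return "ok"
```

With the example payload above (1 pending, 0 failed, 2 connected), this returns `"warning"`.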

Let me know if you need more details.

Best regards,
Antoine.

@botelastic botelastic bot added the needs_team label Oct 4, 2022
@botelastic

botelastic bot commented Oct 4, 2022

This issue doesn't have a Team:<team> label.

@botelastic

botelastic bot commented Oct 4, 2023

Hi!
We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1.
Thank you for your contribution!

@botelastic botelastic bot added the Stalled label Oct 4, 2023
@aveuiller
Author

👍

@botelastic botelastic bot removed the Stalled label Oct 6, 2023
@botelastic

botelastic bot commented Oct 5, 2024

Hi!
We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1.
Thank you for your contribution!

@botelastic botelastic bot added the Stalled label Oct 5, 2024