
Metricbeat GCP Module panics on nil Metadata in Metricset #22494

Closed

andrewstucki opened this issue Nov 9, 2020 · 10 comments · Fixed by #32281
Labels
bug · Metricbeat · Team:Cloud-Monitoring (Label for the Cloud Monitoring team)

Comments

@andrewstucki

  • Version: 7.9.3
  • Steps to Reproduce: Create a GCP metric that has no metadata. Configure Metricbeat to retrieve it. Metricbeat panics.

This appears to be coming from this line:

if out.Metadata.SamplePeriod != nil {

I believe we're dereferencing the Metadata field which, according to the struct definition, is optional and can be nil. We should add a nil check as a guard.
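
For illustration, here is the failure mode in miniature; the types below are simplified stand-ins for the monitoring API's descriptor types, not the actual metricbeat code:

package main

import (
    "fmt"
    "time"
)

// Simplified stand-ins for the descriptor metadata types (illustrative only).
type metricMetadata struct {
    SamplePeriod *time.Duration
    IngestDelay  *time.Duration
}

type descriptorMeta struct {
    Metadata *metricMetadata
}

func main() {
    var out descriptorMeta // Metadata stays nil, as for metrics without metadata
    // Mirrors the problematic check: out.Metadata is dereferenced before
    // anyone verifies it is non-nil, so this line panics with a nil
    // pointer dereference.
    if out.Metadata.SamplePeriod != nil {
        fmt.Println("has sample period")
    }
}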

@andresrc andresrc added Team:Platforms Label for the Integrations - Platforms team and removed Team:Observability labels Nov 10, 2020
@elasticmachine
Collaborator

Pinging @elastic/integrations-platforms (Team:Platforms)

@sayden
Contributor

sayden commented Nov 10, 2020

I also think this is a bug in the GCP module; I'm just not sure about the proper fix for it without losing accuracy. Metricbeat checks the sample period of the incoming metrics because Metricbeat's own sample period is usually not the same.

Generally speaking, any MB module sets up a fetch period (say 30 seconds), fetches the metrics from the service, and stores them in Elasticsearch, so the user gets a full picture of the service state every 30 seconds. In GCP, however, the sample period on the service side may be, for example, 5 minutes while Metricbeat's fetch period is 1 minute. So Metricbeat might fetch the exact same full picture 5 times, giving incorrect aggregation results later.

On the other hand, if Metricbeat ignores metrics without a SamplePeriod value, it might lose metrics. So MB should "know" which ones can be safely inserted without a sample period value and without producing wrong aggregation results later.
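
To make the accuracy concern concrete, a minimal sketch of the window alignment at stake, with purely illustrative names and values:

package main

import (
    "fmt"
    "time"
)

func main() {
    fetchPeriod := 1 * time.Minute  // Metricbeat's own fetch period
    samplePeriod := 5 * time.Minute // what GCP reports in metric metadata
    // Widen the query window to at least one sample period so each fetch
    // covers a real data point instead of re-reading the previous one.
    window := fetchPeriod
    if samplePeriod > window {
        window = samplePeriod
    }
    end := time.Now()
    start := end.Add(-window) // request [start, end) from the monitoring API
    fmt.Println(start, end)
}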

@sayden
Contributor

sayden commented Nov 10, 2020

Just to give more info: it seems that the sample period is not mandatory for every Stackdriver metric: https://cloud.google.com/monitoring/api/metrics#metadata

The description for each metric type might include additional information, called metadata, about the metric.

@andrewstucki
Author

@sayden so, interestingly, if you take a look at the file, there are default meta values in case the SamplePeriod or IngestDelay fields aren't set on the Metadata field. I think the simplest fix is just wrapping both conditionals in a Metadata nil check, like this:

if out.Metadata != nil { // <-- this is new
    if out.Metadata.SamplePeriod != nil {
        ...
    }

    if out.Metadata.IngestDelay != nil {
        ...
    }
}
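
Folding in those defaults, a fuller nil-safe sketch could look like the following; defaultSamplePeriod, defaultIngestDelay, the metricMetadata type, and the use of the protobuf durationpb package are all assumptions for illustration, not the actual identifiers in the file:

package main

import (
    "fmt"
    "time"

    "google.golang.org/protobuf/types/known/durationpb"
)

// Assumed stand-ins for the default meta values mentioned above; the
// real identifiers in the metricbeat file may differ.
const (
    defaultSamplePeriod = 60 * time.Second
    defaultIngestDelay  = 0 * time.Second
)

// metricMetadata mimics the optional metadata shape with protobuf durations.
type metricMetadata struct {
    SamplePeriod *durationpb.Duration
    IngestDelay  *durationpb.Duration
}

// periods applies the nil guard and falls back to the defaults whenever
// the metadata (or one of its fields) is absent.
func periods(meta *metricMetadata) (sample, delay time.Duration) {
    sample, delay = defaultSamplePeriod, defaultIngestDelay
    if meta != nil {
        if meta.SamplePeriod != nil {
            sample = meta.SamplePeriod.AsDuration()
        }
        if meta.IngestDelay != nil {
            delay = meta.IngestDelay.AsDuration()
        }
    }
    return sample, delay
}

func main() {
    s, d := periods(nil) // no metadata at all: defaults win, no panic
    fmt.Println(s, d)
}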

@kaiyan-sheng kaiyan-sheng self-assigned this Nov 10, 2020
@sayden
Contributor

sayden commented Nov 17, 2020

@andrewstucki maybe it's not that simple. My suggestion is to add the nil checks but then double-check that we are not receiving duplicate events from the metrics that don't have metadata, and that we aren't ignoring metrics.

In other words, my worry is that the fetch operation will bring duplicates or simply omit those metrics. We have 2 scenarios:

  • Metadata isn't present, so Metricbeat doesn't store the metric in Elasticsearch. A metric is specified but later the user can't see it in Elasticsearch.
  • Metadata isn't present, so Metricbeat inserts the metric ignoring the sample period. The problem here is that with a fetch period of 1 minute in Metricbeat and a metric that is sampled every 5 minutes by GCP (even if that isn't included in the metadata), we store the same metric value and timestamp 4 times before getting a new sampled value, giving wrong aggregation values in Kibana. (A sketch of a possible guard for this follows below.)
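
A minimal sketch of that guard, assuming each time series point carries an end timestamp and using an illustrative series-key scheme (none of these names are actual metricbeat code):

package main

import (
    "fmt"
    "time"
)

// lastSeen remembers the newest point end-timestamp per series key so
// overlapping fetches don't re-insert the same sample. Illustrative only.
var lastSeen = map[string]time.Time{}

// shouldStore reports whether a point with the given end timestamp is new
// for this series key; same-or-older points from repeat fetches are dropped.
func shouldStore(key string, end time.Time) bool {
    if prev, ok := lastSeen[key]; ok && !end.After(prev) {
        return false // duplicate of an already-stored point
    }
    lastSeen[key] = end
    return true
}

func main() {
    ts := time.Now()
    fmt.Println(shouldStore("cpu/usage|instance-1", ts)) // true: new point
    fmt.Println(shouldStore("cpu/usage|instance-1", ts)) // false: duplicate
}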

@andresrc
Contributor

@masci At least I think we should try to prioritise some mitigation. Given the conversation, it seems that the "right" solution, if any, may require some discussion, but a panic is a pretty bad thing. Given that we are able to detect the situation, we should do something "safe" (e.g. trying not to ship incorrect data), like skipping the event (ideally with a "throttled" warning), which will always be better than a panic.

@masci

masci commented Dec 18, 2020

I think the only quick workaround is the one proposed by @andrewstucki - in the end, whether we're panicking or dropping, the metric wouldn't make it into Elasticsearch. Do we have any facility to implement a throttled warning?
A proper fix covering @sayden's notes will take some time.

@andresrc
Contributor

I agree, I prefer dropping to panicking, as we might be collecting other things with the same instance. @urso do we have something to help implement some throttled warning or should we just start with something simpler?

@urso

urso commented Jan 11, 2021

do we have something to help implement some throttled warning or should we just start with something simpler?

No, we have no support for throttling warnings/errors to logs. Something "simpler" would be better.
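
For reference, a self-contained sketch of that "simpler" option using only the standard library: a warning logger that fires at most once per interval. This is not an existing Beats facility; every name here is an assumption:

package main

import (
    "log"
    "sync"
    "time"
)

// throttledWarn logs at most once per interval and silently drops the
// rest. A sketch only, not an existing Beats helper.
type throttledWarn struct {
    mu       sync.Mutex
    interval time.Duration
    last     time.Time
}

func (t *throttledWarn) Warnf(format string, args ...interface{}) {
    t.mu.Lock()
    defer t.mu.Unlock()
    if now := time.Now(); now.Sub(t.last) >= t.interval {
        t.last = now
        log.Printf("WARN: "+format, args...)
    }
}

func main() {
    warn := &throttledWarn{interval: time.Minute}
    for i := 0; i < 3; i++ {
        warn.Warnf("dropping metric %q: nil metadata", "example.metric")
    }
    // Only the first call actually logs within the same minute.
}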

@botelastic

botelastic bot commented Jan 27, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@botelastic botelastic bot added the Stalled label Jan 27, 2022
@andresrc andresrc added the Team:Cloud-Monitoring Label for the Cloud Monitoring team label Jun 16, 2022
@botelastic botelastic bot removed the Stalled label Jun 16, 2022
@andresrc andresrc added bug and removed Team:Platforms Label for the Integrations - Platforms team labels Jun 16, 2022