
Metricbeat GCP Module panics on nil Metadata in Metricset #22494

Closed

andrewstucki opened this issue Nov 9, 2020 · 10 comments · Fixed by #32281
Labels
bug · Metricbeat · Team:Cloud-Monitoring (Label for the Cloud Monitoring team)

Comments

@andrewstucki

  • Version: 7.9.3
  • Steps to Reproduce: Create a GCP metric that has no metadata. Configure Metricbeat to retrieve it. Metricbeat panics.

This appears to be coming from this line:

if out.Metadata.SamplePeriod != nil {

I believe we're dereferencing the Metadata field which, according to the struct definition, is optional and can be nil. We should add a nil check as a guard.
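
For illustration, here is the failure mode in miniature; the types below are simplified stand-ins for the monitoring API's descriptor types, not the actual metricbeat code:

package main

import (
    "fmt"
    "time"
)

// Simplified stand-ins for the descriptor metadata types (illustrative only).
type metricMetadata struct {
    SamplePeriod *time.Duration
    IngestDelay  *time.Duration
}

type descriptorMeta struct {
    Metadata *metricMetadata
}

func main() {
    var out descriptorMeta // Metadata stays nil, as for metrics without metadata
    // Mirrors the problematic check: out.Metadata is dereferenced before
    // anyone verifies it is non-nil, so this line panics with a nil
    // pointer dereference.
    if out.Metadata.SamplePeriod != nil {
        fmt.Println("has sample period")
    }
}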

@andresrc andresrc added Team:Platforms Label for the Integrations - Platforms team and removed Team:Observability labels Nov 10, 2020
@elasticmachine
Collaborator

Pinging @elastic/integrations-platforms (Team:Platforms)

@sayden
Contributor

sayden commented Nov 10, 2020

I also think this is a bug in the GCP module; I'm just not sure about the proper fix for it without losing accuracy. Metricbeat checks the sample period of the incoming metrics because Metricbeat's own sample period is usually not the same.

Generally speaking, any MB module sets up a fetch period (say 30 seconds), fetches the metrics from the service, and stores them in Elasticsearch, so the user gets a full picture of the service state every 30 seconds. In GCP, however, the sample period on the service side may be, for example, 5 minutes while Metricbeat's fetch period is 1 minute. So Metricbeat might fetch the exact same full picture 5 times, giving incorrect aggregation results later.

On the other hand, if Metricbeat ignores metrics without a SamplePeriod value, it might lose metrics. So MB should "know" which ones can be safely inserted without a sample period value and without producing wrong aggregation results later.
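
To make the accuracy concern concrete, a minimal sketch of the window alignment at stake, with purely illustrative names and values:

package main

import (
    "fmt"
    "time"
)

func main() {
    fetchPeriod := 1 * time.Minute  // Metricbeat's own fetch period
    samplePeriod := 5 * time.Minute // what GCP reports in metric metadata
    // Widen the query window to at least one sample period so each fetch
    // covers a real data point instead of re-reading the previous one.
    window := fetchPeriod
    if samplePeriod > window {
        window = samplePeriod
    }
    end := time.Now()
    start := end.Add(-window) // request [start, end) from the monitoring API
    fmt.Println(start, end)
}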

@sayden
Contributor

sayden commented Nov 10, 2020

Just to give more info: it seems that the sample period is not mandatory for every Stackdriver metric: https://cloud.google.com/monitoring/api/metrics#metadata

The description for each metric type might include additional information, called metadata, about the metric.

@andrewstucki
Author

@sayden so, interestingly, if you take a look at the file, there are default meta values in case the SamplePeriod or IngestDelay fields aren't set on the Metadata field. I think the simplest fix is just wrapping both conditionals in a Metadata nil check, like this:

if out.Metadata != nil { // <-- this is new
    if out.Metadata.SamplePeriod != nil {
        ...
    }

    if out.Metadata.IngestDelay != nil {
        ...
    }
}
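
Folding in those defaults, a fuller nil-safe sketch could look like the following; defaultSamplePeriod, defaultIngestDelay, the metricMetadata type, and the use of the protobuf durationpb package are all assumptions for illustration, not the actual identifiers in the file:

package main

import (
    "fmt"
    "time"

    "google.golang.org/protobuf/types/known/durationpb"
)

// Assumed stand-ins for the default meta values mentioned above; the
// real identifiers in the metricbeat file may differ.
const (
    defaultSamplePeriod = 60 * time.Second
    defaultIngestDelay  = 0 * time.Second
)

// metricMetadata mimics the optional metadata shape with protobuf durations.
type metricMetadata struct {
    SamplePeriod *durationpb.Duration
    IngestDelay  *durationpb.Duration
}

// periods applies the nil guard and falls back to the defaults whenever
// the metadata (or one of its fields) is absent.
func periods(meta *metricMetadata) (sample, delay time.Duration) {
    sample, delay = defaultSamplePeriod, defaultIngestDelay
    if meta != nil {
        if meta.SamplePeriod != nil {
            sample = meta.SamplePeriod.AsDuration()
        }
        if meta.IngestDelay != nil {
            delay = meta.IngestDelay.AsDuration()
        }
    }
    return sample, delay
}

func main() {
    s, d := periods(nil) // no metadata at all: defaults win, no panic
    fmt.Println(s, d)
}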

@kaiyan-sheng kaiyan-sheng self-assigned this Nov 10, 2020
@sayden
Contributor

sayden commented Nov 17, 2020

@andrewstucki maybe it's not that simple. My suggestion is to add the nil checks but then double-check that we are not receiving duplicate events from the metrics that don't have metadata, and that we aren't ignoring metrics.

In other words, my worry is that the fetch operation will bring duplicates or simply omit those metrics. We have 2 scenarios:

  • Metadata isn't present, so Metricbeat doesn't store the metric in Elasticsearch. A metric is specified but later the user can't see it in Elasticsearch.
  • Metadata isn't present, so Metricbeat inserts the metric ignoring the sample period. The problem here is that with a fetch period of 1 minute in Metricbeat and a metric that is sampled every 5 minutes by GCP (even if that isn't included in the metadata), we store the same metric value and timestamp 4 times before getting a new sampled value, giving wrong aggregation values in Kibana. (A sketch of a possible guard for this follows below.)
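
A minimal sketch of that guard, assuming each time series point carries an end timestamp and using an illustrative series-key scheme (none of these names are actual metricbeat code):

package main

import (
    "fmt"
    "time"
)

// lastSeen remembers the newest point end-timestamp per series key so
// overlapping fetches don't re-insert the same sample. Illustrative only.
var lastSeen = map[string]time.Time{}

// shouldStore reports whether a point with the given end timestamp is new
// for this series key; same-or-older points from repeat fetches are dropped.
func shouldStore(key string, end time.Time) bool {
    if prev, ok := lastSeen[key]; ok && !end.After(prev) {
        return false // duplicate of an already-stored point
    }
    lastSeen[key] = end
    return true
}

func main() {
    ts := time.Now()
    fmt.Println(shouldStore("cpu/usage|instance-1", ts)) // true: new point
    fmt.Println(shouldStore("cpu/usage|instance-1", ts)) // false: duplicate
}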

@andresrc
Contributor

@masci At least I think we should try to prioritise some mitigation. Given the conversation, it seems that the "right" solution, if any, may require some discussion, but a panic is a pretty bad thing. Given that we are able to detect the situation, we should do something "safe" (e.g. trying not to ship incorrect data), like skipping the event (ideally with a "throttled" warning), which will always be better than a panic.

@masci

masci commented Dec 18, 2020

I think the only quick workaround is the one proposed by @andrewstucki - in the end, whether we're panicking or dropping, the metric wouldn't make it into Elasticsearch. Do we have any facility to implement a throttled warning?
A proper fix covering @sayden's notes will take some time.

@andresrc
Contributor

I agree, I prefer dropping to panicking, as we might be collecting other things with the same instance. @urso do we have something to help implement some throttled warning or should we just start with something simpler?

@urso

urso commented Jan 11, 2021

do we have something to help implement some throttled warning or should we just start with something simpler?

No, we have no support for throttling warnings/errors to logs. Something "simpler" would be better.
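
For reference, a self-contained sketch of that "simpler" option using only the standard library: a warning logger that fires at most once per interval. This is not an existing Beats facility; every name here is an assumption:

package main

import (
    "log"
    "sync"
    "time"
)

// throttledWarn logs at most once per interval and silently drops the
// rest. A sketch only, not an existing Beats helper.
type throttledWarn struct {
    mu       sync.Mutex
    interval time.Duration
    last     time.Time
}

func (t *throttledWarn) Warnf(format string, args ...interface{}) {
    t.mu.Lock()
    defer t.mu.Unlock()
    if now := time.Now(); now.Sub(t.last) >= t.interval {
        t.last = now
        log.Printf("WARN: "+format, args...)
    }
}

func main() {
    warn := &throttledWarn{interval: time.Minute}
    for i := 0; i < 3; i++ {
        warn.Warnf("dropping metric %q: nil metadata", "example.metric")
    }
    // Only the first call actually logs within the same minute.
}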

@botelastic

botelastic bot commented Jan 27, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@botelastic botelastic bot added the Stalled label Jan 27, 2022
@andresrc andresrc added the Team:Cloud-Monitoring Label for the Cloud Monitoring team label Jun 16, 2022
@botelastic botelastic bot removed the Stalled label Jun 16, 2022
@andresrc andresrc added bug and removed Team:Platforms Label for the Integrations - Platforms team labels Jun 16, 2022