
[prometheusreceiver] should not validate combined metrics timestamp for type like gauge, counter #12498

Closed
newly12 opened this issue Jul 16, 2022 · 14 comments

@newly12
Contributor

newly12 commented Jul 16, 2022

Is your feature request related to a problem? Please describe.
Scraping cadvisor metrics fails due to the timestamp validation of combined metrics, even though container_cpu_load_average_10s is a gauge.

2022-07-16T10:15:54.277+0800    debug   scrape/scrape.go:1552   Unexpected error        {"kind": "receiver", "name": "prometheus", "pipeline": "metrics", "scrape_pool": "kubernetes-cadvisor-metrics", "target": "https://xxx:10250/metrics/cadvisor", "series": "container_cpu_load_average_10s{container=\"\",id=\"/kubepods.slice\",image=\"\",name=\"\",namespace=\"\",pod=\"\"}", "error": "inconsistent timestamps on metric points for metric container_cpu_load_average_10s"}
2022-07-16T10:15:54.277+0800    debug   scrape/scrape.go:1328   Append failed   {"kind": "receiver", "name": "prometheus", "pipeline": "metrics", "scrape_pool": "kubernetes-cadvisor-metrics", "target": "https://xxx:10250/metrics/cadvisor", "error": "inconsistent timestamps on metric points for metric container_cpu_load_average_10s"}
2022-07-16T10:15:54.277+0800    warn    internal/otlp_metricsbuilder.go:164     Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus", "pipeline": "metrics", "scrape_timestamp": 1657937752544, "target_labels": "map[__name__:up instance:xxx:10250 job:kubernetes-cadvisor-metrics]"}
2022-07-16T10:15:54.278+0800    INFO    loggingexporter/logging_exporter.go:57  MetricsExporter {"#metrics": 5}
...
Metric #4
Descriptor:
     -> Name: up
     -> Description: The scraping was successful
     -> Unit: 
     -> DataType: Gauge
NumberDataPoints #0
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2022-07-16 02:15:52.544 +0000 UTC
Value: 0.000000

Help text for the container_cpu_load_average_10s metric:

# HELP container_cpu_load_average_10s Value of container cpu load average over the last 10 seconds.
# TYPE container_cpu_load_average_10s gauge

Describe the solution you'd like
If histogram and summary points are required to have consistent timestamps (if I read #9385 correctly), should the validation be skipped for other metric types?

Describe alternatives you've considered

Additional context
Can be reproduced with metrics such as the following:

# HELP container_cpu_load_average_10s Value of container cpu load average over the last 10 seconds.
# TYPE container_cpu_load_average_10s gauge
container_cpu_load_average_10s{container="",id="/",image="",name="",namespace="",pod=""} 0 1658124636389
container_cpu_load_average_10s{container="",id="/kubepods.slice",image="",name="",namespace="",pod=""} 0 1658124637748
container_cpu_load_average_10s{container="",id="/kubepods.slice/kubepods-besteffort.slice",image="",name="",namespace="",pod=""} 0 1658124640330
@newly12
Contributor Author

newly12 commented Jul 18, 2022

It looks like I can't reproduce the issue with the above metrics. However, while scraping cadvisor metrics from one cluster, all scrapes are failing, while some other clusters seem to be OK. I will dig into this some more and get back here.

@dashpole
Contributor

That does look problematic. Let me know if you want help investigating. I'm also OK reverting the PR in the meantime.

@newly12
Contributor Author

newly12 commented Jul 19, 2022

@dashpole I missed one part of the config while reproducing this error. I can reproduce it if I add the following to the Prometheus receiver config,

          metric_relabel_configs:
            - action: labeldrop
              regex: id

given metrics like these:

# HELP container_cpu_load_average_10s Value of container cpu load average over the last 10 seconds.
# TYPE container_cpu_load_average_10s gauge
container_cpu_load_average_10s{container="",id="/",image="",name="",namespace="",pod=""} 0 1658124636389
container_cpu_load_average_10s{container="",id="/kubepods.slice",image="",name="",namespace="",pod=""} 0 1658124637748
container_cpu_load_average_10s{container="",id="/kubepods.slice/kubepods-besteffort.slice",image="",name="",namespace="",pod=""} 0 1658124640330

Their groupKey will be the same once the id label is dropped, so they fall into the same metricGroup; since each of them has a different timestamp, we get the inconsistent timestamps error.
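For illustration, this is roughly what the receiver ends up grouping after the relabeling: the three series above collapse to an identical label set while keeping their original timestamps.

container_cpu_load_average_10s{container="",image="",name="",namespace="",pod=""} 0 1658124636389
container_cpu_load_average_10s{container="",image="",name="",namespace="",pod=""} 0 1658124637748
container_cpu_load_average_10s{container="",image="",name="",namespace="",pod=""} 0 1658124640330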

@dashpole
Contributor

Hmmm... OK. I think only doing timestamp checks for histograms and summaries is a good idea. But it does mean that if you cause two metric streams to collide for a histogram using labeldrop, you can still hit this problem (e.g. if container_cpu_load_average_10s were actually a histogram), since it will try to merge the two histograms into one histogram.

@newly12
Contributor Author

newly12 commented Jul 20, 2022

I guess it is hard to tell what the right behavior should be. I do have #12522 to skip the validation for counters and gauges, but as you mentioned, it could run into issues for histograms and summaries. I am not sure whether Prometheus has such enforcement (I haven't tested it); maybe we should align with Prometheus and keep the same behavior, or maybe offer an option for users to disable the validation even though it may produce inaccurate results?

@dashpole
Contributor

Prometheus doesn't have a strict notion of a histogram in the protocol (OpenMetrics does), so it doesn't have any enforcement. It is more of a convention to group together a bunch of counters into a histogram.

But even if we were to skip validation on two colliding histograms, they would likely fail to produce a valid histogram (buckets could be duplicated, or could produce a case where a higher bucket has a smaller cumulative value).
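As a hedged illustration (the metric name and values here are hypothetical, not from this issue), consider two histogram bucket streams that collide once the id label is dropped:

# Before dropping the id label (hypothetical histogram)
http_request_duration_seconds_bucket{id="/a",le="0.5"} 10
http_request_duration_seconds_bucket{id="/a",le="1"} 12
http_request_duration_seconds_bucket{id="/b",le="0.5"} 3
http_request_duration_seconds_bucket{id="/b",le="1"} 4
# After labeldrop of id, both streams expose le="0.5" and le="1" with identical
# labels: a naive merge duplicates buckets, and mixing the two streams can leave a
# higher bucket (le="1", value 4) with a smaller cumulative count than a lower
# bucket (le="0.5", value 10).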

@mx-psi mx-psi added the receiver/prometheus Prometheus receiver label Jul 21, 2022
@github-actions
Contributor

Pinging code owners: @Aneurysm9 @dashpole

@newly12
Contributor Author

newly12 commented Aug 11, 2022

But even if we were to skip validation on two colliding histograms, they would likely fail to produce a valid histogram (buckets could be duplicated, or could produce a case where a higher bucket has a smaller cumulative value).

Yeah, that's a valid concern. Any suggestion for handling this kind of issue? Is there a way to drop only the colliding metrics instead of failing all scraped metrics in one batch/group?

@dashpole
Contributor

Is there a way to drop only the colliding metrics instead of failing all scraped metrics in one batch/group?

I think that's already the behavior today... Does it seem to be behaving differently?

Overall, I would recommend using the metricstransform processor to aggregate the label away (you can sum or average the colliding metrics), rather than dropping the dimension in prom relabel rules: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/metricstransformprocessor#metrics-transform-processor
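A minimal sketch of such a metricstransform configuration, assuming the goal is to sum the colliding series together instead of dropping the id label in relabel rules (the include pattern, label_set, and aggregation_type below are illustrative choices, not taken from this thread):

processors:
  metricstransform:
    transforms:
      - include: container_cpu_load_average_10s
        match_type: strict
        action: update
        operations:
          # Keep only these labels and aggregate points that become identical.
          - action: aggregate_labels
            label_set: [container, image, name, namespace, pod]
            aggregation_type: sum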

@evan-bradley evan-bradley added question Further information is requested waiting for author labels Sep 9, 2022
@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Nov 11, 2022
@jmacd
Contributor

jmacd commented Mar 29, 2023

Their groupKey will be the same once the id label is dropped, so they fall into the same metricGroup

Isn't this a single-writer rule violation? It sounds like you've erased a label and produced a bunch of points w/ the same labels and different timestamps. That, in my understanding, is the definition of overlapping streams, something that Prometheus cautions you against doing but does not enforce. Is the problem that OTel is enforcing this constraint where Prometheus wasn't?

@jmacd
Contributor

jmacd commented Apr 5, 2023

My recommendation is to leave the existing validation for any metric data point that has defined temporality (e.g., Counter, UpDownCounter, Histogram) AND any metric data point that is composed of multiple Prometheus series (e.g., Histogram, Summary). This leaves the potential to relax this validation for Gauges only, because Gauges generally do not require a start timestamp to establish their correct interpretation, so a single-writer violation is not a first-order problem. A combination of gauges from multiple locations will, when combined, look like they came from a single writer with many interleaved data points.

@github-actions github-actions bot removed the Stale label May 26, 2023
@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Jul 26, 2023
@github-actions
Contributor

This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions github-actions bot closed this as not planned on Sep 24, 2023