
[prometheusreceiver] should not validate combined metrics timestamp for type like gauge, counter #12498

Closed
newly12 opened this issue Jul 16, 2022 · 14 comments

@newly12
Contributor

newly12 commented Jul 16, 2022

Is your feature request related to a problem? Please describe.
Scraping cadvisor metrics fails due to the timestamp validation of combined metrics, even though container_cpu_load_average_10s is a gauge.

2022-07-16T10:15:54.277+0800    debug   scrape/scrape.go:1552   Unexpected error        {"kind": "receiver", "name": "prometheus", "pipeline": "metrics", "scrape_pool": "kubernetes-cadvisor-metrics", "target": "https://xxx:10250/metrics/cadvisor", "series": "container_cpu_load_average_10s{container=\"\",id=\"/kubepods.slice\",image=\"\",name=\"\",namespace=\"\",pod=\"\"}", "error": "inconsistent timestamps on metric points for metric container_cpu_load_average_10s"}
2022-07-16T10:15:54.277+0800    debug   scrape/scrape.go:1328   Append failed   {"kind": "receiver", "name": "prometheus", "pipeline": "metrics", "scrape_pool": "kubernetes-cadvisor-metrics", "target": "https://xxx:10250/metrics/cadvisor", "error": "inconsistent timestamps on metric points for metric container_cpu_load_average_10s"}
2022-07-16T10:15:54.277+0800    warn    internal/otlp_metricsbuilder.go:164     Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus", "pipeline": "metrics", "scrape_timestamp": 1657937752544, "target_labels": "map[__name__:up instance:xxx:10250 job:kubernetes-cadvisor-metrics]"}
2022-07-16T10:15:54.278+0800    INFO    loggingexporter/logging_exporter.go:57  MetricsExporter {"#metrics": 5}
...
Metric #4
Descriptor:
     -> Name: up
     -> Description: The scraping was successful
     -> Unit: 
     -> DataType: Gauge
NumberDataPoints #0
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2022-07-16 02:15:52.544 +0000 UTC
Value: 0.000000

Help text for the container_cpu_load_average_10s metric:

# HELP container_cpu_load_average_10s Value of container cpu load average over the last 10 seconds.
# TYPE container_cpu_load_average_10s gauge

Describe the solution you'd like
If histogram and summary points are required to have consistent timestamps (if I read #9385 correctly), should the validation be skipped for other metric types?

Describe alternatives you've considered

Additional context
Can be reproduced with metrics such as the following:

# HELP container_cpu_load_average_10s Value of container cpu load average over the last 10 seconds.
# TYPE container_cpu_load_average_10s gauge
container_cpu_load_average_10s{container="",id="/",image="",name="",namespace="",pod=""} 0 1658124636389
container_cpu_load_average_10s{container="",id="/kubepods.slice",image="",name="",namespace="",pod=""} 0 1658124637748
container_cpu_load_average_10s{container="",id="/kubepods.slice/kubepods-besteffort.slice",image="",name="",namespace="",pod=""} 0 1658124640330
@newly12
Contributor Author

newly12 commented Jul 18, 2022

It looks like I can't reproduce the issue with the above metrics. However, while scraping cadvisor metrics from one cluster, all scrapes are failing, while some other clusters seem to be OK. I will dig into this some more and get back here.

@dashpole
Contributor

That does look problematic. Let me know if you want help investigating. I'm also OK reverting the PR in the meantime.

@newly12
Contributor Author

newly12 commented Jul 19, 2022

@dashpole I missed one part of the config while reproducing this error. I can reproduce it if I add the following to the Prometheus receiver config,

          metric_relabel_configs:
            - action: labeldrop
              regex: id

given metrics like these:

# HELP container_cpu_load_average_10s Value of container cpu load average over the last 10 seconds.
# TYPE container_cpu_load_average_10s gauge
container_cpu_load_average_10s{container="",id="/",image="",name="",namespace="",pod=""} 0 1658124636389
container_cpu_load_average_10s{container="",id="/kubepods.slice",image="",name="",namespace="",pod=""} 0 1658124637748
container_cpu_load_average_10s{container="",id="/kubepods.slice/kubepods-besteffort.slice",image="",name="",namespace="",pod=""} 0 1658124640330

Their groupKey will be the same once the id label is dropped, so they fall into the same metricGroup; since each of them has a different timestamp, we get the inconsistent timestamps error.
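For illustration, this is roughly what the receiver ends up grouping after the relabeling: the three series above collapse to an identical label set while keeping their original timestamps.

container_cpu_load_average_10s{container="",image="",name="",namespace="",pod=""} 0 1658124636389
container_cpu_load_average_10s{container="",image="",name="",namespace="",pod=""} 0 1658124637748
container_cpu_load_average_10s{container="",image="",name="",namespace="",pod=""} 0 1658124640330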

@dashpole
Contributor

Hmmm... OK. I think only doing timestamp checks for histograms and summaries is a good idea. But it does mean that if you cause two metric streams to collide for a histogram using labeldrop, you can still hit this problem (e.g. if container_cpu_load_average_10s were actually a histogram), since it will try to merge the two histograms into one histogram.

@newly12
Contributor Author

newly12 commented Jul 20, 2022

I guess it is hard to tell what the right behavior should be. I do have #12522 to skip the validation for counters and gauges, but as you mentioned, it could run into issues for histograms and summaries. I am not sure whether Prometheus has such enforcement (I haven't tested it); maybe we should align with Prometheus and keep the same behavior, or maybe offer an option for users to disable the validation even though it may produce inaccurate results?

@dashpole
Contributor

Prometheus doesn't have a strict notion of a histogram in the protocol (OpenMetrics does), so it doesn't have any enforcement. It is more of a convention to group together a bunch of counters into a histogram.

But even if we were to skip validation on two colliding histograms, they would likely fail to produce a valid histogram (buckets could be duplicated, or could produce a case where a higher bucket has a smaller cumulative value).
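As a hedged illustration (the metric name and values here are hypothetical, not from this issue), consider two histogram bucket streams that collide once the id label is dropped:

# Before dropping the id label (hypothetical histogram)
http_request_duration_seconds_bucket{id="/a",le="0.5"} 10
http_request_duration_seconds_bucket{id="/a",le="1"} 12
http_request_duration_seconds_bucket{id="/b",le="0.5"} 3
http_request_duration_seconds_bucket{id="/b",le="1"} 4
# After labeldrop of id, both streams expose le="0.5" and le="1" with identical
# labels: a naive merge duplicates buckets, and mixing the two streams can leave a
# higher bucket (le="1", value 4) with a smaller cumulative count than a lower
# bucket (le="0.5", value 10).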

@mx-psi mx-psi added the receiver/prometheus Prometheus receiver label Jul 21, 2022
@github-actions
Contributor

Pinging code owners: @Aneurysm9 @dashpole

@newly12
Contributor Author

newly12 commented Aug 11, 2022

But even if we were to skip validation on two colliding histograms, they would likely fail to produce a valid histogram (buckets could be duplicated, or could produce a case where a higher bucket has a smaller cumulative value).

Yeah, that's a valid concern. Any suggestion for handling this kind of issue? Is there a way to drop only the colliding metrics instead of failing all scraped metrics in one batch/group?

@dashpole
Contributor

Is there a way to drop only the colliding metrics instead of failing all scraped metrics in one batch/group?

I think that's already the behavior today... Does it seem to be behaving differently?

Overall, I would recommend using the metricstransform processor to aggregate the label away (you can sum or average the colliding metrics), rather than dropping the dimension in prom relabel rules: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/metricstransformprocessor#metrics-transform-processor
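A minimal sketch of such a metricstransform configuration, assuming the goal is to sum the colliding series together instead of dropping the id label in relabel rules (the include pattern, label_set, and aggregation_type below are illustrative choices, not taken from this thread):

processors:
  metricstransform:
    transforms:
      - include: container_cpu_load_average_10s
        match_type: strict
        action: update
        operations:
          # Keep only these labels and aggregate points that become identical.
          - action: aggregate_labels
            label_set: [container, image, name, namespace, pod]
            aggregation_type: sum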

@evan-bradley evan-bradley added question Further information is requested waiting for author labels Sep 9, 2022
@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Nov 11, 2022
@jmacd
Contributor

jmacd commented Mar 29, 2023

Their groupKey will be the same once the id label is dropped, so they fall into the same metricGroup

Isn't this a single-writer rule violation? It sounds like you've erased a label and produced a bunch of points w/ the same labels and different timestamps. That, in my understanding, is the definition of overlapping streams, something that Prometheus cautions you against doing but does not enforce. Is the problem that OTel is enforcing this constraint where Prometheus wasn't?

@jmacd
Contributor

jmacd commented Apr 5, 2023

My recommendation is to leave the existing validation for any metric data point that has defined temporality (e.g., Counter, UpDownCounter, Histogram) AND any metric data point that is composed of multiple Prometheus series (e.g., Histogram, Summary). This leaves the potential to relax this validation for Gauges only, because Gauges generally do not require a start timestamp to establish their correct interpretation, so a single-writer violation is not a first-order problem. A combination of gauges from multiple locations will, when combined, look like they came from a single writer with many interleaved data points.

@github-actions github-actions bot removed the Stale label May 26, 2023
@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Jul 26, 2023
@github-actions
Contributor

This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions github-actions bot closed this as not planned on Sep 24, 2023