[prometheusreceiver] should not validate combined metrics timestamp for type like gauge, counter #12498
Comments
It looks like I can't reproduce the issue with the above metrics. However, while scraping cAdvisor metrics from one cluster, all of them are failing, while some other clusters seem to be OK. I will dig into this some more and get back here.
That does look problematic. Let me know if you want help investigating. I'm also OK reverting the PR in the meantime.
@dashpole I missed one part of the config while reproducing this error. I can reproduce the error if I add such config to the prometheus receiver config,
given metrics like this
their |
Hmmm... OK. I think only doing timestamp checks for histograms and summaries is a good idea. But it does mean that if you cause two metric streams to collide for a histogram using |
I guess it is hard to tell what the right behavior should be. I do have #12522 to skip the validation for counters and gauges, but as you mentioned, it could run into issues for histograms and summaries. I am not sure whether Prometheus has such enforcement (I haven't tested it). Maybe we should align with Prometheus and keep the same behavior, or maybe offer an option for users to disable the validation even though it may cause inaccurate results?
Prometheus doesn't have a strict notion of a histogram in the protocol (OpenMetrics does), so it doesn't have any enforcement. It is more of a convention to group together a bunch of counters into a histogram. But even if we were to skip validation on two colliding histograms, they would likely fail to produce a valid histogram (buckets could be duplicated, or could produce a case where a higher bucket has a smaller cumulative value).
Pinging code owners: @Aneurysm9 @dashpole
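To make the "invalid histogram" failure mode concrete, here is a minimal sketch in Go of the two checks mentioned above: duplicated buckets and a higher bucket with a smaller cumulative value. The `bucket` type and `validateBuckets` function are hypothetical names made up for illustration; this is not the receiver's actual code.

```go
package main

import (
	"fmt"
	"sort"
)

// bucket is a hypothetical stand-in for one cumulative histogram bucket,
// i.e. one `le` series of a Prometheus-style histogram.
type bucket struct {
	upperBound float64 // value of the `le` label
	cumulative uint64  // cumulative count of observations <= upperBound
}

// validateBuckets returns an error if the buckets cannot form a well-formed
// cumulative histogram: either an upper bound appears more than once, or a
// higher bucket has a smaller cumulative count than a lower one.
func validateBuckets(buckets []bucket) error {
	// Work on a copy so the caller's slice is not reordered.
	sorted := append([]bucket(nil), buckets...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].upperBound < sorted[j].upperBound
	})
	for i := 1; i < len(sorted); i++ {
		prev, cur := sorted[i-1], sorted[i]
		if cur.upperBound == prev.upperBound {
			return fmt.Errorf("duplicate bucket le=%g", cur.upperBound)
		}
		if cur.cumulative < prev.cumulative {
			return fmt.Errorf("bucket le=%g has count %d, below le=%g with count %d",
				cur.upperBound, cur.cumulative, prev.upperBound, prev.cumulative)
		}
	}
	return nil
}

func main() {
	// Two histogram streams collapsed into one after a label was dropped:
	// the le=0.5 bucket now appears twice with different counts.
	collided := []bucket{{0.1, 2}, {0.5, 5}, {0.5, 3}, {1, 9}}
	fmt.Println(validateBuckets(collided))
}
```

So even with the timestamp check skipped, a collided histogram would probably still be rejected or produce nonsense further down the pipeline.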
Yeah, that's a valid concern. Any suggestions for handling this kind of issue? Is there a way to drop only the colliding metrics instead of failing all scraped metrics in one batch/group?
I think that's already the behavior today... Does it seem to be behaving differently? Overall, I would recommend using the metricstransform processor to aggregate the label away (you can sum, or average the colliding metrics), rather than dropping the dimension in prom relabel rules: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/metricstransformprocessor#metrics-transform-processor
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself. |
Isn't this a single-writer rule violation? It sounds like you've erased a label and produced a bunch of points w/ the same labels and different timestamps. That, in my understanding, is the definition of overlapping streams, something that Prometheus cautions you against doing but does not enforce. Is the problem that OTel is enforcing this constraint where Prometheus wasn't?
My recommendation is to leave the existing validation for any metric data point that has defined temporality (e.g., Counter, UpDownCounter, Histogram) AND any metric data point that is comprised of multiple Prometheus series (e.g., Histogram, Summary). This leaves the potential to relax this validation for Gauges only, because Gauges generally do not require a start timestamp to establish their correct interpretation, so a single-writer violation is not a first-order problem. A combination of gauges from multiple locations will, when combined, look like they came from a single writer with many interleaved data points.
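The rule proposed above could be sketched roughly as follows. This is only an illustration in Go: the `metricType` constants and `needsTimestampValidation` are hypothetical names, not identifiers from the prometheusreceiver, and only the four Prometheus metric types are modeled as an assumption.

```go
package main

import "fmt"

// metricType is a hypothetical enum of Prometheus metric types.
type metricType string

const (
	typeGauge     metricType = "gauge"
	typeCounter   metricType = "counter"
	typeHistogram metricType = "histogram"
	typeSummary   metricType = "summary"
)

// needsTimestampValidation reports whether points grouped into one metric
// should be required to share a consistent timestamp.
//
// Gauges are exempt because they carry no start timestamp or temporality.
// Counters keep the check because they have temporality; histograms and
// summaries keep it because they are assembled from several series.
// Unknown types are validated too, as the conservative default.
func needsTimestampValidation(t metricType) bool {
	switch t {
	case typeGauge:
		return false
	default:
		return true
	}
}

func main() {
	for _, t := range []metricType{typeGauge, typeCounter, typeHistogram, typeSummary} {
		fmt.Printf("%s: validate=%t\n", t, needsTimestampValidation(t))
	}
}
```

Defaulting unknown types to "validate" is a design choice in this sketch, so that relaxing the check stays an explicit, per-type decision.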
This issue has been closed as inactive because it has been stale for 120 days with no activity. |
Is your feature request related to a problem? Please describe.
Failed to scrape cAdvisor metrics due to timestamp validation of the combined metrics; however, container_cpu_load_average_10s is a gauge.
Help text for the container_cpu_load_average_10s metric:
Describe the solution you'd like
If histogram and summary points are required to have consistent timestamps (if I read #9385 correctly), should the validation be skipped for other metric types?
Describe alternatives you've considered
Additional context
can be reproduced by given such metrics