[receiver/prometheus] Histograms without buckets are dropped #22070
Comments
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Requiring prometheus histograms to have at least one bucket is in line with the OpenMetrics spec for histograms: a histogram MUST have a +Inf bucket, whose value is the same as the _count series. An OpenTelemetry histogram without any buckets would therefore, per the spec, need a +Inf bucket added if it is exported by a Prometheus exporter.
How did you encounter a prometheus histogram without any buckets?
I dropped the buckets at ingest using a relabeling rule. To give a more specific example, the Kubernetes Apiserver exposes an `apiserver_request_duration_seconds` histogram, which I trim down to its count and sum with a relabel rule:

```yaml
- action: keep
  regex: apiserver_request_duration_seconds_(?:count|sum)
  sourceLabels:
  - __name__
```

If I do this using plain Prometheus, I will get the `_count` and `_sum` series ingested as expected.
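As a rough sketch in plain Go (this is illustrative, not the collector's relabeling code), the keep rule above retains only the `_count` and `_sum` series; Prometheus fully anchors relabel regexes, which the explicit `^...$` reproduces here:

```go
package main

import (
	"fmt"
	"regexp"
)

// keepRule mirrors the relabel regex above. Prometheus anchors relabel
// regexes implicitly, so we anchor explicitly to match that behavior.
var keepRule = regexp.MustCompile(`^apiserver_request_duration_seconds_(?:count|sum)$`)

// kept reports whether a series name would survive the keep rule.
func kept(name string) bool {
	return keepRule.MatchString(name)
}

func main() {
	for _, name := range []string{
		"apiserver_request_duration_seconds_count",
		"apiserver_request_duration_seconds_sum",
		"apiserver_request_duration_seconds_bucket",
	} {
		fmt.Println(name, kept(name))
	}
	// The bucket series is dropped at relabeling time, which is what
	// produces a histogram with only count and sum.
}
```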
There's also #22853 as a solution to the same problem, but it's not as efficient to have to scrape and process the whole Histogram just to drop all the buckets later.
@dashpole in a similar vein, the receiver rejects a Summary without quantiles, while Prometheus happily accepts it. I'm reporting this in the context of attempting to migrate a large K8s environment from Prometheus to OTel with the prometheus receiver. So I need to deal with targets producing metrics that aren't up to spec, with scrape configs I don't own or control, and other fun bits like that.

I suppose the fundamental problem here is that Prometheus itself is very permissive about what it accepts, and doesn't actually check any of the spec requirements, whereas the receiver is stricter on both the OpenMetrics and OTel sides. At a high level, what can we do about this? Some kind of permissive mode where, if we run into a problem, we just shrug and ingest the datapoint as Unknown?

The Histogram problem which this issue describes can be worked around later in the OTel pipeline, but the Summary problem causes the whole scrape to fail, so the only solution seems to be to fix all the offending targets. It would at the very least be nice if we could ingest all the correct data from a given scrape even if there are errors along the way.
Thanks, and I'm sorry for making it harder on you. I think we should revisit some of the validation, especially for summaries. Off the top of my head, I don't remember why we require quantiles. The only thing that would make sense would be if quantiles were required in OTLP; if they aren't, we should definitely remove that requirement. OTel only has summaries to support these sorts of "legacy" cases, so extra validation doesn't make any sense for summaries.

My recollection of the +Inf bucket requirement for histograms is that it protected against client-side bugs where some histogram buckets were missing from the exposition. It might be worth making it configurable, or dropping the requirement altogether.
I'll raise this at the Prometheus WG next Wednesday.
Summary quantiles should be optional: https://github.com/open-telemetry/opentelemetry-proto/blob/c4dfbc51f3cd4089778555a2ac5d9bc093ed2956/opentelemetry/proto/metrics/v1/metrics.proto#L634. We should definitely remove the requirement to have a quantile. For histograms, the +Inf bucket is always the same as the _count series. We can either:
Good to hear. I'll create a new issue for this, and can work on the change afterwards, if that's ok.
Maybe just validate that the value of this bucket is correct, if it exists? If it doesn't exist, we can just ignore it.
I was the only one at the WG meeting today, so I wasn't able to discuss it. @Aneurysm9, unless you have concerns, I think we should start to ignore the +Inf bucket. The only potential negative consequence would be that the +Inf bucket would "reappear" if you exported it using a prometheus exporter, but that seems unlikely to cause issues.
Ignoring the +Inf bucket shouldn't be breaking, right? It is always the same as the count.
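To illustrate the invariant being relied on here (the types and values below are made up, not the receiver's internal representation): Prometheus histogram buckets are cumulative, so the +Inf bucket always carries the same value as the `_count` series, and dropping it loses no information as long as the count is kept.

```go
package main

import "fmt"

// infMatchesCount checks the invariant discussed above: in Prometheus
// exposition, buckets are cumulative over all observations, so the
// +Inf bucket must equal the total observation count.
func infMatchesCount(buckets map[string]uint64, count uint64) bool {
	return buckets["+Inf"] == count
}

func main() {
	// Made-up cumulative buckets for a single histogram scrape:
	// 5 observations <= 0.1, 9 <= 1, and 12 in total.
	buckets := map[string]uint64{"0.1": 5, "1": 9, "+Inf": 12}
	fmt.Println(infMatchesCount(buckets, 12)) // true
}
```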
I think I was misunderstanding what you wanted to do there. So, just to be clear:
Is that right?
Yes. I think the end result would be something that produces a single OTel bucket even if there are no prometheus buckets. |
Alright, that's what I've done in #23448.
**Description:** Prometheus receiver currently drops Histograms without any buckets. These are, however, explicitly allowed by the OTel spec, and can be quite useful. This change allows ingesting them. When we do so, we add an additional bucket at +Inf equal to the `count` attribute of the Histogram.

**Link to tracking Issue:** #22070

**Testing:** Modified existing tests.
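A minimal sketch of the idea in that change, using simplified stand-in types rather than the receiver's actual ones (field and function names here are illustrative):

```go
package main

import "fmt"

// histogram is a simplified stand-in for a histogram series: the _count
// and _sum values, with optional explicit buckets. The real receiver
// works on pmetric types instead.
type histogram struct {
	count   uint64
	sum     float64
	bounds  []float64 // explicit bucket upper bounds, may be empty
	buckets []uint64  // cumulative counts per bound, may be empty
}

// normalize sketches the fix: if a histogram arrives without any
// buckets, synthesize a single +Inf bucket whose count equals the
// histogram's total count, so consumers always see at least one bucket.
// In OTLP, the implicit +Inf bucket is the final bucket count with no
// corresponding explicit bound.
func normalize(h histogram) histogram {
	if len(h.buckets) == 0 {
		h.bounds = nil
		h.buckets = []uint64{h.count}
	}
	return h
}

func main() {
	h := normalize(histogram{count: 42, sum: 3.14})
	fmt.Println(len(h.buckets), h.buckets[0]) // 1 42
}
```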
Fixed in #23448
Component(s)
receiver/prometheus
What happened?
I wanted to ingest just the sum and count timeseries for a histogram using prometheus receiver, and used an appropriate relabel config. But the receiver dropped the histogram.
Description
If we ingest a histogram without buckets - only the count and sum parts - prometheusreceiver drops it.
Steps to Reproduce
Simplest way is probably to modify data for one of the unit tests. I verified it with `opentelemetry-collector-contrib/receiver/prometheusreceiver/metrics_receiver_test.go`, line 1331 at commit 19cacb0.
Expected Result
I'd have expected to get a histogram without buckets. The spec says this is valid.
Actual Result
The histogram was dropped.
Collector version
0.77.0
Additional context
What I'm trying to do is to ingest only the sum and count parts of the histogram. This works on Prometheus itself.
The alternative would be to ingest the buckets, and then drop them in a processor. There currently aren't any functions to do this to histograms, but we do have ones for summaries, so it wouldn't be a stretch to add them.
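Such a processor function for histograms doesn't exist yet, but the shape of it is simple. A hypothetical sketch (the type and helper below are illustrative, not the collector's pdata API):

```go
package main

import "fmt"

// histogramPoint is a simplified stand-in for a histogram datapoint;
// the real collector types live in pmetric and differ from this.
type histogramPoint struct {
	count   uint64
	sum     float64
	bounds  []float64
	buckets []uint64
}

// stripBuckets is a hypothetical processor helper: it discards all
// bucket information, keeping only the count and sum, which is the
// alternative workaround described above.
func stripBuckets(p histogramPoint) histogramPoint {
	p.bounds = nil
	p.buckets = nil
	return p
}

func main() {
	p := stripBuckets(histogramPoint{
		count:   7,
		sum:     1.5,
		bounds:  []float64{0.1, 1},
		buckets: []uint64{2, 5, 7},
	})
	fmt.Println(p.count, p.sum, len(p.buckets)) // 7 1.5 0
}
```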
I believe this happens as a result of the following check: `opentelemetry-collector-contrib/receiver/prometheusreceiver/internal/metricfamily.go`, line 90 at commit 19cacb0.
I haven't verified if this affects simpleprometheusreceiver, but I assume it does.