-
Notifications
You must be signed in to change notification settings - Fork 896
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adapt default histogram boundaries to seconds as the new base unit #3509
Adapt default histogram boundaries to seconds as the new base unit #3509
Conversation
Signed-off-by: Fabian Stäber <[email protected]>
I don't think we can do that, since we marked the defaults for the explicit bucket histogram as stable. We were able to change it for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These buckets are used for more than measurements of time. Changing their value is not motivated.
I thought that changing default bucket boundaries is not considered a breaking change in the OTel Spec, but I might be wrong here.
I'm not sure about this argument. The current defaults were explicitly defined for representing durations in milliseconds (they were copied from Prometheus' defaults for seconds and scaled by a factor or 1000). |
"Designed for" != "Used for" |
I am now not sure anymore about the outcome of that discussion. My understanding was that changing the default buckets for the Histogram instrument itself is a breaking change, but changing the defaults for certain semantic conventions is not. This is because the Histogram spec is marked stable, but the semantic conventions are not. IIRC, that was our "out" in #2977 for being compatible but also not breaking the specification. All the semconv metrics recorded with OTel should be compatible with Prometheus, as the defaults are now in seconds. The hints/advice API gives you a way to set these defaults. Instrumentations that want to be compliant with the semantic convention specification must export these metrics in seconds. For custom metrics outside the semantic conventions, users would probably want to set up "the right buckets" themselves (or use exponential histograms). They are the domain experts. My feeling is that the defaults we have today are very useful to many people in many situations, and should not be changed, even if it turns out not to be a breaking change. |
Changing bucket boundaries is not considered a breaking change for instrumentation (this is also true in prometheus land). It does alter error rates, but should not alter interaction. That said, changing unit does break downstream usage, so perhaps there was a conflation of concerns there. |
Just to clarify: This is about changing the default buckets, not about changing the unit. If a user wants to measure latencies of a custom business transaction, the current OpenTelemetry default behavior is as follows:
So all observations < 5s will end up in the same bucket. This will make the default histogram useless in many cases. I agree that you can come up with scenarios where the current buckets are useful, but I think the most common usage of histograms is to measure latencies, and the default should be reasonable buckets for measuring latencies in seconds. |
Not sure I follow. Could you elaborate on this a little more? Do you mean if all of the data falls in the For someone who is using a stable metrics SDK today, recording business transaction latencies in milliseconds (since it was the default before) this would break their stuff, right? They have their alerts configured to trigger at 250ms, relying on the fact that the SDK provides consistent data. Their backend might not do automatic unit conversion or not take the unit into account at all. Then they upgrade to a later version of the metrics SDK that implements this updated spec, and all of a sudden all of their data ends up in the bucket >10 since the way the data is calculated did not change. I guess this is dependent on the backend implementation, but either the alerts now trigger all the time or never. Neither is good, and it requires the user to change code in order to upgrade the SDK, which feels like a breaking change to me - but please correct me if I am wrong here. I could also imagine the person seeing the alerts might not be the same person that did the SDK upgrade. If we don't change the defaults, users can go in in a coordinated way whenever they have the time, decide what unit makes the most sense for their use-case, switch the buckets over seconds-based buckets (if they decide that is the way to go), update how the data is collected (or, more realistically, multiply the value that they collected in milliseconds to get to seconds), and set up their alerts while they are touching the code anyway. I feel like I am missing an important piece of the puzzle here that seems clear to everyone else 😅 For semantic conventions, we don't have this problem. We can switch over the instruments that collect semantic conventions to seconds-based buckets, and when user dashboard/alerts break, we can point to the notice saying "Don't rely on the semantic conventions yet, they were not marked as stable". These semantic conventions actually report data in seconds from now on. |
Yes, the current buckets make sense for measuring latencies in milliseconds. However, OpenTelemetry decided to switch to seconds, and I think the default should not be "milliseconds for custom metrics" and "seconds for semantic conventions". |
I get that, my point is: The Metrics SDK specification containing the boundaries is marked stable. Users are relying on the default buckets defined in the stable spec for their custom metrics. If the spec changes the defaults, SDK authors follow the specification and release a new version with updated buckets. After updating the SDK, the user's histograms can't be used for estimations anymore because everything ends up in one bucket. This happens even though everything that user used was marked stable and no breaking change was indicated. I think it is a breaking change, but @jsuereth pointed out that it is not, IIUC (?) Finally, users might actually have consciously decided for going with milliseconds (or some other unit that aligns well with these buckets) because it is the best fit for their metric or they don't care about Prometheus because they don't use it - this change would force them to change their buckets manually to what they assumed was a good and stable default. |
I agree with this PR. I created an issue and then noticed this PR - #3525 There are several options:
Unfortunately, OTel currently is in the bad state. Are https://github.com/open-telemetry/opentelemetry-specification and https://github.com/open-telemetry/semantic-conventions different people? I think some communication is required to figure out how to get back to a good state. |
@JamesNK / @fstab have you seen #2977? Contains a lot of context around this issue. Summary is:
Its not ideal, but with the advice API it should be ok. The default bucket boundaries were only ever going to be useful for a pretty particular use case. Instrumentation can use advice to ensure resulting bucket boundaries are useful out of the box. |
This PR was marked stale due to lack of activity. It will be closed in 7 days. |
Closed as inactive. Feel free to reopen if this PR is still being worked on. |
Changes
There was a decision by the TC (technical committee) to use seconds as the base unit in OpenTelemetry rather than milliseconds.
As a result, we should also scale down the default histogram bucket boundaries by a factor of 1000.
Note that scaled down boundaries are already in use, see for example semantic conventions for
http.server.duration
.Related issues