Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adapt default histogram boundaries to seconds as the new base unit #3509

Closed

Conversation

fstab
Copy link
Member

@fstab fstab commented May 16, 2023

Changes

There was a decision by the TC (technical committee) to use seconds as the base unit in OpenTelemetry rather than milliseconds.

As a result, we should also scale down the default histogram bucket boundaries by a factor of 1000.

Note that scaled down boundaries are already in use, see for example semantic conventions for http.server.duration.

Related issues

@fstab fstab requested review from a team May 16, 2023 10:41
@pirgeo
Copy link
Member

pirgeo commented May 16, 2023

I don't think we can do that, since we marked the defaults for the explicit bucket histogram as stable. We were able to change it for http.server.duration because the semantic conventions themselves are not marked stable yet.

Copy link
Contributor

@MrAlias MrAlias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These buckets are used for more than measurements of time. Changing their value is not motivated.

@fstab
Copy link
Member Author

fstab commented May 19, 2023

since we marked the defaults for the explicit bucket histogram as stable

I thought that changing default bucket boundaries is not considered a breaking change in the OTel Spec, but I might be wrong here.

are used for more than measurements of time

I'm not sure about this argument. The current defaults were explicitly defined for representing durations in milliseconds (they were copied from Prometheus' defaults for seconds and scaled by a factor or 1000).

@MrAlias
Copy link
Contributor

MrAlias commented May 19, 2023

The current defaults were explicitly defined for representing durations..

"Designed for" != "Used for"

@pirgeo
Copy link
Member

pirgeo commented May 22, 2023

I thought that changing default bucket boundaries is not considered a breaking change in the OTel Spec, but I might be wrong here.

I am now not sure anymore about the outcome of that discussion. My understanding was that changing the default buckets for the Histogram instrument itself is a breaking change, but changing the defaults for certain semantic conventions is not. This is because the Histogram spec is marked stable, but the semantic conventions are not. IIRC, that was our "out" in #2977 for being compatible but also not breaking the specification.

All the semconv metrics recorded with OTel should be compatible with Prometheus, as the defaults are now in seconds. The hints/advice API gives you a way to set these defaults. Instrumentations that want to be compliant with the semantic convention specification must export these metrics in seconds. For custom metrics outside the semantic conventions, users would probably want to set up "the right buckets" themselves (or use exponential histograms). They are the domain experts. My feeling is that the defaults we have today are very useful to many people in many situations, and should not be changed, even if it turns out not to be a breaking change.

@jsuereth
Copy link
Contributor

I thought that changing default bucket boundaries is not considered a breaking change in the OTel Spec, but I might be wrong here.

Changing bucket boundaries is not considered a breaking change for instrumentation (this is also true in prometheus land). It does alter error rates, but should not alter interaction.

That said, changing unit does break downstream usage, so perhaps there was a conflation of concerns there.

@fstab
Copy link
Member Author

fstab commented May 26, 2023

Just to clarify: This is about changing the default buckets, not about changing the unit.

If a user wants to measure latencies of a custom business transaction, the current OpenTelemetry default behavior is as follows:

  • Unit is seconds.
  • Buckets are 0, 5s, 10s, 25s, 50s, 75s, 100s, 250s, 500s, 750s, 1000s, 2500s, 5000s, 7500s, 10000s.

So all observations < 5s will end up in the same bucket. This will make the default histogram useless in many cases.

I agree that you can come up with scenarios where the current buckets are useful, but I think the most common usage of histograms is to measure latencies, and the default should be reasonable buckets for measuring latencies in seconds.

@pirgeo
Copy link
Member

pirgeo commented May 30, 2023

It does alter error rates, but should not alter interaction.

Not sure I follow. Could you elaborate on this a little more? Do you mean if all of the data falls in the 10-Infinity bucket for second-aligned buckets, you could still query for "how many requests complete within <250 (previously milliseconds, but now seconds)"? The result would obviously have massive errors attached to it, but the query itself would not fail.

For someone who is using a stable metrics SDK today, recording business transaction latencies in milliseconds (since it was the default before) this would break their stuff, right? They have their alerts configured to trigger at 250ms, relying on the fact that the SDK provides consistent data. Their backend might not do automatic unit conversion or not take the unit into account at all. Then they upgrade to a later version of the metrics SDK that implements this updated spec, and all of a sudden all of their data ends up in the bucket >10 since the way the data is calculated did not change. I guess this is dependent on the backend implementation, but either the alerts now trigger all the time or never. Neither is good, and it requires the user to change code in order to upgrade the SDK, which feels like a breaking change to me - but please correct me if I am wrong here. I could also imagine the person seeing the alerts might not be the same person that did the SDK upgrade. If we don't change the defaults, users can go in in a coordinated way whenever they have the time, decide what unit makes the most sense for their use-case, switch the buckets over seconds-based buckets (if they decide that is the way to go), update how the data is collected (or, more realistically, multiply the value that they collected in milliseconds to get to seconds), and set up their alerts while they are touching the code anyway.

I feel like I am missing an important piece of the puzzle here that seems clear to everyone else 😅

For semantic conventions, we don't have this problem. We can switch over the instruments that collect semantic conventions to seconds-based buckets, and when user dashboard/alerts break, we can point to the notice saying "Don't rely on the semantic conventions yet, they were not marked as stable". These semantic conventions actually report data in seconds from now on.

@fstab
Copy link
Member Author

fstab commented May 30, 2023

Yes, the current buckets make sense for measuring latencies in milliseconds. However, OpenTelemetry decided to switch to seconds, and I think the default should not be "milliseconds for custom metrics" and "seconds for semantic conventions".

@pirgeo
Copy link
Member

pirgeo commented May 30, 2023

I get that, my point is: The Metrics SDK specification containing the boundaries is marked stable. Users are relying on the default buckets defined in the stable spec for their custom metrics. If the spec changes the defaults, SDK authors follow the specification and release a new version with updated buckets. After updating the SDK, the user's histograms can't be used for estimations anymore because everything ends up in one bucket. This happens even though everything that user used was marked stable and no breaking change was indicated. I think it is a breaking change, but @jsuereth pointed out that it is not, IIUC (?)

Finally, users might actually have consciously decided for going with milliseconds (or some other unit that aligns well with these buckets) because it is the best fit for their metric or they don't care about Prometheus because they don't use it - this change would force them to change their buckets manually to what they assumed was a good and stable default.

@JamesNK
Copy link

JamesNK commented Jun 1, 2023

I agree with this PR. I created an issue and then noticed this PR - #3525

There are several options:

  • Good: Recommending seconds for durations and changing the default boundaries. Completely aligns with Prometheus.
  • Good: Reverting back to recommending milliseconds for durations and leaving the boundaries as they are today.
  • Bad: Recommending seconds for durations and leaving the default boundaries configured for milliseconds.

Unfortunately, OTel currently is in the bad state.

Are https://github.com/open-telemetry/opentelemetry-specification and https://github.com/open-telemetry/semantic-conventions different people? I think some communication is required to figure out how to get back to a good state.

@jack-berg
Copy link
Member

@JamesNK / @fstab have you seen #2977? Contains a lot of context around this issue. Summary is:

  • Defaulting to second for durations is desirable for alignment with prometheus.
  • We'd like to change the default explicit bucket boundaries such that their sensible for measuring client / server response durations, but there isn't any consensus about whether this is an allowed change given that section of the metrics SDK document is stable.
  • The advice API was recently added which allows instrumentation to provide a recommendation to SDKs on useful default bucket boundaries.
  • The TC voted on this and decided to change the default duration unit to seconds. The expectation is that the advice API will be used for instrumentation recording http.[client|server].duration to specify bucket boundaries that are useful when measurements are recorded in seconds.

Unfortunately, OTel currently is in the bad state.

Its not ideal, but with the advice API it should be ok. The default bucket boundaries were only ever going to be useful for a pretty particular use case. Instrumentation can use advice to ensure resulting bucket boundaries are useful out of the box.

@github-actions
Copy link

github-actions bot commented Jun 9, 2023

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Jun 9, 2023
@github-actions
Copy link

Closed as inactive. Feel free to reopen if this PR is still being worked on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants