-
Notifications
You must be signed in to change notification settings - Fork 896
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
"Delta" sums for UpDownCounter are error prone #725
"Delta" sums for UpDownCounter are error prone #725
Comments
… the aggregation and temporality. This PR takes a more descriptive approach where the Aggregation (the last applied aggregation if any) is clearly defined and Temporality is available only where it makes sense. This can help clearly identify what are the possible values and properties for different types of Aggregations. There are some things that can be considered (TODOs left in the code): * open-telemetry/opentelemetry-specification#725 * Histogram/Summary sum - monotonicity? * Previous aggregation measurements: raw measurements or "derived" measurements (results of another aggregation). * Support for RawMeasurements (recorded via the Sync Instruments) IMPORTANT: * This PR is not equivalent with open-telemetry#168 or open-telemetry#181, this is inspired by these PRs but makes some changes that are not compatible with that PR. * This PR is an incremental update, without any significant semantic changes (except adding the notion of GAUGE), from the current state, more changes will be added in the future to resolve the TODOs. Signed-off-by: Bogdan Drutu <[email protected]>
… the aggregation and temporality. This PR takes a more descriptive approach where the Aggregation (the last applied aggregation if any) is clearly defined and Temporality is available only where it makes sense. This can help clearly identify what are the possible values and properties for different types of Aggregations. There are some things that can be considered (TODOs left in the code): * open-telemetry/opentelemetry-specification#725 * Histogram/Summary sum - monotonicity? * Previous aggregation measurements: raw measurements or "derived" measurements (results of another aggregation). * Support for RawMeasurements (recorded via the Sync Instruments) * Decide if support for INSTANTANEOUS temporality is still needed. IMPORTANT: * This PR is not equivalent with open-telemetry#168 or open-telemetry#181, this is inspired by these PRs but makes some changes that are not compatible with that PR. * This PR is an incremental update, without any significant semantic changes (except adding the notion of GAUGE), from the current state, more changes will be added in the future to resolve the TODOs. Signed-off-by: Bogdan Drutu <[email protected]>
…tails about the aggregation and temporality. (#182) * Change Metric type to be a more detailed structure with details about the aggregation and temporality. This PR takes a more descriptive approach where the Aggregation (the last applied aggregation if any) is clearly defined and Temporality is available only where it makes sense. This can help clearly identify what are the possible values and properties for different types of Aggregations. There are some things that can be considered (TODOs left in the code): * open-telemetry/opentelemetry-specification#725 * Histogram/Summary sum - monotonicity? * Previous aggregation measurements: raw measurements or "derived" measurements (results of another aggregation). * Support for RawMeasurements (recorded via the Sync Instruments) * Decide if support for INSTANTANEOUS temporality is still needed. IMPORTANT: * This PR is not equivalent with #168 or #181, this is inspired by these PRs but makes some changes that are not compatible with that PR. * This PR is an incremental update, without any significant semantic changes (except adding the notion of GAUGE), from the current state, more changes will be added in the future to resolve the TODOs. Signed-off-by: Bogdan Drutu <[email protected]> * Address feedback, added more TODOs and created issues Signed-off-by: Bogdan Drutu <[email protected]> * Update opentelemetry/proto/metrics/v1/metrics.proto Co-authored-by: Aaron Abbott <[email protected]> * Update opentelemetry/proto/metrics/v1/metrics.proto Co-authored-by: Tyler Yahn <[email protected]> * Update opentelemetry/proto/metrics/v1/metrics.proto Co-authored-by: Tyler Yahn <[email protected]> * Update opentelemetry/proto/metrics/v1/metrics.proto Co-authored-by: Tyler Yahn <[email protected]> * Update opentelemetry/proto/metrics/v1/metrics.proto Co-authored-by: Tyler Yahn <[email protected]> * Update opentelemetry/proto/metrics/v1/metrics.proto Co-authored-by: Tyler Yahn <[email protected]> * Fix indentation and comment for Type Signed-off-by: Bogdan Drutu <[email protected]> Co-authored-by: Aaron Abbott <[email protected]> Co-authored-by: Tyler Yahn <[email protected]>
I believe that "error prone" is a subjective assessment. If a system intends to use UpDownCounter events to measure a rate of change, then loss of deltas means simply "no data". If a system is using deltas for UpDownCounter events, they should derive rate information not absolute totals. I like to think of this as a meaningful vs. useful distinction. The use of deltas to convey UpDownCounter events is meaningful (and requires zero memory). It's only useful if you have reliable transport and/or are not deriving totals. That said I support open-telemetry/oteps#131 which states that the default should be cumulative. |
I think we decided this is not a great concern in the OTLP meeting today. I vote to close this issue. |
Currently there are some language implementations that "reset" the calculated sum (default aggregation for UpDownCounter) every export interval, and this can cause problems in real world.
The main idea comes from the fact that the consumer of these metrics are almost all the time interested in the final result of the Sum (not in changes over time) and when alerts are used for these metrics most of the time they will be configured based on the total/cumulative Sum. As examples from the UpDownCounter definition:
When exporting "deltas" or sums that are reset every export interval, the export pipeline becomes a single point of failure for the alerts, any dropped "delta" will influence the "current" value of the metric in an undefined way (may cause alerts that should fire to not fire, or alerts to fire when they should not). Because of this it is extremely important to send "Cumulative" or a.k.a current value for an UpDownCounter all the time, so that if the export pipeline drops one exported value the alerts will still function correctly.
Because of this argument, I think OpenTelemetry should try to "force" users to configure the library correctly by:
The text was updated successfully, but these errors were encountered: