Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Delta" sums for UpDownCounter are error prone #725

Closed
bogdandrutu opened this issue Jul 21, 2020 · 2 comments · Fixed by open-telemetry/oteps#131 or open-telemetry/opentelemetry-dotnet-contrib#734
Labels
bug Something isn't working priority:p2 Medium priority level release:required-for-ga Must be resolved before GA release, or nice to have before GA spec:metrics Related to the specification/metrics directory

Comments

@bogdandrutu
Copy link
Member

Currently there are some language implementations that "reset" the calculated sum (default aggregation for UpDownCounter) every export interval, and this can cause problems in real world.

The main idea comes from the fact that the consumer of these metrics are almost all the time interested in the final result of the Sum (not in changes over time) and when alerts are used for these metrics most of the time they will be configured based on the total/cumulative Sum. As examples from the UpDownCounter definition:

  • count the number of active requests - An alert will be set for "current number of active requests > 100"
  • count memory in use by instrumenting new and delete - An alert will be set for "current memory usage > 10GB"
  • count queue size by instrumenting enqueue and dequeue - An alert will be set for "current queue size > 1000"
  • count semaphore up and down operations - Not sure if an alert is necessary important for this, but will most likely be using the "current" value of the semaphore.

When exporting "deltas" or sums that are reset every export interval, the export pipeline becomes a single point of failure for the alerts, any dropped "delta" will influence the "current" value of the metric in an undefined way (may cause alerts that should fire to not fire, or alerts to fire when they should not). Because of this it is extremely important to send "Cumulative" or a.k.a current value for an UpDownCounter all the time, so that if the export pipeline drops one exported value the alerts will still function correctly.

Because of this argument, I think OpenTelemetry should try to "force" users to configure the library correctly by:

  1. Disabling the ability to compute delta sums for UpDownCounters and always export cumulative (current value).
  2. Remove support for delta sums from the protocol.
@bogdandrutu bogdandrutu added bug Something isn't working spec:metrics Related to the specification/metrics directory release:required-for-ga Must be resolved before GA release, or nice to have before GA priority:p2 Medium priority level labels Jul 21, 2020
bogdandrutu added a commit to bogdandrutu/opentelemetry-proto that referenced this issue Jul 29, 2020
… the aggregation and temporality.

This PR takes a more descriptive approach where the Aggregation (the last applied aggregation if any) is clearly defined and Temporality is available only where it makes sense. This can help clearly identify what are the possible values and properties for different types of Aggregations.

There are some things that can be considered (TODOs left in the code):

* open-telemetry/opentelemetry-specification#725
* Histogram/Summary sum - monotonicity?
* Previous aggregation measurements: raw measurements or "derived" measurements (results of another aggregation).
* Support for RawMeasurements (recorded via the Sync Instruments)

IMPORTANT:
* This PR is not equivalent with open-telemetry#168 or open-telemetry#181, this is inspired by these PRs but makes some changes that are not compatible with that PR.
* This PR is an incremental update, without any significant semantic changes (except adding the notion of GAUGE), from the current state,
  more changes will be added in the future to resolve the TODOs.

Signed-off-by: Bogdan Drutu <[email protected]>
bogdandrutu added a commit to bogdandrutu/opentelemetry-proto that referenced this issue Jul 29, 2020
… the aggregation and temporality.

This PR takes a more descriptive approach where the Aggregation (the last applied aggregation if any) is clearly defined and Temporality is available only where it makes sense. This can help clearly identify what are the possible values and properties for different types of Aggregations.

There are some things that can be considered (TODOs left in the code):

* open-telemetry/opentelemetry-specification#725
* Histogram/Summary sum - monotonicity?
* Previous aggregation measurements: raw measurements or "derived" measurements (results of another aggregation).
* Support for RawMeasurements (recorded via the Sync Instruments)
* Decide if support for INSTANTANEOUS temporality is still needed.

IMPORTANT:
* This PR is not equivalent with open-telemetry#168 or open-telemetry#181, this is inspired by these PRs but makes some changes that are not compatible with that PR.
* This PR is an incremental update, without any significant semantic changes (except adding the notion of GAUGE), from the current state, more changes will be added in the future to resolve the TODOs.

Signed-off-by: Bogdan Drutu <[email protected]>
bogdandrutu added a commit to open-telemetry/opentelemetry-proto that referenced this issue Jul 30, 2020
…tails about the aggregation and temporality. (#182)

* Change Metric type to be a more detailed structure with details about the aggregation and temporality.

This PR takes a more descriptive approach where the Aggregation (the last applied aggregation if any) is clearly defined and Temporality is available only where it makes sense. This can help clearly identify what are the possible values and properties for different types of Aggregations.

There are some things that can be considered (TODOs left in the code):

* open-telemetry/opentelemetry-specification#725
* Histogram/Summary sum - monotonicity?
* Previous aggregation measurements: raw measurements or "derived" measurements (results of another aggregation).
* Support for RawMeasurements (recorded via the Sync Instruments)
* Decide if support for INSTANTANEOUS temporality is still needed.

IMPORTANT:
* This PR is not equivalent with #168 or #181, this is inspired by these PRs but makes some changes that are not compatible with that PR.
* This PR is an incremental update, without any significant semantic changes (except adding the notion of GAUGE), from the current state, more changes will be added in the future to resolve the TODOs.

Signed-off-by: Bogdan Drutu <[email protected]>

* Address feedback, added more TODOs and created issues

Signed-off-by: Bogdan Drutu <[email protected]>

* Update opentelemetry/proto/metrics/v1/metrics.proto

Co-authored-by: Aaron Abbott <[email protected]>

* Update opentelemetry/proto/metrics/v1/metrics.proto

Co-authored-by: Tyler Yahn <[email protected]>

* Update opentelemetry/proto/metrics/v1/metrics.proto

Co-authored-by: Tyler Yahn <[email protected]>

* Update opentelemetry/proto/metrics/v1/metrics.proto

Co-authored-by: Tyler Yahn <[email protected]>

* Update opentelemetry/proto/metrics/v1/metrics.proto

Co-authored-by: Tyler Yahn <[email protected]>

* Update opentelemetry/proto/metrics/v1/metrics.proto

Co-authored-by: Tyler Yahn <[email protected]>

* Fix indentation and comment for Type

Signed-off-by: Bogdan Drutu <[email protected]>

Co-authored-by: Aaron Abbott <[email protected]>
Co-authored-by: Tyler Yahn <[email protected]>
@jmacd
Copy link
Contributor

jmacd commented Aug 11, 2020

I believe that "error prone" is a subjective assessment. If a system intends to use UpDownCounter events to measure a rate of change, then loss of deltas means simply "no data". If a system is using deltas for UpDownCounter events, they should derive rate information not absolute totals.

I like to think of this as a meaningful vs. useful distinction. The use of deltas to convey UpDownCounter events is meaningful (and requires zero memory). It's only useful if you have reliable transport and/or are not deriving totals.

That said I support open-telemetry/oteps#131 which states that the default should be cumulative.

@jmacd
Copy link
Contributor

jmacd commented Aug 12, 2020

I think we decided this is not a great concern in the OTLP meeting today. I vote to close this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working priority:p2 Medium priority level release:required-for-ga Must be resolved before GA release, or nice to have before GA spec:metrics Related to the specification/metrics directory
Projects
None yet
2 participants