Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to rebuild from deltas? #1566

Closed
rakyll opened this issue Mar 20, 2021 · 3 comments · Fixed by #1618
Closed

How to rebuild from deltas? #1566

rakyll opened this issue Mar 20, 2021 · 3 comments · Fixed by #1618
Assignees
Labels
area:data-model For issues related to data model release:required-for-ga Must be resolved before GA release, or nice to have before GA spec:metrics Related to the specification/metrics directory

Comments

@rakyll
Copy link
Contributor

rakyll commented Mar 20, 2021

An earlier discussion about rebuilding from deltas raised a question: open-telemetry/prometheus-interoperability-spec#25 (comment).

Currently, the delta data points don't have sequence numbers. There is no way to identify duplicates or missing data points. Generally speaking, we don't know how to rebuild from deltas without sequence numbers. The spec/data model should address this issue.

cc @RichiH @jmacd

@rakyll rakyll added the spec:metrics Related to the specification/metrics directory label Mar 20, 2021
@RichiH
Copy link
Member

RichiH commented Mar 22, 2021

This is a recurring discussion from the days when OpenMetrics and OpenCensus tried to merge.

Sequence numbers allow for detection, but not correction, of missing information.

A rough(!) equivalent would be UDP (sequence only) vs TCP (send windows, ACKs, retransmissions, state per receiver, local deletion after last ACK, etc.). This is significant overhead compared to plain deltas, so it comes down to design goals and constraints. As far as I know, data loss is not acceptable within OTel, so some TCP-like mechanism would be needed. If data stability can be reduced, the additional complexity could be reduced accordingly.

An in-between would be to rebuild the cumulative state directly at the emitter, which lead the discussion in early 2020 back to square one: that overall complexity would be lowest if the internal representation was cumulatives, not deltas, by default.

I don't know off-hand if different receivers may request different delta periods. If that's the case, the system above would need to carry several state sets, one per delta time range. The same is true for any node where deltas can rest, or are recomputed/regrouped/cached, in the overall pipeline graph.

@reyang reyang added the area:data-model For issues related to data model label Mar 22, 2021
@rakyll
Copy link
Contributor Author

rakyll commented Mar 31, 2021

cc @jrcamp @jack-berg

@jsuereth
Copy link
Contributor

jsuereth commented Apr 2, 2021

I'd like to split out a few things to identify/resolve:

  1. Delta-Based metrics imported via a collector from a non-OTel source (e.g. StatsD receiver). For these metrics we are limited regarding solutions (e.g. sequence numbers). While I think we should do our best to support this use case, if we can't guarantee 100% consistency with OpenMetrics (or we have some failure scenarios around missing counts) I think this could be acceptable. I.e. I think we should consider this use-case "Best Effort", and am happy to document this scenario, but I don't think we should aim to provide guaranteed transmission on a protocol that wasn't designed with this in mind.
  2. Delta-based metrics coming out of an SDK intended for a cumulative-backend (e.g. OTLP => OM). This relates to Changing OTLP exporters to export cumulative metrics by default #731 and I hope our answer is "It's easy to export cumulative or delta metrics out of SDKs" once Changing OTLP exporters to export cumulative metrics by default #731 is resolved and I'd like to push the discussion around this portion there.
  3. Delta-Based metrics inside OTLP for which we can't force cumulative-at-SDK time. Here is the crux of the issue as reported. Assuming we have OTLP-push data for Delta sums, how do we convert to a Cumulative for OM export. In this case, I think you outlined the right questions:
    • Can we detect missing data points and/or duplicates? (e.g. do we want sequence numbers on deltas?)
    • How do we resolve missing data points/duplicates? (e.g. reset the counter?)

Given point 2, I think point 3 is a lower priority issue. That said, I think we have some answers straight away.

First, regarding seqeunce numbers. Assuming the #1574 and the Single-Writer philosophy, we should be able to use timestamp + aggregation temporality to uniquely identify a delta within OTLP (see https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/metrics/v1/metrics.proto#L279)

Second, we do need to document what to do when we detect an out-of-order delta sum. In this case my proposal would be to reset the counter on detection. I'm reading @RichiH H's comment as "this is an ok thing to do, but not ideal' and I agree. If you we know we're outputing cumulative metrics, users should use the result of #731. In scenarios where that's not practical, this is the best we can (likely) do. If folks agree, I can write this up into the data model specification.

@jsuereth jsuereth self-assigned this Apr 2, 2021
@jsuereth jsuereth added the release:required-for-ga Must be resolved before GA release, or nice to have before GA label Apr 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:data-model For issues related to data model release:required-for-ga Must be resolved before GA release, or nice to have before GA spec:metrics Related to the specification/metrics directory
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants