Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Structure and Continuity; remove instantaneous Temporality #181

Closed
wants to merge 5 commits into from

Conversation

jmacd
Copy link
Contributor

@jmacd jmacd commented Jul 28, 2020

This is simplified from #168 by removing INSTANTANEOUS Temporality and making Structure and Continuity independent fields, yielding 12 combinations of Temporality, Structure, and Continuity.

Like #168, this moves Monotonicity into a Structure value, removes it from Type.

The removal of INSTANTANEOUS is accommodated by making StartTimeUnixNano optional when data is unaggregated. Raw measurements correspond to the API specification for raw data, meaning:

  • CONTINUOUS ADDING(_MONTONIC) raw measurements are changes to a sum
  • SNAPSHOT ADDING(_MONTONIC) raw measurements are direct sums


// Continuity describes how the measurement was captured,
// indicating whether measurements represent application-level
// events or were generated through a callback invoked by the SDK.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is not only the callback that can generate snapshots, aggregating any already aggregated data is a snapshot. Think in collector I can receive points from a Counter aggregated as a sum/delta then building a Histogram there makes the input snapshot

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also for a Counter (aggregating cumulative Sum) vs SumObserver (cumulative Sum calculated outside) - both sums were calculated by an aggregator (in our outside our library), and at the moment you export the sum there is minimal information that Snapshot vs Continuous provide - no more information about the number of measurements available. But in case of a Histogram where we also include the count of measurements this information may be useful (and probably is). So I am not 100% convinced that Continuity makes sense for scalar metrics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aggregating any already aggregated data is a snapshot

I would have stated that:

  • aggregating already-aggregated SNAPSHOT data is SNAPSHOT data
  • aggregating already-aggregated CONTINUOUS data is CONTINUOUS data

Combining multiple input metrics of different Continuity is not well defined in this (variation of the) protocol, and I'm not sure we should be aiming for a protocol that supports multiple input metrics regardless of whether they agree on Structure/Continuity. In all cases, the presence of StartTimeUnixNano != 0 indicates that the input data spanned a period of time as opposed to being defined at an instant where StartTimeUnixNano == TimeUnixNano.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at the moment you export the sum there is minimal information that Snapshot vs Continuous provide

My intention is that both of these fields tell us about the input measurements, which do not change when we aggregate.

There may be minimal information, but there is information here. If you see two raw measurements with the same timestamp and they are SNAPSHOT, they were made by the same callback and can be compared as a ratio. If you see two raw measurements with the same timestamp and they are CONTINUOUS, they are merely coincidental and should not be compared as a ratio.

A sum is a sum, and we know that it is sensible to compute a rate of a sum. A "gauge" (or non-sum) should not be exposed as a rate, because we don't know that the distance between A and B is the same as the distance between (A+x) and (B+x). This quality is the reason that Sums can be displayed as rates regardless of whether they are monotonic or not.

Copy link
Member

@bogdandrutu bogdandrutu Jul 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intention is that both of these fields tell us about the input measurements, which do not change when we aggregate.

I think this is not clear to me, will try to show an example to show where my confusion comes from:

  1. An "UpDownSumObserver" for "memory usage", and having an export interval of 10seconds, every 10 second I report the cumulative sum of all "new/delete" calls. User configures the library to just export the value read from the kernel with no changes.
  2. An "UpDownCounter" for "new/delete" calls, by instrumenting tcmalloc library. We have configured a cumulative sum aggregation (maybe without exemplars if that will be possible). Export interval is the same 10 seconds.

In this two cases I get exactly the same data - every 10 second a scalar metric with the sum of new/delete from the beginning of the process. The only difference may be "exemplars" which may or may not be present for the second metric (not guaranteed that they are present, if the user configures to not keep exemplars in the view configuration).

Now the idea is, way do we need to know and complicate the model for this case with continuous vs snapshot? Both are snapshots because even in the second metric we actually take a snapshot of the aggregator at a specific moment of time.

If you see two raw measurements with the same timestamp and they are SNAPSHOT, they were made by the same callback and can be compared as a ratio. If you see two raw measurements with the same timestamp and they are CONTINUOUS, they are merely coincidental and should not be compared as a ratio.

Here is where I agree with you, but I see things a bit different as proposed in #179, we have "Raw measurements" a.k.a CONTINUOUS which represents measurements recorded using the sync instruments with no other aggregation and any other "measurements" are snapshot independent of where that aggregation happens (in our library like in the case of the second metric, or outside our library in case of the first metric) because we actually take a snapshot of an Aggregator.

Probably I miss something, but here is where my current understanding is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in my proposal to clarify "Raw Measurements" imply continuos all the other types imply snapshot because we take a snapshot of an Aggregator that somewhere exists (inside or outside of our library)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this two cases I get exactly the same data - every 10 second a scalar metric with the sum of new/delete from the beginning of the process

The count of label sets means something different in these cases. The Observer instrument produces N points at the same logical instant in time, whereas the Recorder instrument produces N points over a window of time. If I ask how many distinct label sets there are from a Recorder (CONTINUOUS), I have to specify a time window for the question to be meaningful. For an Observer (SNAPSHOT), I can answer the question ("how many distinct label sets") without specifying a time range.

Copy link
Member

@bogdandrutu bogdandrutu Jul 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The count of label sets means something different in these cases. The Observer instrument produces N points at the same logical instant in time, whereas the Recorder instrument produces N points over a window of time. If I ask how many distinct label sets there are from a Recorder (CONTINUOUS), I have to specify a time window for the question to be meaningful. For an Observer (SNAPSHOT), I can answer the question ("how many distinct label sets") without specifying a time range.

I think this is where I miss something (maybe here is where the disagreement comes from). In both cases for me when I look at the N points (N different label sets) they accumulated (sums) values over that time interval:

  • -------- // start time t0'
  • t0 - new object with location "stack" size 10
  • t1 - new object with location "heap" size 100
  • t2 - delete object with location "stack" size 10
  • t3 - new object with location "stack" size 20
  • -------- // exporting t3'
  • t4 - delete object with location "stack" size 20
  • t5 - new object with location "stack" size 30
  • t6 - delete object with location "heap" size 100
  • t7 - new object with location "heap" size 200
  • -------- // exporting t7'
  • t8 - delete object with location "stack" size 30
  • t9 - delete object with location "heap" size 200
  • -------- // exporting t9'

First Metric:

  • first export (start_time= t0'; end_time = t3' - "location"="stack" 20; "location"="heap" 100)
  • second export (start_time= t0'; end_time = t7' - "location"="stack" 30; "location"="heap" 200)
  • third export (start_time= t0'; end_time = t9' - "location"="stack" 0; "location"="heap" 0)

Second Metric:

  • first export (start_time= t0'; end_time = t3' - "location"="stack" 20; "location"="heap" 100)
  • second export (start_time= t0'; end_time = t7' - "location"="stack" 30; "location"="heap" 200)
  • third export (start_time= t0'; end_time = t9' - "location"="stack" 0; "location"="heap" 0)

In both cases we have the same number of unique label sets, and we cannot assign a specific moment when a value for a specific label set was updated, because that information is lost during aggregation, the only thing we know is that in the interval t0' - tx' we had this number of distinct label sets and these are the "current values" corresponding to every label set.

To get a bit more confused, even for the "second metric" we actually read the value at one specific moment "exporting time", so why is that different? I think once we left the library by adding the data to the wire, we see these metrics exactly the same, inside the library there may be some interpretations like the one you had, but once we put the data on the wire they are identical to me.

Copy link
Member

@bogdandrutu bogdandrutu Jul 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I misunderstood this:

In all cases, the presence of StartTimeUnixNano != 0 indicates that the input data spanned a period of time as opposed to being defined at an instant where StartTimeUnixNano == TimeUnixNano.

I am trying again to see if that makes sense:

  1. There are some Observers where we know the start time, for example a SumObserver may have the start time equal with the start of the process, and in this case their data will look exactly the same as the "UpDownCounter" example. In this case the metric is considered "CONTINUOUS".
  2. There are some Observers where we cannot determine the start time correctly, even for a SumObserver, in this case the collected points will miss the start_time and hence they will be considered "SNAPSHOT".

If this is the example that you have in mind, I completely agree that this is a possibility in the real world, but this makes me to think that Observers especially "SumObserver" and "UpDownSumObserver" need to support a configuration where users will specify the start_time (or just a boolean to say if it is equal to the start of the process).

@jmacd
Copy link
Contributor Author

jmacd commented Jul 28, 2020

There is a part of this proposal that doesn't sit well with me--the distinction between raw snapshot and raw continuous measurements is arbitrary. In open-telemetry/oteps#88 we discussed a number of potential instruments that are sensible but not included in the API specification, those would have hypothetical names SumRecorder, UpDownSumRecorder (CONTINUOUS) and DeltaObserver, UpDownDeltaObserver (SNAPSHOT) in current terms. I'd like it if OTLP supported this kind of data, even if the API doesn't expose such instruments.

If we were to introduce these instruments, we'd need a way to distinguish whether a raw sum was entered as a change or as a sum, which is something that @bogdandrutu included under the oneof in #179, as this seems to be independent of Temporality in the case where there is no temporal aggregation. I was thinking of a new type-related field to indicate this property, but can't find the right name.

bogdandrutu added a commit to bogdandrutu/opentelemetry-proto that referenced this pull request Jul 29, 2020
… the aggregation and temporality.

This PR takes a more descriptive approach where the Aggregation (the last applied aggregation if any) is clearly defined and Temporality is available only where it makes sense. This can help clearly identify what are the possible values and properties for different types of Aggregations.

There are some things that can be considered (TODOs left in the code):

* open-telemetry/opentelemetry-specification#725
* Histogram/Summary sum - monotonicity?
* Previous aggregation measurements: raw measurements or "derived" measurements (results of another aggregation).
* Support for RawMeasurements (recorded via the Sync Instruments)

IMPORTANT:
* This PR is not equivalent with open-telemetry#168 or open-telemetry#181, this is inspired by these PRs but makes some changes that are not compatible with that PR.
* This PR is an incremental update, without any significant semantic changes (except adding the notion of GAUGE), from the current state,
  more changes will be added in the future to resolve the TODOs.

Signed-off-by: Bogdan Drutu <[email protected]>
bogdandrutu added a commit to bogdandrutu/opentelemetry-proto that referenced this pull request Jul 29, 2020
… the aggregation and temporality.

This PR takes a more descriptive approach where the Aggregation (the last applied aggregation if any) is clearly defined and Temporality is available only where it makes sense. This can help clearly identify what are the possible values and properties for different types of Aggregations.

There are some things that can be considered (TODOs left in the code):

* open-telemetry/opentelemetry-specification#725
* Histogram/Summary sum - monotonicity?
* Previous aggregation measurements: raw measurements or "derived" measurements (results of another aggregation).
* Support for RawMeasurements (recorded via the Sync Instruments)
* Decide if support for INSTANTANEOUS temporality is still needed.

IMPORTANT:
* This PR is not equivalent with open-telemetry#168 or open-telemetry#181, this is inspired by these PRs but makes some changes that are not compatible with that PR.
* This PR is an incremental update, without any significant semantic changes (except adding the notion of GAUGE), from the current state, more changes will be added in the future to resolve the TODOs.

Signed-off-by: Bogdan Drutu <[email protected]>
@jmacd
Copy link
Contributor Author

jmacd commented Jul 29, 2020

See #182

@jmacd jmacd closed this Jul 29, 2020
bogdandrutu added a commit that referenced this pull request Jul 30, 2020
…tails about the aggregation and temporality. (#182)

* Change Metric type to be a more detailed structure with details about the aggregation and temporality.

This PR takes a more descriptive approach where the Aggregation (the last applied aggregation if any) is clearly defined and Temporality is available only where it makes sense. This can help clearly identify what are the possible values and properties for different types of Aggregations.

There are some things that can be considered (TODOs left in the code):

* open-telemetry/opentelemetry-specification#725
* Histogram/Summary sum - monotonicity?
* Previous aggregation measurements: raw measurements or "derived" measurements (results of another aggregation).
* Support for RawMeasurements (recorded via the Sync Instruments)
* Decide if support for INSTANTANEOUS temporality is still needed.

IMPORTANT:
* This PR is not equivalent with #168 or #181, this is inspired by these PRs but makes some changes that are not compatible with that PR.
* This PR is an incremental update, without any significant semantic changes (except adding the notion of GAUGE), from the current state, more changes will be added in the future to resolve the TODOs.

Signed-off-by: Bogdan Drutu <[email protected]>

* Address feedback, added more TODOs and created issues

Signed-off-by: Bogdan Drutu <[email protected]>

* Update opentelemetry/proto/metrics/v1/metrics.proto

Co-authored-by: Aaron Abbott <[email protected]>

* Update opentelemetry/proto/metrics/v1/metrics.proto

Co-authored-by: Tyler Yahn <[email protected]>

* Update opentelemetry/proto/metrics/v1/metrics.proto

Co-authored-by: Tyler Yahn <[email protected]>

* Update opentelemetry/proto/metrics/v1/metrics.proto

Co-authored-by: Tyler Yahn <[email protected]>

* Update opentelemetry/proto/metrics/v1/metrics.proto

Co-authored-by: Tyler Yahn <[email protected]>

* Update opentelemetry/proto/metrics/v1/metrics.proto

Co-authored-by: Tyler Yahn <[email protected]>

* Fix indentation and comment for Type

Signed-off-by: Bogdan Drutu <[email protected]>

Co-authored-by: Aaron Abbott <[email protected]>
Co-authored-by: Tyler Yahn <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants