-
Notifications
You must be signed in to change notification settings - Fork 265
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Structure and Continuity; remove instantaneous Temporality #181
Conversation
|
||
// Continuity describes how the measurement was captured, | ||
// indicating whether measurements represent application-level | ||
// events or were generated through a callback invoked by the SDK. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is not only the callback that can generate snapshots, aggregating any already aggregated data is a snapshot. Think in collector I can receive points from a Counter aggregated as a sum/delta then building a Histogram there makes the input snapshot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also for a Counter (aggregating cumulative Sum) vs SumObserver (cumulative Sum calculated outside) - both sums were calculated by an aggregator (in our outside our library), and at the moment you export the sum there is minimal information that Snapshot vs Continuous provide - no more information about the number of measurements available. But in case of a Histogram where we also include the count of measurements this information may be useful (and probably is). So I am not 100% convinced that Continuity makes sense for scalar metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aggregating any already aggregated data is a snapshot
I would have stated that:
- aggregating already-aggregated SNAPSHOT data is SNAPSHOT data
- aggregating already-aggregated CONTINUOUS data is CONTINUOUS data
Combining multiple input metrics of different Continuity is not well defined in this (variation of the) protocol, and I'm not sure we should be aiming for a protocol that supports multiple input metrics regardless of whether they agree on Structure/Continuity. In all cases, the presence of StartTimeUnixNano != 0 indicates that the input data spanned a period of time as opposed to being defined at an instant where StartTimeUnixNano == TimeUnixNano.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at the moment you export the sum there is minimal information that Snapshot vs Continuous provide
My intention is that both of these fields tell us about the input measurements, which do not change when we aggregate.
There may be minimal information, but there is information here. If you see two raw measurements with the same timestamp and they are SNAPSHOT, they were made by the same callback and can be compared as a ratio. If you see two raw measurements with the same timestamp and they are CONTINUOUS, they are merely coincidental and should not be compared as a ratio.
A sum is a sum, and we know that it is sensible to compute a rate of a sum. A "gauge" (or non-sum) should not be exposed as a rate, because we don't know that the distance between A and B is the same as the distance between (A+x) and (B+x). This quality is the reason that Sums can be displayed as rates regardless of whether they are monotonic or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My intention is that both of these fields tell us about the input measurements, which do not change when we aggregate.
I think this is not clear to me, will try to show an example to show where my confusion comes from:
- An "UpDownSumObserver" for "memory usage", and having an export interval of 10seconds, every 10 second I report the cumulative sum of all "new/delete" calls. User configures the library to just export the value read from the kernel with no changes.
- An "UpDownCounter" for "new/delete" calls, by instrumenting tcmalloc library. We have configured a cumulative sum aggregation (maybe without exemplars if that will be possible). Export interval is the same 10 seconds.
In this two cases I get exactly the same data - every 10 second a scalar metric with the sum of new/delete from the beginning of the process. The only difference may be "exemplars" which may or may not be present for the second metric (not guaranteed that they are present, if the user configures to not keep exemplars in the view configuration).
Now the idea is, way do we need to know and complicate the model for this case with continuous
vs snapshot
? Both are snapshots
because even in the second metric we actually take a snapshot of the aggregator at a specific moment of time.
If you see two raw measurements with the same timestamp and they are SNAPSHOT, they were made by the same callback and can be compared as a ratio. If you see two raw measurements with the same timestamp and they are CONTINUOUS, they are merely coincidental and should not be compared as a ratio.
Here is where I agree with you, but I see things a bit different as proposed in #179, we have "Raw measurements" a.k.a CONTINUOUS
which represents measurements recorded using the sync instruments with no other aggregation and any other "measurements" are snapshot
independent of where that aggregation happens (in our library like in the case of the second metric, or outside our library in case of the first metric) because we actually take a snapshot of an Aggregator
.
Probably I miss something, but here is where my current understanding is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in my proposal to clarify "Raw Measurements" imply continuos all the other types imply snapshot because we take a snapshot of an Aggregator that somewhere exists (inside or outside of our library)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this two cases I get exactly the same data - every 10 second a scalar metric with the sum of new/delete from the beginning of the process
The count of label sets means something different in these cases. The Observer instrument produces N points at the same logical instant in time, whereas the Recorder instrument produces N points over a window of time. If I ask how many distinct label sets there are from a Recorder (CONTINUOUS), I have to specify a time window for the question to be meaningful. For an Observer (SNAPSHOT), I can answer the question ("how many distinct label sets") without specifying a time range.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The count of label sets means something different in these cases. The Observer instrument produces N points at the same logical instant in time, whereas the Recorder instrument produces N points over a window of time. If I ask how many distinct label sets there are from a Recorder (CONTINUOUS), I have to specify a time window for the question to be meaningful. For an Observer (SNAPSHOT), I can answer the question ("how many distinct label sets") without specifying a time range.
I think this is where I miss something (maybe here is where the disagreement comes from). In both cases for me when I look at the N points (N different label sets) they accumulated (sums) values over that time interval:
- -------- // start time t0'
- t0 - new object with location "stack" size 10
- t1 - new object with location "heap" size 100
- t2 - delete object with location "stack" size 10
- t3 - new object with location "stack" size 20
- -------- // exporting t3'
- t4 - delete object with location "stack" size 20
- t5 - new object with location "stack" size 30
- t6 - delete object with location "heap" size 100
- t7 - new object with location "heap" size 200
- -------- // exporting t7'
- t8 - delete object with location "stack" size 30
- t9 - delete object with location "heap" size 200
- -------- // exporting t9'
First Metric:
- first export (start_time= t0'; end_time = t3' - "location"="stack" 20; "location"="heap" 100)
- second export (start_time= t0'; end_time = t7' - "location"="stack" 30; "location"="heap" 200)
- third export (start_time= t0'; end_time = t9' - "location"="stack" 0; "location"="heap" 0)
Second Metric:
- first export (start_time= t0'; end_time = t3' - "location"="stack" 20; "location"="heap" 100)
- second export (start_time= t0'; end_time = t7' - "location"="stack" 30; "location"="heap" 200)
- third export (start_time= t0'; end_time = t9' - "location"="stack" 0; "location"="heap" 0)
In both cases we have the same number of unique label sets, and we cannot assign a specific moment when a value for a specific label set was updated, because that information is lost during aggregation, the only thing we know is that in the interval t0' - tx' we had this number of distinct label sets and these are the "current values" corresponding to every label set.
To get a bit more confused, even for the "second metric" we actually read the value at one specific moment "exporting time", so why is that different? I think once we left the library by adding the data to the wire, we see these metrics exactly the same, inside the library there may be some interpretations like the one you had, but once we put the data on the wire they are identical to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I misunderstood this:
In all cases, the presence of StartTimeUnixNano != 0 indicates that the input data spanned a period of time as opposed to being defined at an instant where StartTimeUnixNano == TimeUnixNano.
I am trying again to see if that makes sense:
- There are some Observers where we know the start time, for example a
SumObserver
may have the start time equal with the start of the process, and in this case their data will look exactly the same as the "UpDownCounter" example. In this case the metric is considered "CONTINUOUS". - There are some Observers where we cannot determine the start time correctly, even for a
SumObserver
, in this case the collected points will miss the start_time and hence they will be considered "SNAPSHOT".
If this is the example that you have in mind, I completely agree that this is a possibility in the real world, but this makes me to think that Observers especially "SumObserver" and "UpDownSumObserver" need to support a configuration where users will specify the start_time (or just a boolean to say if it is equal to the start of the process).
There is a part of this proposal that doesn't sit well with me--the distinction between raw snapshot and raw continuous measurements is arbitrary. In open-telemetry/oteps#88 we discussed a number of potential instruments that are sensible but not included in the API specification, those would have hypothetical names If we were to introduce these instruments, we'd need a way to distinguish whether a raw sum was entered as a change or as a sum, which is something that @bogdandrutu included under the |
… the aggregation and temporality. This PR takes a more descriptive approach where the Aggregation (the last applied aggregation if any) is clearly defined and Temporality is available only where it makes sense. This can help clearly identify what are the possible values and properties for different types of Aggregations. There are some things that can be considered (TODOs left in the code): * open-telemetry/opentelemetry-specification#725 * Histogram/Summary sum - monotonicity? * Previous aggregation measurements: raw measurements or "derived" measurements (results of another aggregation). * Support for RawMeasurements (recorded via the Sync Instruments) IMPORTANT: * This PR is not equivalent with open-telemetry#168 or open-telemetry#181, this is inspired by these PRs but makes some changes that are not compatible with that PR. * This PR is an incremental update, without any significant semantic changes (except adding the notion of GAUGE), from the current state, more changes will be added in the future to resolve the TODOs. Signed-off-by: Bogdan Drutu <[email protected]>
… the aggregation and temporality. This PR takes a more descriptive approach where the Aggregation (the last applied aggregation if any) is clearly defined and Temporality is available only where it makes sense. This can help clearly identify what are the possible values and properties for different types of Aggregations. There are some things that can be considered (TODOs left in the code): * open-telemetry/opentelemetry-specification#725 * Histogram/Summary sum - monotonicity? * Previous aggregation measurements: raw measurements or "derived" measurements (results of another aggregation). * Support for RawMeasurements (recorded via the Sync Instruments) * Decide if support for INSTANTANEOUS temporality is still needed. IMPORTANT: * This PR is not equivalent with open-telemetry#168 or open-telemetry#181, this is inspired by these PRs but makes some changes that are not compatible with that PR. * This PR is an incremental update, without any significant semantic changes (except adding the notion of GAUGE), from the current state, more changes will be added in the future to resolve the TODOs. Signed-off-by: Bogdan Drutu <[email protected]>
See #182 |
…tails about the aggregation and temporality. (#182) * Change Metric type to be a more detailed structure with details about the aggregation and temporality. This PR takes a more descriptive approach where the Aggregation (the last applied aggregation if any) is clearly defined and Temporality is available only where it makes sense. This can help clearly identify what are the possible values and properties for different types of Aggregations. There are some things that can be considered (TODOs left in the code): * open-telemetry/opentelemetry-specification#725 * Histogram/Summary sum - monotonicity? * Previous aggregation measurements: raw measurements or "derived" measurements (results of another aggregation). * Support for RawMeasurements (recorded via the Sync Instruments) * Decide if support for INSTANTANEOUS temporality is still needed. IMPORTANT: * This PR is not equivalent with #168 or #181, this is inspired by these PRs but makes some changes that are not compatible with that PR. * This PR is an incremental update, without any significant semantic changes (except adding the notion of GAUGE), from the current state, more changes will be added in the future to resolve the TODOs. Signed-off-by: Bogdan Drutu <[email protected]> * Address feedback, added more TODOs and created issues Signed-off-by: Bogdan Drutu <[email protected]> * Update opentelemetry/proto/metrics/v1/metrics.proto Co-authored-by: Aaron Abbott <[email protected]> * Update opentelemetry/proto/metrics/v1/metrics.proto Co-authored-by: Tyler Yahn <[email protected]> * Update opentelemetry/proto/metrics/v1/metrics.proto Co-authored-by: Tyler Yahn <[email protected]> * Update opentelemetry/proto/metrics/v1/metrics.proto Co-authored-by: Tyler Yahn <[email protected]> * Update opentelemetry/proto/metrics/v1/metrics.proto Co-authored-by: Tyler Yahn <[email protected]> * Update opentelemetry/proto/metrics/v1/metrics.proto Co-authored-by: Tyler Yahn <[email protected]> * Fix indentation and comment for Type Signed-off-by: Bogdan Drutu <[email protected]> Co-authored-by: Aaron Abbott <[email protected]> Co-authored-by: Tyler Yahn <[email protected]>
This is simplified from #168 by removing INSTANTANEOUS Temporality and making Structure and Continuity independent fields, yielding 12 combinations of Temporality, Structure, and Continuity.
Like #168, this moves Monotonicity into a Structure value, removes it from Type.
The removal of INSTANTANEOUS is accommodated by making StartTimeUnixNano optional when data is unaggregated. Raw measurements correspond to the API specification for raw data, meaning: