From db9422d812a765f4c3ccd36c5329c1dc4613350a Mon Sep 17 00:00:00 2001 From: David Ashpole Date: Thu, 10 Aug 2023 17:20:45 -0400 Subject: [PATCH] Attribute sets not observed during async callbacks are not exported (#3242) --- CHANGELOG.md | 2 + specification/metrics/sdk.md | 4 + .../metrics/supplementary-guidelines.md | 134 ++++++++++-------- 3 files changed, 82 insertions(+), 58 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 96eb112f960..8bff2573b7e 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -17,6 +17,8 @@ release. ([#3648](https://github.com/open-telemetry/opentelemetry-specification/pull/3648)) - MetricReader.Collect ignores Resource from MetricProducer.Produce. ([#3636](https://github.com/open-telemetry/opentelemetry-specification/pull/3636)) +- Attribute sets not observed during async callbacks are not exported. + ([#3242](https://github.com/open-telemetry/opentelemetry-specification/pull/3242)) ### Logs diff --git a/specification/metrics/sdk.md b/specification/metrics/sdk.md index 8e348315d78..201282ba979 100644 --- a/specification/metrics/sdk.md +++ b/specification/metrics/sdk.md @@ -701,6 +701,10 @@ execution. The implementation MUST complete the execution of all callbacks for a given instrument before starting a subsequent round of collection. +The implementation SHOULD NOT produce aggregated metric data for a +previously-observed attribute set which is not observed during a successful +callback. + ### Cardinality limits **Status**: [Experimental](../document-status.md) diff --git a/specification/metrics/supplementary-guidelines.md b/specification/metrics/supplementary-guidelines.md index 1837274e9f5..d8f271f8854 100644 --- a/specification/metrics/supplementary-guidelines.md +++ b/specification/metrics/supplementary-guidelines.md @@ -431,25 +431,30 @@ Instrument](./api.md#histogram). What if we collect measurements from an [Asynchronous Counter](./api.md#asynchronous-counter)? The following example shows the number of [page -faults](https://en.wikipedia.org/wiki/Page_fault) of each thread since the -thread ever started: +faults](https://en.wikipedia.org/wiki/Page_fault) of each process since +it started: * During the time range (T0, T1]: - * pid = `1001`, tid = `1`, #PF = `50` - * pid = `1001`, tid = `2`, #PF = `30` + * pid = `1001`, #PF = `50` + * pid = `1002`, #PF = `30` * During the time range (T1, T2]: - * pid = `1001`, tid = `1`, #PF = `53` - * pid = `1001`, tid = `2`, #PF = `38` + * pid = `1001`, #PF = `53` + * pid = `1002`, #PF = `38` * During the time range (T2, T3] - * pid = `1001`, tid = `1`, #PF = `56` - * pid = `1001`, tid = `2`, #PF = `42` + * pid = `1001`, #PF = `56` + * pid = `1002`, #PF = `42` * During the time range (T3, T4]: - * pid = `1001`, tid = `1`, #PF = `60` - * pid = `1001`, tid = `2`, #PF = `47` + * pid = `1001`, #PF = `60` + * pid = `1002`, #PF = `47` * During the time range (T4, T5]: - * thread 1 died, thread 3 started - * pid = `1001`, tid = `2`, #PF = `53` - * pid = `1001`, tid = `3`, #PF = `5` + * process 1001 died, process 1003 started + * pid = `1002`, #PF = `53` + * pid = `1003`, #PF = `5` +* During the time range (T5, T6]: + * A new process 1001 started + * pid = `1001`, #PF = `10` + * pid = `1002`, #PF = `57` + * pid = `1003`, #PF = `8` Note that in the following examples, Cumulative aggregation temporality is discussed before Delta aggregation temporality because @@ -461,47 +466,56 @@ API with specified Cumulative aggregation temporality. If we export the metrics using **Cumulative Temporality**: * (T0, T1] - * attributes: {pid = `1001`, tid = `1`}, sum: `50` - * attributes: {pid = `1001`, tid = `2`}, sum: `30` + * attributes: {pid = `1001`}, sum: `50` + * attributes: {pid = `1002`}, sum: `30` * (T0, T2] - * attributes: {pid = `1001`, tid = `1`}, sum: `53` - * attributes: {pid = `1001`, tid = `2`}, sum: `38` + * attributes: {pid = `1001`}, sum: `53` + * attributes: {pid = `1002`}, sum: `38` * (T0, T3] - * attributes: {pid = `1001`, tid = `1`}, sum: `56` - * attributes: {pid = `1001`, tid = `2`}, sum: `42` + * attributes: {pid = `1001`}, sum: `56` + * attributes: {pid = `1002`}, sum: `42` * (T0, T4] - * attributes: {pid = `1001`, tid = `1`}, sum: `60` - * attributes: {pid = `1001`, tid = `2`}, sum: `47` + * attributes: {pid = `1001`}, sum: `60` + * attributes: {pid = `1002`}, sum: `47` * (T0, T5] - * attributes: {pid = `1001`, tid = `2`}, sum: `53` - * attributes: {pid = `1001`, tid = `3`}, sum: `5` + * attributes: {pid = `1002`}, sum: `53` +* (T4, T5] + * attributes: {pid = `1003`}, sum: `5` +* (T5, T6] + * attributes: {pid = `1001`}, sum: `10` +* (T0, T6] + * attributes: {pid = `1002`}, sum: `57` +* (T4, T6] + * attributes: {pid = `1003`}, sum: `8` The behavior in the first four periods is quite straightforward - we just take the data being reported from the asynchronous instruments and send them. -The data model prescribes several valid behaviors at T5 in -this case, where one stream dies and another starts. The [Resets and -Gaps](./data-model.md#resets-and-gaps) section describes how start -timestamps and staleness markers can be used to increase the +The data model prescribes several valid behaviors at T5 and +T6 in this case, where one stream dies and another starts. +The [Resets and Gaps](./data-model.md#resets-and-gaps) section describes +how start timestamps and staleness markers can be used to increase the receiver's understanding of these events. Consider whether the SDK maintains individual timestamps for the individual stream, or just one per process. In this example, where a -thread can die and start counting page faults from zero, the valid -behaviors at T5 are: +process can die and restart, it starts counting page faults from zero. +In this case, the valid behaviors at T5 and T6 +are: 1. If all streams in the process share a start time, and the SDK is not required to remember all past streams: the thread restarts with - zero sum. Receivers with reset detection are able to calculate a - correct rate (except for frequent restarts relative to the - collection interval), however the precise time of a reset will be - unknown. -2. If the SDK maintains per-stream start times, it signals to the - receiver precisely when a stream started, making the first - observation in a stream more useful for diagnostics. Receivers can - perform overlap detection or duplicate suppression and do not - require reset detection, in this case. + zero sum, and the start time of the process. Receivers with reset + detection are able to calculate a correct rate (except for frequent + restarts relative to the collection interval), however the precise + time of a reset will be unknown. +2. If the SDK maintains per-stream start times, it provides the previous + callback time as the start time, as this time is before the occurrence + of any events which are measured during the subsequent callback. This + makes the first observation in a stream more useful for diagnostics, + as downstream consumers can perform overlap detection or duplicate + suppression and do not require reset detection in this case. 3. Independent of above treatments, the SDK can add a staleness marker to indicate the start of a gap in the stream when one thread dies by remembering which streams have previously reported but are not @@ -519,20 +533,23 @@ data model. If we export the metrics using **Delta Temporality**: * (T0, T1] - * attributes: {pid = `1001`, tid = `1`}, delta: `50` - * attributes: {pid = `1001`, tid = `2`}, delta: `30` + * attributes: {pid = `1002`}, delta: `30` * (T1, T2] - * attributes: {pid = `1001`, tid = `1`}, delta: `3` - * attributes: {pid = `1001`, tid = `2`}, delta: `8` + * attributes: {pid = `1001`}, delta: `3` + * attributes: {pid = `1002`}, delta: `8` * (T2, T3] - * attributes: {pid = `1001`, tid = `1`}, delta: `3` - * attributes: {pid = `1001`, tid = `2`}, delta: `4` + * attributes: {pid = `1001`}, delta: `3` + * attributes: {pid = `1002`}, delta: `4` * (T3, T4] - * attributes: {pid = `1001`, tid = `1`}, delta: `4` - * attributes: {pid = `1001`, tid = `2`}, delta: `5` + * attributes: {pid = `1001`}, delta: `4` + * attributes: {pid = `1002`}, delta: `5` * (T4, T5] - * attributes: {pid = `1001`, tid = `2`}, delta: `6` - * attributes: {pid = `1001`, tid = `3`}, delta: `5` + * attributes: {pid = `1002`}, delta: `6` + * attributes: {pid = `1003`}, delta: `5` +* (T5, T6] + * attributes: {pid = `1001`}, delta: `10` + * attributes: {pid = `1002`}, delta: `4` + * attributes: {pid = `1003`}, delta: `3` You can see that we are performing Cumulative->Delta conversion, and it requires us to remember the last value of **every single permutation we've encountered so @@ -560,11 +577,10 @@ So here are some suggestions that we encourage SDK implementers to consider: ##### Asynchronous example: attribute removal in a view Suppose the metrics in the asynchronous example above are exported -through a view configured to remove the `tid` attribute, leaving a -single-dimensional count of page faults by `pid`. For each metric -stream, two measurements are produced covering the same interval of -time, which the SDK is expected to aggregate before producing the -output. +through a view configured to remove the `pid` attribute, leaving a +count of page faults. For each metric stream, two measurements are produced +covering the same interval of time, which the SDK is expected to aggregate +before producing the output. The data model specifies to use the "natural merge" function, in this case meaning to add the current point values together because they @@ -572,15 +588,17 @@ are `Sum` data points. The expected output is, still in **Cumulative Temporality**: * (T0, T1] - * dimensions: {pid = `1001`}, sum: `80` + * dimensions: {}, sum: `80` * (T0, T2] - * dimensions: {pid = `1001`}, sum: `91` + * dimensions: {}, sum: `91` * (T0, T3] - * dimensions: {pid = `1001`}, sum: `98` + * dimensions: {}, sum: `98` * (T0, T4] - * dimensions: {pid = `1001`}, sum: `107` + * dimensions: {}, sum: `107` * (T0, T5] - * dimensions: {pid = `1001`}, sum: `58` + * dimensions: {}, sum: `58` +* (T0, T6] + * dimensions: {}, sum: `75` As discussed in the asynchronous cumulative temporality example above, there are various treatments available for detecting resets. Even if