First cut at single writer principle. #1574
Changes from 9 commits

@@ -34,16 +34,16 @@ Prometheus Remote Write protocol without loss of features or semantics, through
well-defined translations of the data, including the ability to automatically
remove attributes and lower histogram resolution.

## Events → Data → Timeseries
## Events → Data Stream → Timeseries

The OTLP Metrics protocol is designed as a standard for transporting metric
data. To describe the intended use of this data and the associated semantic
meaning, OpenTelemetry metric data types will be linked into a framework
meaning, OpenTelemetry metric data stream types will be linked into a framework
containing a higher-level model, about Metrics APIs and discrete input values,
and a lower-level model, defining the Timeseries and discrete output values.
The relationship between models is displayed in the diagram below.

![Events → Data → Timeseries Diagram](img/model-layers.png)
![Events → Data Stream → Timeseries Diagram](img/model-layers.png)

This protocol was designed to meet the requirements of the OpenCensus Metrics
system, particularly to meet its concept of Metrics Views. Views are

@@ -67,9 +67,9 @@ collector. These transformations are:
allows downstream services to bear the cost of conversion into cumulative
timeseries, or to forego the cost and calculate rates directly.

OpenTelemetry Metrics data points are designed so that these transformations can
be applied automatically to points of the same type, subject to conditions
outlined below. Every OTLP data point has an intrinsic
OpenTelemetry Metrics data streams are designed so that these transformations
can be applied automatically to streams of the same type, subject to conditions
outlined below. Every OTLP data stream has an intrinsic
[decomposable aggregate function](https://en.wikipedia.org/wiki/Aggregate_function#Decomposable_aggregate_functions)
making it semantically well-defined to merge data points across both temporal
and spatial dimensions. Every OTLP data point also has two meaningful timestamps

@@ -139,10 +139,10 @@ in scope for key design decisions:
OpenTelemetry fragments metrics into three interacting models:

- An Event model, representing how instrumentation reports metric data.
- A TimeSeries model, representing how backends store metric data.
- The *O*pen*T*e*L*emetry *P*rotocol (OTLP) data model representing how metrics
are manipulated and transmitted between the Event model and the TimeSeries
storage.
- A Timeseries model, representing how backends store metric data.
- A Metric Stream model, defined by the *O*pen*T*e*L*emetry *P*rotocol (OTLP),
representing how metric data streams are manipulated and transmitted between
the Event model and the Timeseries storage.

### Event Model

@@ -199,7 +199,24 @@ further development of the correspondence between these models.

### OpenTelemetry Protocol data model

The OpenTelemetry data model for metrics includes four basic point kinds, all of
The OpenTelemetry protocol data model is composed of Metric data streams. These
streams are in turn composed of metric data points. Metric data streams
can be converted directly into Timeseries, and share the same identity
characteristics as a Timeseries. A metric stream is identified by:

- The originating `Resource`

Review comment: Why is this a requirement? Sometimes I want resource dimensions to be present in a time series, sometimes I don't, it depends on the use case and the cost of cardinality.

Review comment: I think you're taking the view that a TimeSeries generated from OTLP doesn't need the resource. That's completely fine, we're documenting what a Metric Data Stream is and how it's identified. Backends can map these to timeseries however they wish (as is done today), albeit OTel should come with a recommended conversion. What this specification means is that within OTLP you'll have a resource that's part of your identity (which is inherently true by protocol structure).

Review comment: We are trying to say that points with different resources belong to different streams, even when all of the other identifying properties are identical. Resource qualifies as identifying, whereas Unit does not qualify as such. Using the reasoning in the previous paragraph to explain this: points with different units, with all else equal, are considered an error. They are not distinct streams. I tried to apply the same reasoning to data point kind, but ran into trouble. Two points with different kinds, but all else equal, should be considered an error, not distinct streams.

- The metric stream's `name`.
- The attached `Attribute`s
- The metric stream's point kind.

Review comment: Is the point kind really intended to be part of the identity? Is it possible to have Timeseries with the exact same set of the above 3 elements and only differing in the point kind? Do any/most metric backends that exist today support this notion?

Review comment: We discussed this heavily in the SIG. TL;DR: it's likely an error scenario/misbehaving system that would have the same timeseries with different point types. For the purposes of OTel + Aggregation we do NOT unify them and leave that option to backends to determine the right thing to do. This isn't about preventing backends from supporting it, but allowing them to do so and not forcing OTel to specify some kind of behavior that works with backends which do not.

Review comment: On the other hand, if we explicitly declare that this is a valid situation, OpenTelemetry-compliant metric sources may start emitting it, causing problems for backends that don't expect to see such data. I think this is the case when we need to have a stance and that stance has a consequence for backends. We either explicitly prohibit Timeseries with identical (resource, name, attributes) and different data points or we explicitly allow it, which means backends have to support it, otherwise they can't accept OpenTelemetry-compliant data. I can see how allowing this can be valuable (e.g. I may export a gauge and a histogram for the same instrument with exact same name). My concern here is that if we allow this we may suddenly put vendors in a situation where they can't really accept this data. Just to be clear: I don't know if this is really a problem for backends. If you know that it is commonly accepted in the industry to emit such timeseries, then I don't see a problem.

Review comment: One thing I'm not sure I made clear (and it was crystal clear in the discussion).

TL;DR: This is an error state, we just describe here how the "model" treats that error scenario so progress can be made and the best component has a chance to give the best error message.

Review comment: Our stance for the Collector so far has been that by default it is not in the business of interpreting data or "treating" it in any way. It only does so when explicitly configured by the user to do so, or when a particular non-OTLP format is used and a translation must happen. Do you have a different expectation from the Collector?

Sounds good, makes sense to me. I may have missed it, but I do not see this in the text. Can we explicitly call out that this is an error state in the text?

Review comment: I think some of the built-in operations (label removal, aggregations, etc.) that are available in the baseline image should be clearly specified against the Data Model. Yes, I agree users configure these things, but that doesn't mean we don't clearly outline what they do and ensure their "compatibility" in shape to lead to the least-possible issues with exporters. Specifically: I think some generally-available collector operations should be specified in a way that they can be used without causing issues across Exporters. If users provide their own processors that do crazier things outside this model, that's fine and outside any guarantees.

Thought that I had, let me add that now.

Review comment: I agree with everything being said here. Different point kinds should be considered errors. I agree that the collector's default behavior should be to simply pass the data through. We are trying to lay out correctness conditions for aggregating data and want definitions that help aggregators do the Right Thing when these errors arise. It seems to me that metric name, the resource, and metric attributes are first class identifiers for metric streams. The point kind and unit are somehow in a different category: these should be treated as distinct for the purposes of pass-through data processing, but should be treated as errors for aggregation. I'm not sure we have the proper terminology to describe properties of a metric stream that are identifying but erroneous vs identifying and compatible. Also I think we may be distracted by trying to define what is and is not an error, where we were trying to define what is a stream and its single-writer property. We may have a set of individually valid streams with single writers, that when considered together form an erroneous condition. Example: when two streams have the same name and resource, but distinct point kind and label values. Each stream is valid, but we should never see a mixture of point kinds for a metric name, independent of how many streams it writes.

Review comment: Question: if we get multiple point kinds with same metric name, how would we know which stream is erroneous? Do we need to register the point-kind + metric name prior?

Review comment: @victlu It's likely impossible for 'open telemetry' to know which is erroneous. A backend that supports schema'd metrics could use the currently registered metric schema to determine which one is 'stale' and ingest the correct one, but the key here is this is an "error" scenario.

It is possible (and likely) that more than one metric stream is created per
`Instrument` in the event model.

__Note: Metric streams with the same `Resource`, `name` and `Attribute`s but
differing point kinds coming out of an OpenTelemetry SDK are considered an
"error state" that should be handled by the SDK.__
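
As a rough illustration of the identity rules above, the following Python sketch models a stream's identifying properties as a hashable key and flags groups that share `Resource`, `name` and `Attribute`s but differ in point kind, which is the "error state" described in the note. The `StreamKey` type, the `kind_conflicts` helper, and the representation of a `Resource` as a plain set of string attributes are hypothetical simplifications, not OTLP or SDK types.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

# Hypothetical simplification: resources and attributes as frozen sets of pairs.
Attrs = FrozenSet[Tuple[str, str]]


def attrs(**kv: str) -> Attrs:
    """Build an immutable, hashable attribute set."""
    return frozenset(kv.items())


@dataclass(frozen=True)
class StreamKey:
    """Identifying properties of a metric stream (hypothetical type)."""
    resource: Attrs
    name: str
    attributes: Attrs
    point_kind: str  # e.g. "Sum", "Gauge", "Histogram"


def kind_conflicts(keys: Iterable[StreamKey]) -> List[Tuple[Attrs, str, Attrs]]:
    """Return (resource, name, attributes) groups seen with more than one point kind."""
    kinds: Dict[Tuple[Attrs, str, Attrs], Set[str]] = defaultdict(set)
    for key in keys:
        kinds[(key.resource, key.name, key.attributes)].add(key.point_kind)
    return [group for group, seen in kinds.items() if len(seen) > 1]


res = attrs(service="checkout")
keys = [
    StreamKey(res, "http.requests", attrs(route="/a"), "Sum"),
    StreamKey(res, "http.requests", attrs(route="/a"), "Gauge"),  # differing kind
    StreamKey(res, "http.requests", attrs(route="/b"), "Sum"),
]
print(len(set(keys)))             # 3 distinct streams
print(len(kind_conflicts(keys)))  # 1 group in the error state
```

Keeping the point kind in the key leaves both streams distinct for pass-through processing, as suggested in the review discussion; an aggregating component would report the flagged group as an error rather than merge it.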

A metric stream can use one of four basic point kinds, all of
which satisfy the requirements above, meaning they define a decomposable
aggregate function (also known as a “natural merge” function) for points of the
same kind. <sup>[1](#otlpdatapointfn)</sup>

@@ -211,11 +228,12 @@ The basic point kinds are:
3. Gauge
4. Histogram
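
To illustrate what a “natural merge” can look like for each basic point kind, here is a minimal Python sketch. It assumes simplified, hypothetical point types (`SumPoint`, `GaugePoint`, `HistogramPoint`) rather than the OTLP protobuf messages: sums add, histograms add bucket counts over identical boundaries, and gauges keep the most recent value. Sum and histogram points merge either spatially (identical time window) or temporally (adjacent delta windows).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SumPoint:
    start: int  # start of the time window
    end: int    # end of the time window
    value: float


@dataclass(frozen=True)
class GaugePoint:
    time: int
    value: float


@dataclass(frozen=True)
class HistogramPoint:
    start: int
    end: int
    bounds: Tuple[float, ...]  # explicit bucket boundaries
    counts: Tuple[int, ...]    # len(bounds) + 1 bucket counts


def merged_window(a_start: int, a_end: int, b_start: int, b_end: int) -> Tuple[int, int]:
    if (a_start, a_end) == (b_start, b_end):  # spatial merge: same window
        return a_start, a_end
    if a_end == b_start:                       # temporal merge: adjacent windows
        return a_start, b_end
    raise ValueError("windows are neither identical nor adjacent")


def merge_sum(a: SumPoint, b: SumPoint) -> SumPoint:
    start, end = merged_window(a.start, a.end, b.start, b.end)
    return SumPoint(start, end, a.value + b.value)


def merge_histogram(a: HistogramPoint, b: HistogramPoint) -> HistogramPoint:
    if a.bounds != b.bounds:
        raise ValueError("bucket boundaries differ")
    start, end = merged_window(a.start, a.end, b.start, b.end)
    counts = tuple(x + y for x, y in zip(a.counts, b.counts))
    return HistogramPoint(start, end, a.bounds, counts)


def merge_gauge(a: GaugePoint, b: GaugePoint) -> GaugePoint:
    # Last value wins.
    return b if b.time >= a.time else a


s = merge_sum(SumPoint(0, 10, 3.0), SumPoint(10, 20, 4.0))
h = merge_histogram(HistogramPoint(0, 10, (1.0, 5.0), (1, 2, 0)),
                    HistogramPoint(0, 10, (1.0, 5.0), (0, 1, 3)))
g = merge_gauge(GaugePoint(5, 1.5), GaugePoint(9, 0.5))
print(s)         # SumPoint(start=0, end=20, value=7.0)
print(h.counts)  # (1, 3, 3)
print(g.value)   # 0.5
```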

Comparing the OpenTelemetry and Timeseries data models, OTLP carries an
additional kind of point. Whereas an OTLP Monotonic Sum point translates into a
Timeseries Counter point, and an OTLP Histogram point translates into a
Timeseries Histogram point, there are two OTLP data points that become Gauges
in the Timeseries model: the OTLP Non-Monotonic Sum point and OTLP Gauge point.
Comparing the OTLP Metric Data Stream and Timeseries data models, the Metric
Data Stream model carries an additional kind of point. Whereas an OTLP
Monotonic Sum point translates into a Timeseries Counter point, and an OTLP
Histogram point translates into a Timeseries Histogram point, there are two
OTLP data points that become Gauges in the Timeseries model: the OTLP
Non-Monotonic Sum point and the OTLP Gauge point.
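
The translation described above, where four OTLP point kinds collapse into three Timeseries kinds, can be written as a lookup table. This Python sketch is illustrative only; the enum member names are hypothetical labels for the point kinds named in the paragraph.

```python
from enum import Enum


class OtlpKind(Enum):
    MONOTONIC_SUM = "Monotonic Sum"
    NON_MONOTONIC_SUM = "Non-Monotonic Sum"
    GAUGE = "Gauge"
    HISTOGRAM = "Histogram"


# Timeseries kind produced by each OTLP point kind.
TIMESERIES_KIND = {
    OtlpKind.MONOTONIC_SUM: "Counter",
    OtlpKind.NON_MONOTONIC_SUM: "Gauge",
    OtlpKind.GAUGE: "Gauge",
    OtlpKind.HISTOGRAM: "Histogram",
}

# Non-monotonic sums and gauges both become Timeseries Gauges, so the OTLP
# side carries one more kind than the Timeseries side.
print(len(OtlpKind), len(set(TIMESERIES_KIND.values())))  # 4 3
```

What the table loses is the aggregate function: the two OTLP kinds that map to a Timeseries Gauge are distinguished by how they re-aggregate, which is why OTLP keeps them separate.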

The two points that become Gauges in the Timeseries model are distinguished by
their built-in aggregate function, meaning they define re-aggregation

@@ -224,11 +242,109 @@ histograms.

## Single-Writer

Pending

All metric data streams within OTLP must have one logical writer. This means,
conceptually, that any Timeseries created from the Protocol must have one
originating source of truth. In practical terms, this implies the following:

- All metric data streams produced by OTel SDKs must be globally unique and
free from duplicates. All metric data streams can be uniquely
identified in some way.

Review comment: What is considered a duplicate?

Review comment: This was from the @jmacd document. I think this is synonymous with "overlap", defined below. I can update but want to check with Josh to make sure I haven't missed an important point here.

Review comment: I am thinking of "duplicate" points as points in a stream with identical start and end timestamps, whereas overlapping points are defined based on the temporality. Cumulative points overlap, in one sense, but two points are compatible when they share a start time. Two cumulative streams would have an overlap condition when multiple start times are mixed with overlapping end times, however. Gauge (instantaneous temporality) and Delta temporality points have different overlap rules. Still, I think duplicate points are in a class of their own, particularly when we begin to talk about external labels as Prometheus defines them. Duplicate points may be expected, and when they are truly duplicates we typically discard all but one.

- Aggregations of metric streams must only be written from a single logical
source.
__Note: This implies aggregated metric streams must reach one destination__.

Review comment on lines +252 to +254: The part in bold kinda makes sense to me, but I don't see why this is a requirement of the protocol. The part about single source doesn't make sense to me. I think the protocol needs to support data in such a format that it can be repeatedly re-aggregated, if necessary.

Review comment: I'm not sure how this requirement leads to data that can't be repeatedly re-aggregated. Can you provide a counter example where it's an issue? AFAIK you should be able to continually re-aggregate assuming:

In practice that's actually a pretty easy requirement. For backends where you drop resource, you're back to aggregating what looks like the same timeseries from multiple sources, but within OTLP we can identify their separate sources.

Review comment: We are trying to define a "single-writer" property so that we know when it is safe to re-aggregate data. There are ways to process metrics data that remove the single-writer property (e.g., horizontal scaling) and there are ways to restore that property (e.g., having a message-queue in place) to allow re-aggregation. We are trying to ensure that metrics originate from a single-writer so that we know these transformations will work. For these rules to work, we have to be sure that multiple processes cannot somehow write to the same stream, because that would mean they can overwrite each other. Thus, as @jsuereth implies, we will generate new metric streams following re-aggregation.

In practice, multiple writers may send data for the same metric stream
(duplication). For example, if an SDK implementation fails to find uniquely
identifying Resource attributes for a component, then all instances of that
component could be reporting metrics as if they are from the same resource. In
this case, metrics will be reported at inconsistent time intervals. For metrics
like cumulative sums, this could cause issues where pairs of points appear to
reset the cumulative sum, leading to unusable metrics.
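
To see why interleaved cumulative points look like resets, the following Python sketch merges points from two hypothetical processes that report under the same stream identity and counts the decreases a receiver would observe. The data and the `apparent_resets` helper are invented for illustration.

```python
from typing import List, Tuple


def apparent_resets(points: List[Tuple[int, float]]) -> int:
    """Count decreases in a cumulative series ordered by timestamp.

    For a monotonic cumulative sum from a single writer, a decrease can only
    mean a reset, so every decrease here is interpreted as one.
    """
    ordered = sorted(points)
    return sum(1 for (_, prev), (_, cur) in zip(ordered, ordered[1:]) if cur < prev)


# Two processes reporting under one identity (e.g. missing a distinguishing
# Resource attribute), each with its own (timestamp, cumulative total).
writer_a = [(10, 100.0), (20, 110.0), (30, 120.0)]
writer_b = [(15, 5.0), (25, 8.0), (35, 12.0)]

print(apparent_resets(writer_a))             # 0
print(apparent_resets(writer_a + writer_b))  # 3
```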

Multiple writers for a metric stream are considered an error state, or a
misbehaving system. Receivers SHOULD presume a single writer was intended and
eliminate overlap / deduplicate.
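
One way a receiver could act on this guidance is to drop exact duplicates, treating points with the same stream identity and identical start and end timestamps as copies of a single write (the definition of "duplicate" proposed in the review discussion above). This is a minimal Python sketch under that assumption; keeping the first point seen is an arbitrary choice, and the tuple layout is hypothetical.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

# Hypothetical point layout: (stream identity, start, end, value).
Point = Tuple[Hashable, int, int, float]


def deduplicate(points: Iterable[Point]) -> List[Point]:
    """Keep the first point for each (stream, start, end); drop later ones."""
    first: Dict[Tuple[Hashable, int, int], Point] = {}
    for point in points:
        stream, start, end, _ = point
        first.setdefault((stream, start, end), point)
    return list(first.values())  # dicts preserve insertion order


pts = [
    ("s1", 0, 10, 5.0),
    ("s1", 0, 10, 5.0),   # exact duplicate
    ("s1", 0, 10, 7.0),   # same window, conflicting value from another writer
    ("s1", 10, 20, 3.0),
]
print(deduplicate(pts))  # [('s1', 0, 10, 5.0), ('s1', 10, 20, 3.0)]
```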

Note: Identity is an important concept in most metrics systems. For example,
[Prometheus directly calls out uniqueness](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#metric_relabel_configs):

> Care must be taken with `labeldrop` and `labelkeep` to ensure that metrics
> are still uniquely labeled once the labels are removed.
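
The Prometheus caveat can be checked mechanically: after removing labels, any two series that end up with the same label set have collided and no longer have a single writer. A small Python sketch, assuming series are plain label dictionaries; the helper names are hypothetical and this is not how Prometheus implements relabeling.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, List, Tuple

Labels = Dict[str, str]
LabelSet = FrozenSet[Tuple[str, str]]


def drop_labels(series: Labels, dropped: Iterable[str]) -> LabelSet:
    """Return the series' label set with the given label names removed."""
    drop = set(dropped)
    return frozenset((k, v) for k, v in series.items() if k not in drop)


def collisions(all_series: Iterable[Labels], dropped: Iterable[str]) -> List[LabelSet]:
    """Label sets shared by more than one series after dropping labels."""
    drop_list = list(dropped)  # reused for every series
    counts = Counter(drop_labels(s, drop_list) for s in all_series)
    return [labels for labels, n in counts.items() if n > 1]


series = [
    {"__name__": "requests_total", "instance": "host-1", "route": "/a"},
    {"__name__": "requests_total", "instance": "host-2", "route": "/a"},
]
print(len(collisions(series, ["instance"])))  # 1: both series collapse together
print(collisions(series, ["route"]))          # []: still uniquely labeled
```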

For OTLP, the Single-Writer principle provides a way to reason over error
scenarios and take corrective actions. Additionally, it ensures that
well-behaved systems can perform metric stream manipulation without undesired
degradation or loss of visibility.

## Temporality

Pending

Every OTLP point has two associated timestamps. For OTLP Sum and Histogram
points, the two timestamps indicate when the point was reset and when the sum
was captured. For OTLP Gauge points, the two timestamps indicate when the
measurement was taken and when it was reported as being still the last value.

The notion of temporality refers to a configuration choice made in the system
as a whole, indicating whether reported values incorporate previous
measurements, or not.

- *Cumulative temporality* means that successive data points repeat the starting
timestamp. For example, from start time T<sub>0</sub>, cumulative data points cover time
ranges (T<sub>0</sub>, T<sub>1</sub>), (T<sub>0</sub>, T<sub>2</sub>),
(T<sub>0</sub>, T<sub>3</sub>), and so on.
- *Delta temporality* means that successive data points advance the starting
timestamp. For example, from start time T<sub>0</sub>, delta data points cover time
ranges (T<sub>0</sub>, T<sub>1</sub>), (T<sub>1</sub>, T<sub>2</sub>),
(T<sub>2</sub>, T<sub>3</sub>), and so on.
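
When nothing is lost, the two temporalities carry the same information, so a sum stream can be converted between them. The Python sketch below assumes hypothetical `(start, end, value)` tuples from a single writer, ordered by end time; in the cumulative-to-delta direction a change of start time is treated as a reset, and the reverse direction assumes contiguous deltas with no reset.

```python
from typing import List, Optional, Tuple

Point = Tuple[int, int, float]  # (start, end, value)


def cumulative_to_delta(points: List[Point]) -> List[Point]:
    deltas: List[Point] = []
    t0: Optional[int] = None
    prev_end, prev_value = 0, 0.0
    for start, end, value in points:
        if start != t0:
            # First point, or a reset: a new start time begins a new cumulative run.
            deltas.append((start, end, value))
            t0 = start
        else:
            # Same start time: the delta covers (previous end, this end).
            deltas.append((prev_end, end, value - prev_value))
        prev_end, prev_value = end, value
    return deltas


def delta_to_cumulative(points: List[Point]) -> List[Point]:
    # Assumes contiguous deltas with no reset; every output repeats the first start.
    out: List[Point] = []
    total = 0.0
    for _, end, value in points:
        total += value
        out.append((points[0][0], end, total))
    return out


cumulative = [(0, 1, 5.0), (0, 2, 8.0), (0, 3, 12.0)]
delta = cumulative_to_delta(cumulative)
print(delta)                                     # [(0, 1, 5.0), (1, 2, 3.0), (2, 3, 4.0)]
print(delta_to_cumulative(delta) == cumulative)  # True
```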

The use of cumulative temporality for monotonic sums is common, exemplified by
Prometheus. Systems based on cumulative monotonic sums are naturally simpler, in
terms of the cost of adding reliability. When collection fails intermittently,
gaps in the data are naturally averaged from cumulative measurements.
Cumulative data requires the sender to remember all previous measurements, an
“up-front” memory cost proportional to cardinality.
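
A short Python sketch of the reliability argument, using made-up numbers: if one report is lost, the last cumulative point still carries the full total and the rate across the gap is averaged over the longer interval, whereas a lost delta point removes its increment for good. The `rates` helper is hypothetical.

```python
from typing import List, Tuple

Point = Tuple[int, int, float]  # (start, end, value)

cumulative = [(0, 1, 5.0), (0, 2, 8.0), (0, 3, 12.0)]
delta = [(0, 1, 5.0), (1, 2, 3.0), (2, 3, 4.0)]

# Simulate losing the second report in both streams.
cum_received = [cumulative[0], cumulative[2]]
delta_received = [delta[0], delta[2]]


def rates(points: List[Point]) -> List[float]:
    """Per-interval rates between consecutive cumulative points."""
    return [(b[2] - a[2]) / (b[1] - a[1]) for a, b in zip(points, points[1:])]


print(cum_received[-1][2])                   # 12.0: total is intact
print(rates(cum_received))                   # [3.5]: averaged over the gap
print(sum(v for _, _, v in delta_received))  # 9.0: the lost 3.0 is gone
```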

The use of delta temporality for metric sums is also common, exemplified by
StatsD. There is a connection with OpenTelemetry tracing, in which a Span
event is commonly translated into two metric events (a 1-count and a timing
measurement). Delta temporality enables sampling and supports shifting the cost
of cardinality outside of the process.

## Overlap

Overlap occurs when more than one metric data point occurs for a data stream
within a time window. This is particularly problematic for data points meant
to represent an entire time window, e.g. a Histogram reporting population
density of collected metric data points for a time window. If two of these show
up with overlapping time windows, how do backends handle this situation?

Review comment: IIUC, overlap is the violation of single-writer requirement. Just to link this together with the single writer section, can we call that out?
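
For windowed points such as delta sums or histograms, the overlap condition can be checked by sorting a stream's windows by start time. This Python sketch assumes half-open `(start, end)` windows from a single stream; as noted in the review discussion of duplicates, cumulative points that share a start time are compatible and would need a different rule. The count of pairs found is one possible signal a collector could export for monitoring.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # (start, end), treated as half-open [start, end)


def overlapping_pairs(windows: List[Window]) -> List[Tuple[Window, Window]]:
    """Return pairs of windows in one stream whose time ranges intersect."""
    ordered = sorted(windows)
    pairs: List[Tuple[Window, Window]] = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b[0] >= a[1]:
                break  # sorted by start, so no later window can overlap `a`
            pairs.append((a, b))
    return pairs


print(overlapping_pairs([(0, 10), (10, 20), (15, 25)]))  # [((10, 20), (15, 25))]
```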

We define three principles for handling overlap:

- Resolution (correction via dropping points)
- Observability (allowing the data to flow to backends)
- Interpolation (correction via data manipulation)

### Overlap resolution

When more than one process writes the same metric data stream, OTLP data points
may appear to overlap. This condition typically results from misconfiguration, but
can also result from running identical processes (indicative of operating system
or SDK bugs, like missing
[process attributes](../resource/semantic_conventions/process.md)). When there
are overlapping points, receivers SHOULD eliminate points so that there are no
overlaps. Which data to select in overlapping cases is not specified.

Review comment on lines +333 to +334: L333 and L338: Don't these conflict with each other, or is "receivers" meant to mean backends specifically?
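
Since the text leaves the choice of which data to keep unspecified, any deterministic rule satisfies the resolution principle. The Python sketch below uses one arbitrary rule, keeping points in arrival order and dropping any later point whose window overlaps one already kept; the `(start, end, value)` tuples and the `resolve_overlaps` name are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[int, int, float]  # (start, end, value), window is [start, end)


def resolve_overlaps(points: List[Point]) -> List[Point]:
    """Drop points so that no two kept windows overlap (first arrival wins)."""
    kept: List[Point] = []
    for p in points:
        if all(p[1] <= k[0] or k[1] <= p[0] for k in kept):
            kept.append(p)
    return kept


arrivals = [(0, 10, 4.0), (5, 15, 9.0), (10, 20, 2.0)]
print(resolve_overlaps(arrivals))  # [(0, 10, 4.0), (10, 20, 2.0)]
```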

### Overlap observability

OpenTelemetry collectors SHOULD export telemetry when they observe overlapping
points in data streams, so that the user can monitor for erroneous
configurations.

### Overlap interpolation

When one process starts just as another exits, the appearance of overlapping
points may be expected. In this case, OpenTelemetry collectors SHOULD modify
points at the change-over using interpolation for Sum data points, to reduce
gaps to zero width in these cases, without any overlap.

Review comment on lines +344 to +347: As resource semantic conventions are specified with
So maybe this falls under the advice above – "collectors SHOULD export telemetry when they observe overlapping

## Resources

Review comment: never defined, yet imbued with special meaning in this document

Review comment: It's defined on line 202, and yes this document imbues it with special meaning. We decided in the metric data model SIG that we needed a term to identify an "in-motion" stream of metric data that OpenTelemetry operates on to differentiate it from a Time Series generated from that stream.

Should I move the definition/description of metric data stream further above? Do you need more motivation for why we want a different term here?

Put simply, a lot of your concerns/comments are around generated TimeSeries from open-telemetry, whereas this document outlines what's required for a Metric Data Stream. See my follow-on comments for examples where I think having this distinction helps opentelemetry AND backends.