This repository has been archived by the owner on Dec 6, 2024. It is now read-only.

Rationalize naming of metric instruments and their default aggregations #96

Closed
wants to merge 10 commits into from
Changes from 7 commits
153 changes: 153 additions & 0 deletions text/0096-metric-instrument-terminology.md
@@ -0,0 +1,153 @@
# Rationalize naming of metric instruments and their default aggregations

Propose final names for the seven metric instruments introduced in [OTEP 93](https://github.com/open-telemetry/oteps/pull/93) and address related confusion.

## Motivation

[OTEP 88](https://github.com/open-telemetry/oteps/pull/88) introduced
a logical structure for metric instruments with two foundational
categories of instrument, called "synchronous" vs. "asynchronous",
named "Measure" and "Observer" in the abstract. This proposal
identified four kinds of "refinement" and mapped out the space of
_possible_ instruments, while not proposing which would actually be
included in the standard.

[OTEP 93](https://github.com/open-telemetry/oteps/pull/93) followed
with a list of six standard instruments, the most necessary and useful
combination of instrument refinements, plus one special case used to
record timing measurements.

This proposal finalizes the names used to describe the seven
instruments above, seeking to address core confusion related to
"Measure":

1. OTEP 88 stipulates that the terms currently in use to name
synchronous and asynchronous instruments become abstract, but also
uses "Measure-like" and "Observer-like" to discuss instruments with
refinements. This proposal states that we shall prefer the
adjectives, commonly abbreviated "Sync" and "Async", when describing
instruments.
2. Prior to OTEP 88, but even with OTEPs 88 and 93 included, there is
inconsistency in the naming of instruments. Note that "Counter" and
"Observer" end in "-er", a noun suffix used in the sense of "[person
occupationally connected
with](https://www.merriam-webster.com/dictionary/-er)", while the term
"Measure" does not fit this pattern. This proposal replaces the
abstract term "Measure" with "Recorder", since the
associated method name (verb) is specified as `Record()`.
3. The OTEP 93 asynchronous instruments ("LastValueObserver",
"DeltaObserver", and "CumulativeObserver") have the pattern
"-Observer", while the OTEP 93 synchronous instruments
("Counter", "UpDownCounter", "Distribution", "Timing") follow no
common pattern. This proposal keeps "Counter" and "UpDownCounter" for
the Sum-only synchronous instruments and applies a "-Recorder" pattern
to the other two, yielding "Recorder" and "TimingRecorder".
4. Confusion over the loss of "Gauge" is addressed by replacing
"LastValueObserver" with "GaugeObserver".

This proposal also repeats the current specification of the default
Aggregator for each kind of instrument.

## Explanation

The following table summarizes the four synchronous instruments and
three asynchronous instruments that will be standardized as a result
of this set of proposals.

| Existing name | OTEP 93 name | **Final name** | Sync or Async | Function | Default aggregation | Rate support |
| ------------- | ------------------ | ---------------------- | ----------- | ------------- | ---------- | ---- |
| Counter | Counter | **Counter** | Sync | Add() | Sum | Yes |
| | UpDownCounter | **UpDownCounter** | Sync | Add() | Sum | Yes |
| Measure | Distribution | **Recorder** | Sync | Record() | MinMaxSumCount | No |
| | Timing | **TimingRecorder** | Sync | Record() | MinMaxSumCount | No |
| Observer | LastValueObserver | **GaugeObserver** | Async | Observe() | MinMaxSumCount | No |
Member

I really want this to be Sum because of the memory-usage metric. Look at the Go example for process metrics in open-telemetry/opentelemetry-specification#549 (comment). I don't want to see "process/sys_heap" as a minmaxsumcount by default.

Another argument is default cost (money): if I am using a system like Stackdriver, exporting "minmaxsumcount" will cost 4 times more than just sum. So I think it is important that default aggregations are less "expensive".

PS: I know I work for a company that makes money if the user sends more data :)

Contributor Author

Firstly, I think minmaxsumcount is perfectly sensible for a heap. If you aggregate this across a cluster, you'd like to know what the min/max values are. If you don't aggregate the count as well, the metric system will require domain-level knowledge to construct an average using some other knowledge (i.e., that there's a corresponding count available, implicitly).

Moreover, I've documented (here) how there's an obvious optimization that results in sending just one value for the min/max/sum/count when there is just one value, which reduces the data-size problem for the case you care about. Each machine will report one min/max/sum/count as a single value in the ordinary case, but when we aggregate those values, we'll get the proper min/max/sum and count out--no domain-knowledge required.

Lastly, the metric instrument you're describing is exactly the hypothetical UpDownCumulativeObserver detailed in OTEP 88. If you want a Sum-only aggregation, then you should use a Sum-only instrument. That instrument does exactly what you want. Should we standardize it?

| | DeltaObserver | **DeltaObserver** | Async | Observe() | Sum | Yes |
| | CumulativeObserver | **CumulativeObserver** | Async | Observe() | Sum | Yes |

The argument for "Recorder" instead of "Distribution" is that we
should prefer instrument names that describe the action being
performed ("occupationally connected"), not the value being computed,
as the latter depends on SDK configuration. A "Recorder" records
a value that is part of a distribution. A "Counter" counts a value
that is part of a sum. A "GaugeObserver" observes an instantaneous
value ("reads a gauge"). A "Recorder" records an arbitrary value. A
Contributor

It is confusing that "Recorder" is included twice in this list: once saying it records a distribution and once saying it records an arbitrary value. I think the latter is more accurate. If a view of a Recorder were to 'aggregate' with an array (i.e. no aggregation), then saying a "Recorder" records a value that is part of a distribution, while not incorrect, isn't precise.

"TimingRecorder" records a timing value, and so on.

## Details

This proposal consolidates OTEP 88 and OTEP 93 and proposes a consistent
pattern for naming instruments. It will be the source of truth when
applying OTEP 88 and OTEP 93 to the OpenTelemetry metrics specification.

### Function names

The function names of the standard instruments are determined as
follows.

#### Counter and UpDownCounter

Counter and UpDownCounter instruments use `Add()` as the function
name, since they capture deltas to a Sum-only instrument. We prefer
`Add()` as opposed to `Count()`, since floating point numbers are
supported, to avoid an association with "Countable" numbers, a
mathematical term associated with natural numbers.

#### Recorder and TimingRecorder

Recorder and TimingRecorder use `Record()` as the function name, as in
the existing specification for Measure instruments. This conveys the
fact that these are not a sum, and that individual events are of
importance.

#### Asynchronous instruments

Asynchronous instruments use `Observe()` as the function name. This
signifies that the instrument passively captures a measurement and is
not an active participant, as `Record()` would imply. _Observation_ also
conveys the last-value relationship specified for asynchronous
instruments. The observer can only observe one value at a time, and
the last-observed value wins.
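
The three function names can be contrasted in a minimal sketch. The class bodies below are assumptions made for illustration, not an SDK API: a real SDK would pass values to a configurable aggregator rather than store them on the instrument, and the method names are lower-cased here only to follow Python convention.

```python
class Counter:
    """Sync, Sum-only: callers report deltas with add()."""

    def __init__(self):
        self.total = 0.0

    def add(self, delta):
        # Floating point deltas are allowed, which is why the name is
        # add() rather than count().
        self.total += delta


class Recorder:
    """Sync, not a sum: each individual event matters."""

    def __init__(self):
        # Illustration only: keeping raw events makes the contrast visible.
        self.events = []

    def record(self, value):
        self.events.append(value)


class GaugeObserver:
    """Async: one value per interval, last-observed value wins."""

    def __init__(self):
        self.last = None

    def observe(self, value):
        self.last = value


counter = Counter()
counter.add(1)
counter.add(0.5)

recorder = Recorder()
recorder.record(12.0)
recorder.record(3.0)

gauge = GaugeObserver()
gauge.observe(70.0)
gauge.observe(64.0)  # replaces 70.0 within the same interval
```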

### Default aggregations

This [OTEP 93
conversation](https://github.com/open-telemetry/oteps/pull/93#discussion_r405852507)
raised a question about the default aggregation for GaugeObserver,
given as MinMaxSumCount. Would "Sum" be a more appropriate default?

Note that the distinction between whether the default aggregation is
"Sum" or "MinMaxSumCount" corresponds exactly to whether the
instrument has the Sum-only refinement. "Sum" is the default
aggregation for any Sum-only instrument since, by definition, the
Sum aggregation provides complete information.
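
The rule in the preceding paragraph can be written as a one-line function over the seven instruments. This is a sketch only: the `Instrument` type and its `sum_only` flag are hypothetical names, and the flag values are read off the table in the Explanation section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instrument:
    name: str
    synchronous: bool
    function: str
    sum_only: bool  # hypothetical flag for the Sum-only refinement


INSTRUMENTS = [
    Instrument("Counter", True, "Add", True),
    Instrument("UpDownCounter", True, "Add", True),
    Instrument("Recorder", True, "Record", False),
    Instrument("TimingRecorder", True, "Record", False),
    Instrument("GaugeObserver", False, "Observe", False),
    Instrument("DeltaObserver", False, "Observe", True),
    Instrument("CumulativeObserver", False, "Observe", True),
]


def default_aggregation(instrument):
    # Sum-only instruments default to Sum; everything else defaults
    # to MinMaxSumCount.
    return "Sum" if instrument.sum_only else "MinMaxSumCount"


defaults = {i.name: default_aggregation(i) for i in INSTRUMENTS}
```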

The three instruments with a default "MinMaxSumCount" are all used to
record a value that is, by definition, more than only a sum. In this
case, "complete information" requires recording every value, i.e., no
aggregation. MinMaxSumCount is applied in these cases because it
provides the maximum amount of information that can be recorded using
a fixed number of values, per time series, per collection interval.
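
To illustrate the "fixed number of values" point, here is a minimal sketch of a MinMaxSumCount aggregator. The class and field names are assumptions for this example, not an SDK type. The state is always four numbers regardless of how many values were recorded, and two states can be merged without access to the original values.

```python
import math
from dataclasses import dataclass


@dataclass
class MinMaxSumCount:
    """Four numbers per time series per collection interval."""

    minimum: float = math.inf
    maximum: float = -math.inf
    total: float = 0.0
    count: int = 0

    def update(self, value):
        # Fold one recorded value into the fixed-size state.
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)
        self.total += value
        self.count += 1

    def merge(self, other):
        # Combine two states, e.g. from two hosts in a cluster.
        return MinMaxSumCount(
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
            self.total + other.total,
            self.count + other.count,
        )


host_a = MinMaxSumCount()
for value in (3.0, 9.0):
    host_a.update(value)

host_b = MinMaxSumCount()
host_b.update(5.0)

cluster = host_a.merge(host_b)
```

Because the count is carried along with the sum, an average can be computed from the merged state alone, without outside knowledge of how many values contributed.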

### GaugeObserver aggregation

Why should GaugeObserver aggregate the Min, Max, Sum, and Count when
it is permitted to observe just one measurement per interval? The
default says that when observed values are aggregated they should be
treated like a distribution--we are interested in more than a sum, by
definition. If observing only a sum, the DeltaObserver or
CumulativeObserver should be used instead.

Clearly, when Count equals 1, the Min, Max, and Sum are equal to the
value. Exporters may be able to take advantage of this fact when
exporting data from these instruments. In particular, since it is
known that asynchronous instruments produce only one valiue per
Contributor

Suggested change
known that asynchronous instruments produce only one valiue per
known that asynchronous instruments produce only one value per

interval (with last-value-wins semantics), when we know in the SDK
that no spatial aggregation is configured, we can be sure that Count
equals one, and we can use the most appropriate exposition format for
the target system.

This means Prometheus and Statsd exporters SHOULD export Gauge values
for the GaugeObserver when there is no spatial aggregation being
applied, because that is the natural exposition format for
MinMaxSumCount aggregations when Count equals 1. If there is spatial
aggregation being applied, the default MinMaxSumCount aggregation
still applies.
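
The exporter decision described above might look like the following sketch. The function name, its parameters, and the dictionary output shape are all hypothetical. They stand in for whatever representation a given exporter uses and are not taken from the Prometheus or Statsd client libraries.

```python
def export_gauge_observer(minimum, maximum, total, count, spatial_aggregation):
    """Pick an exposition format for one GaugeObserver data point."""
    if not spatial_aggregation and count == 1:
        # With a single observation, min == max == sum == the value,
        # so a plain gauge carries the same information.
        return {"kind": "gauge", "value": total}
    # Otherwise keep the default MinMaxSumCount aggregation.
    return {
        "kind": "min_max_sum_count",
        "min": minimum,
        "max": maximum,
        "sum": total,
        "count": count,
    }


single = export_gauge_observer(64.0, 64.0, 64.0, 1, spatial_aggregation=False)
merged = export_gauge_observer(3.0, 9.0, 17.0, 3, spatial_aggregation=True)
```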