[collectd 6] New plugin (sort of): OpenTelemetry receiver #4271

Merged (22 commits) on Feb 20, 2024

@octo (Member) commented Feb 2, 2024

This PR adds a gRPC server to collectd, implementing the OpenTelemetry MetricsService. This allows collectd to receive metrics via OTLP, for example from the OpenTelemetry Collector.

This new code and the write_open_telemetry plugin have been merged into one plugin, called "open_telemetry". Users can enable the exporter, the receiver, or both using the config file, just like they could in the network plugin in collectd 5.

Known limitations:

  • Non-monotonic sums are rejected. We could either convert them to "gauge" or re-introduce "derive" metrics to get them to work.
  • Counter resets are not yet detected. Mostly because this needs additional code in the "cache" and this PR is already quite large.
  • Histograms, exponential histograms, and summary metrics are rejected.
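
To illustrate the second limitation, counter-reset detection for a cumulative, monotonic counter might look roughly like the sketch below. This is not code from the PR: the `CounterDelta` type and the `counter_delta` function are hypothetical names, and the sketch assumes that a smaller value than the previous one always means a reset (it does not distinguish a reset from a 64-bit wrap-around).

```cpp
#include <cstdint>

// Hypothetical result type, not taken from the PR.
struct CounterDelta {
  bool reset;             // true if the counter appears to have been reset
  std::uint64_t increase; // increase since the previous point; 0 on reset
};

// Compares two consecutive cumulative values of a monotonic counter.
// A monotonic counter never decreases, so a smaller current value is
// treated as a sign that the sender restarted counting.
CounterDelta counter_delta(std::uint64_t previous, std::uint64_t current) {
  if (current < previous) {
    return CounterDelta{true, 0};
  }
  return CounterDelta{false, current - previous};
}
```

In a real implementation the previous value would have to come from the metric cache, which is why the description says this needs additional code there.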

ChangeLog: OpenTelemetry plugin: This new plugin provides the ability to export as well as receive metrics via OTLP. It supersedes the write_open_telemetry plugin.

src/open_telemetry_receiver.cc
Comment on lines +228 to +210
// TODO(octo): convert to gauge instead?
DEBUG("open_telemetry plugin: non-monotonic sums (aka. UpDownCounters) "
"are unsupported");
Contributor

A lot of OTEL metrics are UpDownCounters:

Both the UpDownCounter and the OTEL Counter type are (listed as) signed, whereas collectd counters are (correctly) unsigned.

PS: apparently a Gauge, for example, can be either Int or Double: https://opentelemetry.io/docs/specs/semconv/system/hardware-metrics/#hwbattery---battery-metrics
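
One way to deal with the signedness mismatch described above is to reject negative values when converting a signed 64-bit integer data point into an unsigned counter value. The sketch below is an illustration only: `to_unsigned_counter` is a hypothetical helper, not a function from collectd or from this PR, and it assumes the counter is stored as an unsigned 64-bit integer.

```cpp
#include <cstdint>

// Hypothetical helper, not part of collectd: converts a signed 64-bit
// integer data point into an unsigned counter value.
// Returns false and leaves *out untouched if the value is negative,
// because a negative value cannot be represented as an unsigned counter.
bool to_unsigned_counter(std::int64_t value, std::uint64_t *out) {
  if (value < 0) {
    return false;
  }
  *out = static_cast<std::uint64_t>(value);
  return true;
}
```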

Member Author

I have the impression that OpenTelemetry itself is a bit confused about the role of UpDownCounter:

From the description of UpDownCounters:

An UpDownCounter is intended for scenarios where the absolute values are not pre-calculated, or fetching the “current value” requires extra effort.

This seems to put emphasis on the cumulative/temporal aspect: I can report that I restocked or sold n items without knowing exactly how many of those items I have in stock. I.e. we compare the value at time t with the value at t-1 and report the difference. For me that implies that only the difference to the previous point (or the rate) is relevant and the absolute value of the cumulative metric is meaningless.

However, from the description of Gauge:

Note: If the values are additive (e.g. the process heap size - it makes sense to report the heap size from multiple processes and sum them up, so we get the total heap usage), use UpDownCounter.

This puts the emphasis on the aggregation aspect: does it make sense to sum up different instances of this metric? Yes for memory usage, no for temperature => memory usage is UpDownCounter, temperature is Gauge. That would imply that the absolute value of an UpDownCounter is meaningful and calculating the rate, e.g. of "used memory" is not.
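
The two readings contrasted above can be made concrete with a small sketch. Both function names are hypothetical and chosen for illustration only; neither appears in the PR or in the OpenTelemetry API.

```cpp
#include <cstdint>
#include <vector>

// Temporal reading: only the change between the point at t-1 and the
// point at t is meaningful, e.g. "stock changed by -25 items".
std::int64_t change_since_previous(std::int64_t previous,
                                   std::int64_t current) {
  return current - previous;
}

// Aggregation reading: the absolute values are meaningful and can be
// summed across instances, e.g. heap size over several processes.
std::int64_t total_across_instances(const std::vector<std::int64_t> &values) {
  std::int64_t total = 0;
  for (std::int64_t v : values) {
    total += v;
  }
  return total;
}
```

For example, `change_since_previous(120, 95)` yields -25, while `total_across_instances({100, 250, 50})` yields 400. Under the first reading the value 95 on its own carries no meaning; under the second reading it does.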

Contributor

Not sure what should be done here, but a TODO is fine for this PR. Let's leave this open as a reminder.

Member Author

@octo octo Feb 20, 2024

Looking through the chat history of the OpenTelemetry Slack, it's clear that UpDownCounters have "gauge-like semantics". While I vehemently disagree with the naming of the instrument and the mapping in the wire protocol, I think this ship has sailed and we should just accept that there are two "gauge-like" metric types.

Implementation in #4287.
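
The conclusion reached in this thread amounts to a mapping along the following lines. This is a minimal sketch and not the code from #4287: the `OtlpKind` and `CollectdType` enums and the `map_kind` function are hypothetical names, and the sketch assumes that histograms, exponential histograms, and summaries remain rejected as listed in the PR description.

```cpp
// Hypothetical classification of incoming OTLP metrics.
enum class OtlpKind {
  Gauge,
  MonotonicSum,    // Counter
  NonMonotonicSum, // UpDownCounter
  Histogram,
  ExponentialHistogram,
  Summary
};

// Hypothetical target types on the collectd side.
enum class CollectdType { Gauge, Counter, Unsupported };

// Maps an OTLP metric kind to a collectd metric type, treating
// non-monotonic sums as a second "gauge-like" type.
CollectdType map_kind(OtlpKind kind) {
  switch (kind) {
  case OtlpKind::Gauge:
  case OtlpKind::NonMonotonicSum:
    return CollectdType::Gauge;
  case OtlpKind::MonotonicSum:
    return CollectdType::Counter;
  case OtlpKind::Histogram:
  case OtlpKind::ExponentialHistogram:
  case OtlpKind::Summary:
    return CollectdType::Unsupported;
  }
  return CollectdType::Unsupported;
}
```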

Member Author

@octo octo left a comment

Thanks @eero-t!

Contributor

@eero-t eero-t left a comment

Looks OK, just a couple of minor notes that could be handled before merging.

octo added 19 commits February 20, 2024 15:28
octo added 2 commits February 20, 2024 15:28
* Set field to `NULL` after freeing.
* Remove unused global variable.
@octo octo merged commit cd231ae into collectd:collectd-6.0 Feb 20, 2024
23 checks passed
@octo octo deleted the 6/otelcol branch February 20, 2024 14:41