feat!: OpenTelemetry metrics #13232

Joibel · 2024-06-21T11:17:43Z

Note to reviewers, this is a draft to check CI at the moment.

Motivation

As discussed in #12589 OpenTelemetry is the future of observability. This PR changes the underlying codebase to collect metrics using the otel libraries, whilst attempting to retain compatibility with prometheus scraping where this makes sense.

It helps lay the groundwork for adding workflow tracing using otel.

This PR amends and extends the built in metrics to be more useful and correctly named.

This PR does not attempt do do anything else with otel, nor does it attempt to change custom metrics.

Modifications

Mostly removed prometheus libraries, and replaced with otel libraries.

Allows use of both prometheus /metrics scraping and opentelemetry protocol transmission.

See the migration guide for a quick summary of the changes.

See the metrics documentation for more details on new metrics.

Extends the workqueue metrics to include all of them exposed by client-go/workqueue

Allows removal of metrics, removal of attributes (labels) and reconfiguration of histogram buckets for all metrics using options in the workflow-controller-configmap

Allows temporality configuration for opentelemetry counters.

Adds a number of other new metrics:

Removed argo_workflows_workflows_processed_count as a non-functioning metric.

originName enhancements proposed in #12589 were not added.

Verification

Signed-off-by: Alan Clucas <[email protected]>

agilgur5

I only briefly went through this, with particular focus on the docs and a bit on the config -- just an initial quick review. There's a good amount of docs changes needed.

As I've mentioned in other docs PRs you've had, please follow the docs style guides. In this quick look through, I particularly noticed many misses on the k8s style guide regarding "Simple and direct language" and our style guide regarding "Prefer 1 sentence per line of markdown".
The latter is explicit, while the former contains examples -- if you're struggling with that, a rough mindset you can use while reading/writing/editing is "can I say this with less words?" (while ofc still following the rest of the style guide, existing conventions, and without making things very confusing). The answer is very often "yes" as this style of writing can be quite different from the way we speak or respond to users etc (you and I in particular use more "level of confidence" or "hedging" words as well as other descriptors). For instance, in the very preceding sentence, can remove "very", "quite", etc to simplify as: "Often this writing style is different from others".

docs/metrics-3.6.md

docs/metrics.md

.spelling

docs/metrics-3.6.md

agilgur5 · 2024-06-22T02:52:27Z

docs/metrics-3.6.md

+
+To disable this set `.metricsConfig.secure` to `false`.
+
+## New metrics


hmm, this isn't a breaking change... we typically don't describe additions in the upgrading notes

It is a breaking change for anyone who doesn't explicitly set secure in their configmap.

I was commenting on the "New Metrics" section (that's an addition), not the TLS bit. It's also the longest section, so removing it would make the upgrade notes a lot shorter.

I think conventionally we just add new features to the existing docs (no specific upgrade note for them in the docs, though maybe in the release blog post) and add the "v3.6 and after" modifier where needed.

New metrics will get consumed into your data collection tool by default, which may be costly. It feels important to warn people that they could be in for an increased bill by upgrading if they don't do something about it. I'd only expect people to read the upgrading guide.

docs/metrics.md

agilgur5

Given how large this is, I wonder if we should split this into multiple PRs for readability & reviewability.

For instance:

Prometheus -> OTel
New metrics (could potentially be further split if necessary, e.g. workqueue vs own metrics)
New configurations (could potentially be further split if necessary, e.g. individual metric config vs the rest)

agilgur5 · 2024-06-22T03:19:06Z

docs/workflow-controller-configmap.yaml

+    # Which temporality to use for opentelemetry, defaults to Cumulative
+    temporality: Delta
+    # Options for configuring individual metrics
+    options:


I feel like there's a better name for this, but it is evading me right now

I've changed it to "modifiers" locally, which is the best I have for now.

That does sound better, although also feels like it could be better 😅

Suggestions welcome of course

Signed-off-by: Alan Clucas <[email protected]>

Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Alan Clucas <[email protected]>

Joibel · 2024-06-24T11:35:05Z

Thanks for all of the review comments.

I wasn't really ready for a review yet (as noted by this being in draft and the big comment at the top of the description), but I've tried to accommodate your comments.

Splitting this would be pretty non-fun given everything except for the base commit would want rebasing every time anything in the base commit changes. I'd tried to give fair warning in #12589 that I'd be doing what I've done.

The config changes are pretty small.
The new metrics are nearly all self contained.

Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Alan Clucas <[email protected]>

Signed-off-by: Alan Clucas <[email protected]>

agilgur5 · 2024-06-24T18:21:34Z

I wasn't really ready for a review yet (as noted by this being in draft and the big comment at the top of the description), but I've tried to accommodate your comments.

Yes I noticed that of course. I didn't go through the code at all and not even the entirety of docs or config (nor as thorough as I can be), so kept it quick, high-level, and not CI impacting. I figure you might want early feedback if you put up a draft and that early feedback is always better than later feedback. You don't have to respond to the comments immediately, we can just keep a running list of feedback until it's more ready and then can address them -- usually, in my experience, drafts have an expectedly slower review cycle on both ends.

agilgur5 · 2024-06-24T18:49:05Z

Splitting this would be pretty non-fun given everything except for the base commit would want rebasing every time anything in the base commit changes.

PR stacks a la Graphite or Sapling (etc) can help automate that (both are overlays on top of Git; they're not as useful without their respective review tools though, and dedicated review tools are better than GitHub code review 9 times out of 10). I still do it all manually myself though, but it's exactly what PR stacks are designed for.
But the PRs sequential and largely contained, as you wrote, so there wouldn't be too many conflicts and they'd still be reviewed sequentially given that GH does not support PR stacks (so a subsequent diff will contain everything until the prior one is merged).

I'd tried to give fair warning in #12589 that I'd be doing what I've done.

As the main reviewer there, I'm well aware of that. That doesn't change any readability or reviewability concerns.
The config changes in particular are new and expose a new API surface that we'd want to try to keep more stable.
There are also more new metrics than you originally wrote about.

The config changes are pretty small.
The new metrics are nearly all self contained.

At ~5k LoC changed across 64 files -- with almost none of it being generated code either -- anything can help. The new metrics in particular I would think might not need much iteration, but the OTel, config, and docs changes definitely need a careful eye given how fundamental they are as well as to reduce chance of future breakage and make as understandable to the user as possible.

Joibel · 2024-06-28T13:33:12Z

Replaced by #13265

Joibel changed the title ~~feat: opentelemetry metrics~~ feat: OpenTelemetry metrics Jun 21, 2024

Joibel force-pushed the otel branch 8 times, most recently from bdbb830 to f5904b7 Compare June 21, 2024 14:15

feat: opentelemetry metrics

51843cc

Signed-off-by: Alan Clucas <[email protected]>

Joibel force-pushed the otel branch from f5904b7 to 51843cc Compare June 21, 2024 14:20

agilgur5 added area/controller Controller issues, panics area/metrics labels Jun 22, 2024

agilgur5 suggested changes Jun 22, 2024

View reviewed changes

agilgur5 reviewed Jun 22, 2024

View reviewed changes

docs/metrics.md Outdated Show resolved Hide resolved

agilgur5 reviewed Jun 22, 2024

View reviewed changes

Joibel and others added 10 commits June 24, 2024 09:40

fix: e2e metrics test

70eabf7

Signed-off-by: Alan Clucas <[email protected]>

Update docs/metrics.md

87f3dd4

Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Alan Clucas <[email protected]>

Update docs/metrics-3.6.md

148a334

Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Alan Clucas <[email protected]>

Update docs/metrics-3.6.md

d342f20

Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Alan Clucas <[email protected]>

Update docs/metrics-3.6.md

7aa8eb4

Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Alan Clucas <[email protected]>

Update docs/metrics-3.6.md

80c23fd

Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Alan Clucas <[email protected]>

Update docs/metrics.md

56cac32

Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Alan Clucas <[email protected]>

Update .spelling

4e5b4f3

Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Alan Clucas <[email protected]>

Update docs/metrics-3.6.md

088f5f4

Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Alan Clucas <[email protected]>

Update docs/metrics.md

67755a4

Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Alan Clucas <[email protected]>

Joibel and others added 2 commits June 24, 2024 12:43

Update docs/metrics.md

c9e6d25

Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Alan Clucas <[email protected]>

some renames

5031110

Signed-off-by: Alan Clucas <[email protected]>

Joibel force-pushed the otel branch from ae9c924 to 5031110 Compare June 24, 2024 12:09

docs: notes on prometheus

41efbd7

Signed-off-by: Alan Clucas <[email protected]>

Joibel force-pushed the otel branch from a09fd17 to 41efbd7 Compare June 24, 2024 14:28

Upgrading docs

7107b2c

Signed-off-by: Alan Clucas <[email protected]>

Joibel force-pushed the otel branch from 3e936ff to 7107b2c Compare June 24, 2024 15:52

agilgur5 changed the title ~~feat: OpenTelemetry metrics~~ feat!: OpenTelemetry metrics Jun 24, 2024

Joibel mentioned this pull request Jun 28, 2024

feat!: OpenTelemetry metrics #13265

Merged

Joibel closed this Jun 28, 2024

agilgur5 added the solution/superseded This PR or issue has been superseded by another one (slightly different from a duplicate) label Jun 30, 2024

Joibel deleted the otel branch July 9, 2024 13:53

agilgur5 self-assigned this Jul 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat!: OpenTelemetry metrics #13232

feat!: OpenTelemetry metrics #13232

Joibel commented Jun 21, 2024 •

edited

Loading

agilgur5 left a comment •

edited

Loading

agilgur5 Jun 22, 2024

Joibel Jun 24, 2024

agilgur5 Jun 24, 2024 •

edited

Loading

Joibel Jun 25, 2024

agilgur5 left a comment •

edited

Loading

agilgur5 Jun 22, 2024

Joibel Jun 24, 2024

agilgur5 Jun 24, 2024

Joibel Jun 25, 2024

Joibel commented Jun 24, 2024

agilgur5 commented Jun 24, 2024 •

edited

Loading

agilgur5 commented Jun 24, 2024

Joibel commented Jun 28, 2024


		To disable this set `.metricsConfig.secure` to `false`.

		## New metrics

feat!: OpenTelemetry metrics #13232

feat!: OpenTelemetry metrics #13232

Conversation

Joibel commented Jun 21, 2024 • edited Loading

Note to reviewers, this is a draft to check CI at the moment.

Motivation

Modifications

Verification

agilgur5 left a comment • edited Loading

Choose a reason for hiding this comment

agilgur5 Jun 22, 2024

Choose a reason for hiding this comment

Joibel Jun 24, 2024

Choose a reason for hiding this comment

agilgur5 Jun 24, 2024 • edited Loading

Choose a reason for hiding this comment

Joibel Jun 25, 2024

Choose a reason for hiding this comment

agilgur5 left a comment • edited Loading

Choose a reason for hiding this comment

agilgur5 Jun 22, 2024

Choose a reason for hiding this comment

Joibel Jun 24, 2024

Choose a reason for hiding this comment

agilgur5 Jun 24, 2024

Choose a reason for hiding this comment

Joibel Jun 25, 2024

Choose a reason for hiding this comment

Joibel commented Jun 24, 2024

agilgur5 commented Jun 24, 2024 • edited Loading

agilgur5 commented Jun 24, 2024

Joibel commented Jun 28, 2024

Joibel commented Jun 21, 2024 •

edited

Loading

agilgur5 left a comment •

edited

Loading

agilgur5 Jun 24, 2024 •

edited

Loading

agilgur5 left a comment •

edited

Loading

agilgur5 commented Jun 24, 2024 •

edited

Loading