Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat!: OpenTelemetry metrics #13232

Closed
wants to merge 15 commits into from
Closed

feat!: OpenTelemetry metrics #13232

wants to merge 15 commits into from

Conversation

Joibel
Copy link
Member

@Joibel Joibel commented Jun 21, 2024

Note to reviewers, this is a draft to check CI at the moment.

Led by #12589.

Motivation

As discussed in #12589 OpenTelemetry is the future of observability. This PR changes the underlying codebase to collect metrics using the otel libraries, whilst attempting to retain compatibility with prometheus scraping where this makes sense.

It helps lay the groundwork for adding workflow tracing using otel.

This PR amends and extends the built in metrics to be more useful and correctly named.

This PR does not attempt do do anything else with otel, nor does it attempt to change custom metrics.

Modifications

Mostly removed prometheus libraries, and replaced with otel libraries.

Allows use of both prometheus /metrics scraping and opentelemetry protocol transmission.

See the migration guide for a quick summary of the changes.

See the metrics documentation for more details on new metrics.

Extends the workqueue metrics to include all of them exposed by client-go/workqueue

Allows removal of metrics, removal of attributes (labels) and reconfiguration of histogram buckets for all metrics using options in the workflow-controller-configmap

Allows temporality configuration for opentelemetry counters.

Adds a number of other new metrics:

Removed argo_workflows_workflows_processed_count as a non-functioning metric.

originName enhancements proposed in #12589 were not added.

Verification

@Joibel Joibel changed the title feat: opentelemetry metrics feat: OpenTelemetry metrics Jun 21, 2024
@Joibel Joibel force-pushed the otel branch 8 times, most recently from bdbb830 to f5904b7 Compare June 21, 2024 14:15
Signed-off-by: Alan Clucas <[email protected]>
Copy link

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only briefly went through this, with particular focus on the docs and a bit on the config -- just an initial quick review. There's a good amount of docs changes needed.

As I've mentioned in other docs PRs you've had, please follow the docs style guides. In this quick look through, I particularly noticed many misses on the k8s style guide regarding "Simple and direct language" and our style guide regarding "Prefer 1 sentence per line of markdown".
The latter is explicit, while the former contains examples -- if you're struggling with that, a rough mindset you can use while reading/writing/editing is "can I say this with less words?" (while ofc still following the rest of the style guide, existing conventions, and without making things very confusing). The answer is very often "yes" as this style of writing can be quite different from the way we speak or respond to users etc (you and I in particular use more "level of confidence" or "hedging" words as well as other descriptors). For instance, in the very preceding sentence, can remove "very", "quite", etc to simplify as: "Often this writing style is different from others".

docs/metrics-3.6.md Outdated Show resolved Hide resolved
docs/metrics.md Outdated Show resolved Hide resolved
docs/metrics.md Outdated Show resolved Hide resolved
.spelling Outdated Show resolved Hide resolved
.spelling Outdated Show resolved Hide resolved
docs/metrics-3.6.md Outdated Show resolved Hide resolved
docs/metrics-3.6.md Outdated Show resolved Hide resolved
docs/metrics-3.6.md Outdated Show resolved Hide resolved
docs/metrics-3.6.md Outdated Show resolved Hide resolved

To disable this set `.metricsConfig.secure` to `false`.

## New metrics

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, this isn't a breaking change... we typically don't describe additions in the upgrading notes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a breaking change for anyone who doesn't explicitly set secure in their configmap.

Copy link

@agilgur5 agilgur5 Jun 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was commenting on the "New Metrics" section (that's an addition), not the TLS bit. It's also the longest section, so removing it would make the upgrade notes a lot shorter.

I think conventionally we just add new features to the existing docs (no specific upgrade note for them in the docs, though maybe in the release blog post) and add the "v3.6 and after" modifier where needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New metrics will get consumed into your data collection tool by default, which may be costly. It feels important to warn people that they could be in for an increased bill by upgrading if they don't do something about it. I'd only expect people to read the upgrading guide.

docs/metrics.md Outdated Show resolved Hide resolved
Copy link

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given how large this is, I wonder if we should split this into multiple PRs for readability & reviewability.

For instance:

  • Prometheus -> OTel
  • New metrics (could potentially be further split if necessary, e.g. workqueue vs own metrics)
  • New configurations (could potentially be further split if necessary, e.g. individual metric config vs the rest)

# Which temporality to use for opentelemetry, defaults to Cumulative
temporality: Delta
# Options for configuring individual metrics
options:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like there's a better name for this, but it is evading me right now

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed it to "modifiers" locally, which is the best I have for now.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That does sound better, although also feels like it could be better 😅

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestions welcome of course

Joibel and others added 10 commits June 24, 2024 09:40
Signed-off-by: Alan Clucas <[email protected]>
Co-authored-by: Anton Gilgur <[email protected]>
Signed-off-by: Alan Clucas <[email protected]>
Co-authored-by: Anton Gilgur <[email protected]>
Signed-off-by: Alan Clucas <[email protected]>
Co-authored-by: Anton Gilgur <[email protected]>
Signed-off-by: Alan Clucas <[email protected]>
Co-authored-by: Anton Gilgur <[email protected]>
Signed-off-by: Alan Clucas <[email protected]>
Co-authored-by: Anton Gilgur <[email protected]>
Signed-off-by: Alan Clucas <[email protected]>
Co-authored-by: Anton Gilgur <[email protected]>
Signed-off-by: Alan Clucas <[email protected]>
Co-authored-by: Anton Gilgur <[email protected]>
Signed-off-by: Alan Clucas <[email protected]>
Co-authored-by: Anton Gilgur <[email protected]>
Signed-off-by: Alan Clucas <[email protected]>
Co-authored-by: Anton Gilgur <[email protected]>
Signed-off-by: Alan Clucas <[email protected]>
@Joibel
Copy link
Member Author

Joibel commented Jun 24, 2024

Thanks for all of the review comments.

I wasn't really ready for a review yet (as noted by this being in draft and the big comment at the top of the description), but I've tried to accommodate your comments.

Splitting this would be pretty non-fun given everything except for the base commit would want rebasing every time anything in the base commit changes. I'd tried to give fair warning in #12589 that I'd be doing what I've done.

The config changes are pretty small.
The new metrics are nearly all self contained.

Joibel and others added 2 commits June 24, 2024 12:43
Co-authored-by: Anton Gilgur <[email protected]>
Signed-off-by: Alan Clucas <[email protected]>
Signed-off-by: Alan Clucas <[email protected]>
Signed-off-by: Alan Clucas <[email protected]>
Signed-off-by: Alan Clucas <[email protected]>
@agilgur5 agilgur5 changed the title feat: OpenTelemetry metrics feat!: OpenTelemetry metrics Jun 24, 2024
@agilgur5
Copy link

agilgur5 commented Jun 24, 2024

I wasn't really ready for a review yet (as noted by this being in draft and the big comment at the top of the description), but I've tried to accommodate your comments.

Yes I noticed that of course. I didn't go through the code at all and not even the entirety of docs or config (nor as thorough as I can be), so kept it quick, high-level, and not CI impacting. I figure you might want early feedback if you put up a draft and that early feedback is always better than later feedback. You don't have to respond to the comments immediately, we can just keep a running list of feedback until it's more ready and then can address them -- usually, in my experience, drafts have an expectedly slower review cycle on both ends.

@agilgur5
Copy link

Splitting this would be pretty non-fun given everything except for the base commit would want rebasing every time anything in the base commit changes.

PR stacks a la Graphite or Sapling (etc) can help automate that (both are overlays on top of Git; they're not as useful without their respective review tools though, and dedicated review tools are better than GitHub code review 9 times out of 10). I still do it all manually myself though, but it's exactly what PR stacks are designed for.
But the PRs sequential and largely contained, as you wrote, so there wouldn't be too many conflicts and they'd still be reviewed sequentially given that GH does not support PR stacks (so a subsequent diff will contain everything until the prior one is merged).

I'd tried to give fair warning in #12589 that I'd be doing what I've done.

As the main reviewer there, I'm well aware of that. That doesn't change any readability or reviewability concerns.
The config changes in particular are new and expose a new API surface that we'd want to try to keep more stable.
There are also more new metrics than you originally wrote about.

The config changes are pretty small.
The new metrics are nearly all self contained.

At ~5k LoC changed across 64 files -- with almost none of it being generated code either -- anything can help. The new metrics in particular I would think might not need much iteration, but the OTel, config, and docs changes definitely need a careful eye given how fundamental they are as well as to reduce chance of future breakage and make as understandable to the user as possible.

@Joibel
Copy link
Member Author

Joibel commented Jun 28, 2024

Replaced by #13265

@Joibel Joibel closed this Jun 28, 2024
@agilgur5 agilgur5 added the solution/superseded This PR or issue has been superseded by another one (slightly different from a duplicate) label Jun 30, 2024
@Joibel Joibel deleted the otel branch July 9, 2024 13:53
@agilgur5 agilgur5 self-assigned this Jul 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/controller Controller issues, panics area/metrics solution/superseded This PR or issue has been superseded by another one (slightly different from a duplicate)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants