-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat!: OpenTelemetry metrics #13232
feat!: OpenTelemetry metrics #13232
Conversation
bdbb830
to
f5904b7
Compare
Signed-off-by: Alan Clucas <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only briefly went through this, with particular focus on the docs and a bit on the config -- just an initial quick review. There's a good amount of docs changes needed.
As I've mentioned in other docs PRs you've had, please follow the docs style guides. In this quick look through, I particularly noticed many misses on the k8s style guide regarding "Simple and direct language" and our style guide regarding "Prefer 1 sentence per line of markdown".
The latter is explicit, while the former contains examples -- if you're struggling with that, a rough mindset you can use while reading/writing/editing is "can I say this with less words?" (while ofc still following the rest of the style guide, existing conventions, and without making things very confusing). The answer is very often "yes" as this style of writing can be quite different from the way we speak or respond to users etc (you and I in particular use more "level of confidence" or "hedging" words as well as other descriptors). For instance, in the very preceding sentence, can remove "very", "quite", etc to simplify as: "Often this writing style is different from others".
docs/metrics-3.6.md
Outdated
|
||
To disable this set `.metricsConfig.secure` to `false`. | ||
|
||
## New metrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, this isn't a breaking change... we typically don't describe additions in the upgrading notes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a breaking change for anyone who doesn't explicitly set secure
in their configmap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was commenting on the "New Metrics" section (that's an addition), not the TLS bit. It's also the longest section, so removing it would make the upgrade notes a lot shorter.
I think conventionally we just add new features to the existing docs (no specific upgrade note for them in the docs, though maybe in the release blog post) and add the "v3.6 and after" modifier where needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New metrics will get consumed into your data collection tool by default, which may be costly. It feels important to warn people that they could be in for an increased bill by upgrading if they don't do something about it. I'd only expect people to read the upgrading guide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given how large this is, I wonder if we should split this into multiple PRs for readability & reviewability.
For instance:
- Prometheus -> OTel
- New metrics (could potentially be further split if necessary, e.g. workqueue vs own metrics)
- New configurations (could potentially be further split if necessary, e.g. individual metric config vs the rest)
# Which temporality to use for opentelemetry, defaults to Cumulative | ||
temporality: Delta | ||
# Options for configuring individual metrics | ||
options: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like there's a better name for this, but it is evading me right now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed it to "modifiers" locally, which is the best I have for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That does sound better, although also feels like it could be better 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestions welcome of course
Signed-off-by: Alan Clucas <[email protected]>
Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Alan Clucas <[email protected]>
Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Alan Clucas <[email protected]>
Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Alan Clucas <[email protected]>
Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Alan Clucas <[email protected]>
Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Alan Clucas <[email protected]>
Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Alan Clucas <[email protected]>
Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Alan Clucas <[email protected]>
Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Alan Clucas <[email protected]>
Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Alan Clucas <[email protected]>
Thanks for all of the review comments. I wasn't really ready for a review yet (as noted by this being in draft and the big comment at the top of the description), but I've tried to accommodate your comments. Splitting this would be pretty non-fun given everything except for the base commit would want rebasing every time anything in the base commit changes. I'd tried to give fair warning in #12589 that I'd be doing what I've done. The config changes are pretty small. |
Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Alan Clucas <[email protected]>
Signed-off-by: Alan Clucas <[email protected]>
Signed-off-by: Alan Clucas <[email protected]>
Signed-off-by: Alan Clucas <[email protected]>
Yes I noticed that of course. I didn't go through the code at all and not even the entirety of docs or config (nor as thorough as I can be), so kept it quick, high-level, and not CI impacting. I figure you might want early feedback if you put up a draft and that early feedback is always better than later feedback. You don't have to respond to the comments immediately, we can just keep a running list of feedback until it's more ready and then can address them -- usually, in my experience, drafts have an expectedly slower review cycle on both ends. |
PR stacks a la Graphite or Sapling (etc) can help automate that (both are overlays on top of Git; they're not as useful without their respective review tools though, and dedicated review tools are better than GitHub code review 9 times out of 10). I still do it all manually myself though, but it's exactly what PR stacks are designed for.
As the main reviewer there, I'm well aware of that. That doesn't change any readability or reviewability concerns.
At ~5k LoC changed across 64 files -- with almost none of it being generated code either -- anything can help. The new metrics in particular I would think might not need much iteration, but the OTel, config, and docs changes definitely need a careful eye given how fundamental they are as well as to reduce chance of future breakage and make as understandable to the user as possible. |
Replaced by #13265 |
Note to reviewers, this is a draft to check CI at the moment.
Led by #12589.
Motivation
As discussed in #12589 OpenTelemetry is the future of observability. This PR changes the underlying codebase to collect metrics using the otel libraries, whilst attempting to retain compatibility with prometheus scraping where this makes sense.
It helps lay the groundwork for adding workflow tracing using otel.
This PR amends and extends the built in metrics to be more useful and correctly named.
This PR does not attempt do do anything else with otel, nor does it attempt to change custom metrics.
Modifications
Mostly removed prometheus libraries, and replaced with otel libraries.
Allows use of both prometheus
/metrics
scraping and opentelemetry protocol transmission.See the migration guide for a quick summary of the changes.
See the metrics documentation for more details on new metrics.
Extends the workqueue metrics to include all of them exposed by client-go/workqueue
Allows removal of metrics, removal of attributes (labels) and reconfiguration of histogram buckets for all metrics using
options
in the workflow-controller-configmapAllows temporality configuration for opentelemetry counters.
Adds a number of other new metrics:
Removed
argo_workflows_workflows_processed_count
as a non-functioning metric.originName
enhancements proposed in #12589 were not added.Verification