proposal: metrics overhaul including opentelemetry #12589

Closed
Joibel opened this issue Jan 30, 2024 · 10 comments

@Joibel
Member

Joibel commented Jan 30, 2024

Proposal for improving workflows metrics

This proposal is an overhaul of workflow metrics to bring them up to date, correct them, and make sure they are useful. It is intended to be a breaking change requiring users to update their use of metrics.

OpenTelemetry

I'd like to move to the OpenTelemetry (otel) libraries for collecting metrics. They are mature, and the design of their interfaces encourages good practices.

We'd still be able to expose the resulting metrics on a Prometheus-compatible /metrics endpoint, scrapable by any Prometheus-compatible scraper.

It would also allow us to emit the same metrics using the OpenTelemetry protocol, with the benefits that provides, and later to emit tracing using the same protocol and libraries. I do not intend to implement tracing as part of this proposal.

Each transmission protocol would be separately configured. Data collection would be identical in the code no matter which protocols are in use.
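
As a rough illustration (a minimal sketch, not the actual controller code; the meter name, metric name, and port below are made up), the otel Go SDK lets a single MeterProvider feed both a Prometheus scrape endpoint and an OTLP exporter, so the instrumentation calls stay identical whichever protocols are enabled:

```go
// Minimal sketch: one MeterProvider, two readers. Metric and meter names,
// the OTLP destination (taken from the SDK's default environment variables)
// and the listen port are all illustrative.
package main

import (
	"context"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	otelprom "go.opentelemetry.io/otel/exporters/prometheus"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	ctx := context.Background()

	// Reader 1: Prometheus exporter, registered with the default registerer,
	// so the metrics can be scraped from /metrics.
	promExporter, err := otelprom.New()
	if err != nil {
		log.Fatal(err)
	}

	// Reader 2: OTLP over gRPC, pushed periodically to a collector.
	otlpExporter, err := otlpmetricgrpc.New(ctx)
	if err != nil {
		log.Fatal(err)
	}

	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(promExporter),
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(otlpExporter)),
	)
	defer provider.Shutdown(ctx)

	// Instrumentation is written once against the provider; which exporters
	// are attached is purely configuration.
	meter := provider.Meter("argo-workflows")
	counter, _ := meter.Int64Counter("example_events_total")
	counter.Add(ctx, 1)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```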

Changes to the metrics

Some words have specific meanings in the world of observability. OpenTelemetry metrics explains these terms; make particular note of counter and gauge, which are currently used incorrectly within workflows. Counters MUST only ever go up.

A counter is something that only ever increases (unless the controller is restarted, in which case it resets to 0, which the querying tool understands and handles correctly). You can monitor changes with "increase"-style functions over your choice of time period. Gauges can be constructed from counters, given both a count of incoming events and a count of outgoing events.
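
For example (a sketch with hypothetical metric names, not the proposed metric set), the controller side would only ever increment counters, and a "currently in flight" gauge would be derived at query time from the difference:

```go
// Sketch only: hypothetical counter names, not the proposed metric set.
package metricsdemo

import (
	"context"

	"go.opentelemetry.io/otel/metric"
)

var (
	workflowsStarted  metric.Int64Counter
	workflowsFinished metric.Int64Counter
)

func initCounters(meter metric.Meter) (err error) {
	if workflowsStarted, err = meter.Int64Counter("workflows_started_total"); err != nil {
		return err
	}
	workflowsFinished, err = meter.Int64Counter("workflows_finished_total")
	return err
}

// Counters only ever go up; a controller restart resets them to 0, which
// increase()/rate()-style query functions handle. A "workflows in flight"
// gauge is reconstructed at query time as started minus finished.
func onWorkflowStarted(ctx context.Context)  { workflowsStarted.Add(ctx, 1) }
func onWorkflowFinished(ctx context.Context) { workflowsFinished.Add(ctx, 1) }
```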

This is being done with a view to running multiple active controllers in future (horizontal controller scaling, see Horizontally Scalable Controller). The existing metrics are hard to use in this context.

Existing metrics

| Existing name | Proposed name | User reported problems |
| --- | --- | --- |
| `argo_pod_missing` | `argo_workflows_pod_missing_count` | Match the prefix and change this to a counter of incidents. As a gauge you cannot easily tell what happened on workflow controller restart. |
| `argo_workflows_count` | `argo_workflows_gauge` | This emits the number of workflow resources that are in the cluster. If the user makes use of garbage collection and/or archiving, the workflows will no longer be counted (and the 'count' will go down). |
| new | `argo_workflows_total_count` | Count workflows as they enter each phase (pending, running, errored, succeeded, failed, removed) as a true counter. |
| `argo_workflows_error_count` | no change | |
| `argo_workflows_k8s_request_total` | no change | |
| `argo_workflows_operation_duration_seconds` | no change | |
| `argo_workflows_pods_count` | `argo_workflows_pods_gauge` | It's a gauge. |
| new | `argo_workflows_pods_total_count` | Same arguments as for `argo_workflows_total_count`. |
| `argo_workflows_queue_adds_count` | no change | |
| `argo_workflows_queue_depth_count` | `argo_workflows_queue_depth_gauge` | It is a gauge. |
| `argo_workflows_queue_latency` | no change | |
| `argo_workflows_workers_busy` | no change | |
| `argo_workflows_workflow_condition` | no change | |
| `argo_workflows_workflows_processed_count` | remove | This has never functioned and is always 0. |
| `log_messages` | `argo_workflows_log_messages` | Document its existence. |

argo_workflows_total_count details

This would allow users to answer questions such as:

For a given time period, how many workflows were executed? How many passed? How many failed? How many errored?

New metrics

I know some of the items above are new, but they are related to existing metrics so are documented alongside them.

| Name | Other changes and notes |
| --- | --- |
| `argo_workflows_controller_build_info` | Would have labels for the workflow-controller version, OS, architecture, Go version, and git SHA. |
| `argo_workflows_cronworkflows_triggered_total` | Would have labels of name, namespace and status, and be incremented for each of the phases the workflow goes through. |
| `argo_workflows_workflowtemplate_triggered_total` | Would have labels of name, namespace and status, and be incremented for each of the phases the workflow goes through. Would only count top-level workflowTemplateRef style running. |
| `argo_workflows_clusterworkflowtemplate_triggered_total` | As previous, but for ClusterWorkflowTemplates (and obviously no namespace label). |
| `argo_workflows_workflowtemplate_runtime` | Would have labels of name and namespace and count the same as the previous. Histogram of time between trigger and completion. |
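
For the build-info metric, one common pattern (sketched here against the otel Go metric API; label names are illustrative rather than final) is a constant observable gauge of value 1 whose labels carry the interesting information:

```go
// Sketch: a constant value of 1 whose labels carry the information.
// Label names here are illustrative, not the final schema.
package metricsdemo

import (
	"context"
	"runtime"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

func registerBuildInfo(meter metric.Meter, version, gitSHA string) error {
	gauge, err := meter.Int64ObservableGauge("argo_workflows_controller_build_info")
	if err != nil {
		return err
	}
	_, err = meter.RegisterCallback(func(_ context.Context, o metric.Observer) error {
		o.ObserveInt64(gauge, 1, metric.WithAttributes(
			attribute.String("version", version),
			attribute.String("git_sha", gitSHA),
			attribute.String("go_version", runtime.Version()),
			attribute.String("goos", runtime.GOOS),
			attribute.String("goarch", runtime.GOARCH),
		))
		return nil
	}, gauge)
	return err
}
```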

argo_workflows_workflowtemplate_used_total as a new metric

I'd like to count usage (which would not include status) of each template, wherever it is used. I'm not sure that there aren't hurdles to doing this correctly, so I'll put this as a maybe.

originName enhancements

If we have the originName argument sorted out (separate proposal) we could also have similar metrics as the workflowtemplate ones for originName (total by status and runtime histogram).

  • How many of my given originName is running right now?
  • What was the last start time of originName?
  • Did the last originName run fail?
  • Alert if it's been more than x days since the last originName workflow ran successfully

Dashboard

The Grafana dashboard will be updated to use these new metrics.

Couldn't we migrate to otel separately from changing metrics?

We could, but whilst we're in there I think it's better to fix both sides of this at the same time, as there will be a fair amount of churn when doing either.

This is a breaking change

This change will still allow you to collect metrics using Prometheus scraping, and most things remain similar.

The metrics which are currently wrongly named are being deliberately broken and will no longer be emitted under their old names. They will still be available under new names, so with minimal updating effort you can just change any graphs to use the new names. This isn't the recommended course of action, though: using the new metrics should give a better picture of the state of your cluster and the workflows it contains.

Leaving them with their old names means extra configuration flags and code complication for quite minimal benefit.

There will also no longer be a separate Prometheus scrape endpoint for metrics and telemetry; having two doesn't make much sense anyway. The config for telemetry will be removed.

@Joibel Joibel added type/feature Feature request area/controller Controller issues, panics labels Jan 30, 2024
@Joibel Joibel self-assigned this Jan 30, 2024
@Joibel Joibel added area/metrics prioritized-review For members of the Sustainability Effort labels Jan 30, 2024
@agilgur5

agilgur5 commented Feb 4, 2024

I do not intend to implement tracing as part of this proposal.

For reference, we do have a separate feature request for that: #12077

The metric which are currently wrongly named are being deliberately broken

I am all for following standards and existing conventions. So long as we're explicit in the release notes, let's break those to fix them. As their current names are confusing / incorrect, I agree that there isn't much purpose in leaving them as-is for backward-compat. The faster we break broken things (to fix them), the better, as users have had less time to rely on or get used to them.

There will also not be a separate prometheus scrape point for metrics and telemetry. This doesn't make any sense anyway. The config for telemetry will be removed.

I was wondering why there were two different endpoints; were these just the same data but differently formatted?
This would simplify a number of things for operators and also for the Helm chart, b/c it is pretty confusing rn IMO

@agilgur5

agilgur5 commented Feb 4, 2024

Potentially related (I haven't root caused it yet): #12206 had remaining issues with metrics as well. We've fixed the root cause of the issue, but the metrics are still confusing. If we can roll those fixes into this, that'd be great, but they might require separate fixes.

@agilgur5

agilgur5 commented Feb 6, 2024

I was wondering why there were two different endpoints; were these just the same data but differently formatted?
This would simplify a number of things for operators and also for the Helm chart, b/c it is pretty confusing rn IMO

Chatted about this during today's Contributor Meeting.
They are actually both Prometheus-style right now, but the /telemetry endpoint is for Golang runtime metrics (e.g. number of goroutines, etc.) and the /metrics endpoint is for Argo's own metrics (listed above).
I did find that part of the codebase in metrics/server.go. I think I was reading through metrics/metrics.go in the past, which does not distinguish between them.

@tvandinther

These changes look good. They will simplify my approach to metrics and reduce the reliance on custom template metrics.

There might be some gaps in what is possible, though. To clarify, the main information I want to build a dashboard on is:

  • How many failures of a particular workflow (by template name) have occurred (and implicitly when)
  • How long was the workflow and each step in each of the various states (by virtue of the time they entered and exited those states)
  • How many resources did the workflow and steps use? This one is likely best answered with some kind of join against container metrics, so having a label to join on would be important. By extension I could then also answer the question: what proportion of the container resource requests was used, so that I can optimise them?

I think some of this is possible with the proposal; please correct me where I am wrong and it is just a matter of query writing. For example, regarding workflow durations at the template/step level as well as the workflow level: could a query be crafted for this using the right labels on argo_workflows_total_count, perhaps by taking the time delta between certain phase changes? Or does this make sense to be given its own metric?

In any case, these changes as described in the proposal will certainly be an improvement.

@Joibel
Member Author

Joibel commented Apr 5, 2024

Responding to @tvandinther:

Metric time resolution will depend upon collection frequency:

How many failures of a particular workflow (by template name) have occurred (and implicitly when)

This is collectable (for templates run via workflowTemplateRef) from argo_workflows_workflowtemplate_triggered_total via the phase=Failed label.

How long was the workflow and each step in each of the various states (by virtue of the time they entered and exited those states)

Attempting to collect step/task level data isn't possible here. See below on tracing.

How much resources did the workflow and steps use? This one is likely best answered with some kind of join with container metrics so having a label to join on would be important. Then by extension I can also answer the question, what proportion of container resource requests was used so that I can optimise it.

This won't change. You can already add labels to the workflow, or do it via CronWorkflow or event triggering, and then you'll have something to join to your pod. We're not collecting step-level information, to avoid cardinality explosions.

argo_workflows_total_count won't answer the questions you're after at workflow or step granularity, because it's really only a counter replacement for argo_workflows_count, which is actually a gauge.

Tracing

My intention is that once this is in, I'll add OpenTelemetry tracing for workflows. Workflow runs will be a trace, and the individual parts of the workflow will be spans within it.

Using span-to-metrics conversion you can then generate your own metrics from this information. This should be able to collect the detail you're after. I'll do some example span-to-metrics conversions for some of these things, because they're questions I'd like to be able to answer too.
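
As a sketch of the intended shape only (this is explicitly not part of this proposal, and the helper and attribute names below are made up), a workflow run would become one trace with a child span per node, which span-to-metrics processing in a collector could then turn into per-step duration metrics:

```go
// Sketch: one trace per workflow run, one child span per node.
// traceWorkflow and its attribute names are hypothetical.
package tracingdemo

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

func traceWorkflow(ctx context.Context, workflowName string, nodeNames []string, runNode func(context.Context, string)) {
	tracer := otel.Tracer("argo-workflows")

	// The whole workflow run is one trace, rooted in this span.
	ctx, wfSpan := tracer.Start(ctx, "workflow",
		trace.WithAttributes(attribute.String("workflow.name", workflowName)))
	defer wfSpan.End()

	// Each node (step/task) becomes a child span; its duration is the span
	// length, which span-to-metrics processing can turn into a histogram.
	for _, node := range nodeNames {
		nodeCtx, nodeSpan := tracer.Start(ctx, node)
		runNode(nodeCtx, node)
		nodeSpan.End()
	}
}
```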

@bok11

bok11 commented May 13, 2024

I have a few use cases I would like to see supported:

  1. Seeing the status of the individual DAG flows
    Say I have a workflow defined as:
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: dummy
spec:
  serviceAccountName: pod-sa
  templates:
    - name: extract
      inputs: {}
      outputs: {}
      dag:
        tasks:
          - name: etl-source1
            arguments:
              parameters: []
            templateRef:
              name: work
              template: work
          - name: etl-source2
            arguments:
              parameters: []
            templateRef:
              name: work
              template: work
  entrypoint: extract
  outputs: {}
  metrics:
      prometheus:
            - name: job_last_success_timestamp
              labels:
                - key: job
                  value: dummy
                - key: job_namespace
                  value: '{{workflow.namespace}}'
                - key: job_UID
                  value: '{{workflow.uid}}'
              help: Unix timestamp of last successful job completion
              when: '{{status}} == Succeeded'
              gauge:
                value: '{{=workflow.creationTimestamp.s}}'
            - name: job_last_failure_timestamp
              labels:
                - key: job
                  value: dummy
                - key: job_namespace
                  value: '{{workflow.namespace}}'
                - key: job_UID
                  value: '{{workflow.uid}}'
              help: Unix timestamp of last failed job
              when: '{{status}} == Error'
              gauge:
                value: '{{=workflow.creationTimestamp.s}}'

Here I can see the timestamp of the last failed or succeeded run, but would like to also see which steps in the DAG failed.
My goal is: in the event some steps of the DAG failed, I would like to still know when the last successful run of each etl-source was.

  2. Custom metrics
    It would be cool to get information from the individual steps somehow.
    My suggestion is to allow parsing of JSON objects.
    Let us say I have a container that outputs to stdout:
{
    "entitiesProcessed": 2000,
    "entitiesFailed": 0,
    "entitiesSkipped": 0
}

Now I would like to use these like this:

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: dummy
spec:
  serviceAccountName: pod-sa
  templates:
    - name: extract
      inputs: {}
      outputs: {}
      dag:
        tasks:
          - name: etl-source1
            arguments:
              parameters: []
            templateRef:
              name: work
              template: work
  metrics:
    prometheus:
      - name: entities_processed
        labels:
          - key: job
            value: dummy
          - key: job_namespace
            value: '{{workflow.namespace}}'
          - key: job_UID
            value: '{{workflow.uid}}'
        help: Number of entities processed by the last run
        when: '{{status}} == Succeeded'
        gauge:
          value: '{{=workflow.DAG.etl-source1.output.entitiesProcessed}}'

@menzenski
Contributor

menzenski commented May 21, 2024

Adding another vote of support for this proposal - we mostly run Argo Workflows via CronWorkflow resources that invoke WorkflowTemplate resources. The proposed new metrics argo_workflows_cronworkflows_triggered_total, argo_workflows_workflowtemplate_triggered_total, and argo_workflows_workflowtemplate_runtime will be extremely valuable for us.

If we have the originName argument sorted out (separate proposal)

I couldn't find this originName proposal, but this also sounds like it's something we'd be very interested in. (I did find #10745 ).

@agilgur5 agilgur5 removed the prioritized-review For members of the Sustainability Effort label Jun 16, 2024
@fstaudt

fstaudt commented Jun 17, 2024

Hello @Joibel,

We would also be very interested in this feature.

When reading the proposal, I still wonder which aggregation temporality will be used for OpenTelemetry metrics (sums or histograms): cumulative or delta?

Some backends (e.g. Prometheus) only support cumulative temporality, while others (e.g. Dynatrace) only support delta metrics.
It is possible to configure an OpenTelemetry collector between the Argo workflow controller and the backend to convert cumulative metrics into delta metrics, but this has some limits (e.g. when the controller is restarted or scaled up).

Which temporality did you have in mind for this proposal?
Could you consider supporting both temporalities and allowing users to select the one they need via controller configuration?

@Joibel
Member Author

Joibel commented Jun 17, 2024

@fstaudt - I hadn't considered that anyone would want anything other than the default in the otel SDK, which is cumulative.

The intended usage is with a collector between the controller and the data sink.

I may implement it as part of the initial commit, but I am wary of feature creep, so may wait to do it in a later PR. It should be relatively easy to implement either way.
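
For reference, a sketch of how delta temporality could be selected on the OTLP exporter using the otel Go SDK's standard options (this is not Argo's configuration surface; the helper function below is hypothetical); the Prometheus path stays cumulative regardless:

```go
// Sketch: always-delta selector for the OTLP exporter. A production selector
// would likely keep cumulative temporality for up/down counters;
// newDeltaOTLPReader is a hypothetical helper, not Argo configuration.
package metricsdemo

import (
	"context"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/metric/metricdata"
)

func newDeltaOTLPReader(ctx context.Context) (sdkmetric.Reader, error) {
	exp, err := otlpmetricgrpc.New(ctx,
		otlpmetricgrpc.WithTemporalitySelector(
			func(sdkmetric.InstrumentKind) metricdata.Temporality {
				return metricdata.DeltaTemporality
			}),
	)
	if err != nil {
		return nil, err
	}
	// The Prometheus reader is unaffected: Prometheus is cumulative by definition.
	return sdkmetric.NewPeriodicReader(exp), nil
}
```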

Joibel added a commit that referenced this issue Jun 28, 2024
Led by #12589.

As discussed in #12589 OpenTelemetry is the future of
observability. This PR changes the underlying codebase to collect
metrics using the otel libraries, whilst attempting to retain
compatibility with prometheus scraping where this makes sense.

It helps lay the groundwork for adding workflow tracing using otel.

This PR amends and extends the built-in metrics to be more useful and
correctly named.

This PR does not attempt to do anything else with otel, nor does it
attempt to change custom metrics.

Mostly removed prometheus libraries, and replaced with otel libraries.

Allows use of both prometheus `/metrics` scraping and opentelemetry
protocol transmission.

Extends the workqueue metrics to include all of them exposed by
[client-go/workqueue](https://github.com/kubernetes/client-go/blob/babfe96631905312244f34500d2fd1a8a10b186c/util/workqueue/metrics.go#L173)

Removed `argo_workflows_workflows_processed_count` as a
non-functioning metric.

`originName` enhancements proposed in #12589 were not added, nor
`argo_workflows_workflowtemplate_used_total`

Note to reviewers: this is part of a stack of reviews for metrics
changes. Please don't merge until the rest of the stack is also ready.

Signed-off-by: Alan Clucas <[email protected]>
Joibel added a commit that referenced this issue Jun 28, 2024
Implemented because of #12589 (comment)

Some backends (e.g. prometheus) only support cumulative temporality
while others (e.g. dynatrace) only support delta metrics.

It is possible to configure an opentelemetry collector between argo
workflow controller and the backend to convert cumulative metrics into
delta metrics but it has some limits (e.g. when controller is
restarted or when it is scaled up).

This commit enables the choice of temporality for OpenTelemetry

Temporality is always cumulative by definition in Prometheus

Note to reviewers: this is part of a stack of reviews for metrics
changes. Please don't merge until the rest of the stack is also ready.

Signed-off-by: Alan Clucas <[email protected]>
Joibel added a commit that referenced this issue Jun 28, 2024
One of the metrics #12589.

Add new, static metric which contains labels from the product version
fields.

This is common good practice, and records which version of the controller
was running at a given point in time.

This can be useful when diagnosing issues that occurred, and also
shows the progress of a version rollout when that happens.

Note to reviewers: this is part of a stack of reviews for metrics
changes. Please don't merge until the rest of the stack is also ready.

Signed-off-by: Alan Clucas <[email protected]>
Joibel added a commit that referenced this issue Jun 28, 2024
From #12589

This is a new metric counting how many pods have gone into each pod
phase as observed by the controller.

This is like pods_gauge, but as a counter rather than a gauge.

The gauge is useful for telling you what is happening right now in
the cluster, but is not useful for long-term statistics such as "How
many pods have workflows run?", because it may never report some pods at
all. This counter can answer that question.

Note to reviewers: this is part of a stack of reviews for metrics
changes. Please don't merge until the rest of the stack is also ready.

Signed-off-by: Alan Clucas <[email protected]>
Joibel added a commit that referenced this issue Aug 14, 2024
From #12589.

New metric `total_count` which is like the old `count` metric and the
new `gauge` metric, but a counter, not a gauge. The gauge shows a
snapshot of what is happening right now in the cluster; the counter
can answer questions like how many `Failed` workflows there have been
in the last 24 hours.

Two further metrics for counting uses of WorkflowTemplates via
workflowTemplateRef only. These store the name of the WorkflowTemplate
or ClusterWorkflowTemplate if the `cluster_scope` label is true, and
the namespace where it is used. `workflowtemplate_triggered_total`
counts the number of uses. `workflowtemplate_runtime` records how long
each phase the workflow running the template spent in seconds.

Note to reviewers: this is part of a stack of reviews for metrics
changes. Please don't merge until the rest of the stack is also ready.

Signed-off-by: Alan Clucas <[email protected]>
Joibel added a commit that referenced this issue Aug 15, 2024
From #12589.

A new metric which counts how many times each cron workflow has
triggered. A simple enough counter which can be checked against
expectations for the cron.

Note to reviewers: this is part of a stack of reviews for metrics
changes. Please don't merge until the rest of the stack is also ready.

Signed-off-by: Alan Clucas <[email protected]>
@agilgur5 agilgur5 added this to the v3.6.0 milestone Aug 19, 2024
@Joibel
Member Author

Joibel commented Sep 12, 2024

This is merged as originally proposed.
