proposal: metrics overhaul including opentelemetry #12589

Closed
Joibel opened this issue Jan 30, 2024 · 10 comments

@Joibel
Member

Joibel commented Jan 30, 2024

Proposal for improving workflows metrics

This proposal is an overhaul of workflow metrics to bring them up to date, correct them, and make sure they are useful. It is intended to be a breaking change requiring users to update their use of metrics.

OpenTelemetry

I'd like to move to the OpenTelemetry (otel) libraries for collecting metrics. They are mature, and the design of their interfaces encourages good practices.

We'd still be able to expose the resulting metrics on a Prometheus-compatible /metrics endpoint, scrapable by any Prometheus-compatible scraper.

It would also allow us to emit the same metrics using the OpenTelemetry protocol, with the benefits that provides, and later to emit tracing using the same protocol and libraries. I do not intend to implement tracing as part of this proposal.

Each transmission protocol would be separately configured. Data collection would be identical in the code no matter which protocols are in use.
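
As a rough illustration (a minimal sketch, not the actual controller code; the meter name, metric name, and port below are made up), the otel Go SDK lets a single MeterProvider feed both a Prometheus scrape endpoint and an OTLP exporter, so the instrumentation calls stay identical whichever protocols are enabled:

```go
// Minimal sketch: one MeterProvider, two readers. Metric and meter names,
// the OTLP destination (taken from the SDK's default environment variables)
// and the listen port are all illustrative.
package main

import (
	"context"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	otelprom "go.opentelemetry.io/otel/exporters/prometheus"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	ctx := context.Background()

	// Reader 1: Prometheus exporter, registered with the default registerer,
	// so the metrics can be scraped from /metrics.
	promExporter, err := otelprom.New()
	if err != nil {
		log.Fatal(err)
	}

	// Reader 2: OTLP over gRPC, pushed periodically to a collector.
	otlpExporter, err := otlpmetricgrpc.New(ctx)
	if err != nil {
		log.Fatal(err)
	}

	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(promExporter),
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(otlpExporter)),
	)
	defer provider.Shutdown(ctx)

	// Instrumentation is written once against the provider; which exporters
	// are attached is purely configuration.
	meter := provider.Meter("argo-workflows")
	counter, _ := meter.Int64Counter("example_events_total")
	counter.Add(ctx, 1)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```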

Changes to the metrics

Some words have specific meanings in the world of observability. OpenTelemetry metrics explains these terms; make particular note of counter and gauge, which are currently used incorrectly within workflows. Counters MUST only ever go up.

A counter is something that only ever increases (unless the controller is restarted, in which case it resets to 0, which the querying tool understands and handles correctly). You can monitor changes with "increase"-style functions over your choice of time period. Gauges can be constructed from counters, given both a count of incoming events and a count of outgoing events.
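
For example (a sketch with hypothetical metric names, not the proposed metric set), the controller side would only ever increment counters, and a "currently in flight" gauge would be derived at query time from the difference:

```go
// Sketch only: hypothetical counter names, not the proposed metric set.
package metricsdemo

import (
	"context"

	"go.opentelemetry.io/otel/metric"
)

var (
	workflowsStarted  metric.Int64Counter
	workflowsFinished metric.Int64Counter
)

func initCounters(meter metric.Meter) (err error) {
	if workflowsStarted, err = meter.Int64Counter("workflows_started_total"); err != nil {
		return err
	}
	workflowsFinished, err = meter.Int64Counter("workflows_finished_total")
	return err
}

// Counters only ever go up; a controller restart resets them to 0, which
// increase()/rate()-style query functions handle. A "workflows in flight"
// gauge is reconstructed at query time as started minus finished.
func onWorkflowStarted(ctx context.Context)  { workflowsStarted.Add(ctx, 1) }
func onWorkflowFinished(ctx context.Context) { workflowsFinished.Add(ctx, 1) }
```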

This is being done with a view to running multiple active controllers in future (horizontal controller scaling, see Horizontally Scalable Controller). The existing metrics are hard to use in this context.

Existing metrics

| Existing name | Proposed name | User reported problems |
| --- | --- | --- |
| `argo_pod_missing` | `argo_workflows_pod_missing_count` | Match the prefix and change this to a counter of incidents. As a gauge you cannot easily tell what happened on workflow controller restart. |
| `argo_workflows_count` | `argo_workflows_gauge` | This emits the number of workflow resources that are in the cluster. If the user makes use of garbage collection and/or archiving, the workflows will no longer be counted (and the 'count' will go down). |
| new | `argo_workflows_total_count` | Count workflows as they enter each phase (pending, running, errored, succeeded, failed, removed) as a true counter. |
| `argo_workflows_error_count` | no change | |
| `argo_workflows_k8s_request_total` | no change | |
| `argo_workflows_operation_duration_seconds` | no change | |
| `argo_workflows_pods_count` | `argo_workflows_pods_gauge` | It's a gauge. |
| new | `argo_workflows_pods_total_count` | Same arguments as for `argo_workflows_total_count`. |
| `argo_workflows_queue_adds_count` | no change | |
| `argo_workflows_queue_depth_count` | `argo_workflows_queue_depth_gauge` | It is a gauge. |
| `argo_workflows_queue_latency` | no change | |
| `argo_workflows_workers_busy` | no change | |
| `argo_workflows_workflow_condition` | no change | |
| `argo_workflows_workflows_processed_count` | remove | This has never functioned and is always 0. |
| `log_messages` | `argo_workflows_log_messages` | Document its existence. |

argo_workflows_total_count details

This would allow users to answer questions such as:

For a given time period, how many workflows were executed? How many passed? How many failed? How many errored?

New metrics

I know some of the items above are new, but they are related to existing metrics so are documented alongside them.

| Name | Other changes and notes |
| --- | --- |
| `argo_workflows_controller_build_info` | Would have labels for the workflow-controller version, OS, architecture, Go version, and git SHA. |
| `argo_workflows_cronworkflows_triggered_total` | Would have labels of name, namespace and status, and be incremented for each of the phases the workflow goes through. |
| `argo_workflows_workflowtemplate_triggered_total` | Would have labels of name, namespace and status, and be incremented for each of the phases the workflow goes through. Would only count top-level workflowTemplateRef style running. |
| `argo_workflows_clusterworkflowtemplate_triggered_total` | As previous, but for ClusterWorkflowTemplates (and obviously no namespace label). |
| `argo_workflows_workflowtemplate_runtime` | Would have labels of name and namespace and count the same as the previous. Histogram of time between trigger and completion. |
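
For the build-info metric, one common pattern (sketched here against the otel Go metric API; label names are illustrative rather than final) is a constant observable gauge of value 1 whose labels carry the interesting information:

```go
// Sketch: a constant value of 1 whose labels carry the information.
// Label names here are illustrative, not the final schema.
package metricsdemo

import (
	"context"
	"runtime"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

func registerBuildInfo(meter metric.Meter, version, gitSHA string) error {
	gauge, err := meter.Int64ObservableGauge("argo_workflows_controller_build_info")
	if err != nil {
		return err
	}
	_, err = meter.RegisterCallback(func(_ context.Context, o metric.Observer) error {
		o.ObserveInt64(gauge, 1, metric.WithAttributes(
			attribute.String("version", version),
			attribute.String("git_sha", gitSHA),
			attribute.String("go_version", runtime.Version()),
			attribute.String("goos", runtime.GOOS),
			attribute.String("goarch", runtime.GOARCH),
		))
		return nil
	}, gauge)
	return err
}
```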

argo_workflows_workflowtemplate_used_total as a new metric

I'd like to count usage (which would not include status) of each template, wherever it is used. I'm not sure that there aren't hurdles to doing this correctly, so I'll put this as a maybe.

originName enhancements

If we have the originName argument sorted out (separate proposal) we could also have similar metrics as the workflowtemplate ones for originName (total by status and runtime histogram).

  • How many of my given originName is running right now?
  • What was the last start time of originName?
  • Did the last originName run fail?
  • Alert if it's been more than x days since the last originName workflow ran successfully

Dashboard

The Grafana dashboard will be updated to use these new metrics.

Couldn't we migrate to otel separately from changing metrics?

We could, but whilst we're in there I think it's better to fix both sides of this at the same time, as there will be a fair amount of churn when doing either.

This is a breaking change

This change will still allow you to collect metrics using Prometheus scraping, and most things remain similar.

The metrics which are currently wrongly named are being deliberately broken and will no longer be emitted under their old names. They will still be available under new names, so with minimal updating effort you can just change any graphs to use the new names. This isn't the recommended course of action, though: using the new metrics should give a better picture of the state of your cluster and the workflows it contains.

Leaving them with their old names means extra configuration flags and code complication for quite minimal benefit.

There will also no longer be a separate Prometheus scrape endpoint for metrics and telemetry; having two doesn't make much sense anyway. The config for telemetry will be removed.

@Joibel Joibel added type/feature Feature request area/controller Controller issues, panics labels Jan 30, 2024
@Joibel Joibel self-assigned this Jan 30, 2024
@Joibel Joibel added area/metrics prioritized-review For members of the Sustainability Effort labels Jan 30, 2024
@agilgur5

agilgur5 commented Feb 4, 2024

I do not intend to implement tracing as part of this proposal.

For reference, we do have a separate feature request for that: #12077

The metric which are currently wrongly named are being deliberately broken

I am all for following standards and existing conventions. So long as we're explicit in the release notes, let's break those to fix them. As their current names are confusing / incorrect, I agree that there isn't much purpose in leaving them as-is for backward-compat. The faster we break broken things (to fix them), the better, as users have had less time to rely on or get used to them.

There will also not be a separate prometheus scrape point for metrics and telemetry. This doesn't make any sense anyway. The config for telemetry will be removed.

I was wondering why there were two different endpoints; were these just the same data but differently formatted?
This would simplify a number of things for operators and also for the Helm chart, b/c it is pretty confusing rn IMO

@agilgur5

agilgur5 commented Feb 4, 2024

Potentially related (I haven't root caused it yet): #12206 had remaining issues with metrics as well. We've fixed the root cause of the issue, but the metrics are still confusing. If we can roll those fixes into this, that'd be great, but they might require separate fixes.

@agilgur5

agilgur5 commented Feb 6, 2024

I was wondering why there were two different endpoints; were these just the same data but differently formatted?
This would simplify a number of things for operators and also for the Helm chart, b/c it is pretty confusing rn IMO

Chatted about this during today's Contributor Meeting.
They are actually both Prometheus-style right now, but the /telemetry endpoint is for Golang runtime metrics (e.g. number of goroutines, etc.) and the /metrics endpoint is for Argo's own metrics (listed above).
I did find that part of the codebase in metrics/server.go. I think I was reading through metrics/metrics.go in the past, which does not distinguish between them.

@tvandinther

These changes look good. They will simplify my approach to metrics and reduce the reliance on custom template metrics.

There might be some gaps in what is possible, though. To clarify, the main information I want to build a dashboard on is:

  • How many failures of a particular workflow (by template name) have occurred (and implicitly when)
  • How long was the workflow and each step in each of the various states (by virtue of the time they entered and exited those states)
  • How many resources did the workflow and steps use? This one is likely best answered with some kind of join against container metrics, so having a label to join on would be important. By extension I could then also answer the question: what proportion of the container resource requests was used, so that I can optimise them?

I think some of this is possible with the proposal; please correct me where I am wrong and it is just a matter of query writing. For example, regarding workflow durations at the template/step level as well as the workflow level: could a query be crafted for this using the right labels on argo_workflows_total_count, perhaps by taking the time delta between certain phase changes? Or does this make sense to be given its own metric?

In any case, these changes as described in the proposal will certainly be an improvement.

@Joibel
Member Author

Joibel commented Apr 5, 2024

Responding to @tvandinther:

Metric time resolution will depend upon collection frequency:

How many failures of a particular workflow (by template name) have occurred (and implicitly when)

This is collectable (for templates run via workflowTemplateRef) from argo_workflows_workflowtemplate_triggered_total via the phase=Failed label.

How long was the workflow and each step in each of the various states (by virtue of the time they entered and exited those states)

Attempting to collect step/task level data isn't possible here. See below on tracing.

How much resources did the workflow and steps use? This one is likely best answered with some kind of join with container metrics so having a label to join on would be important. Then by extension I can also answer the question, what proportion of container resource requests was used so that I can optimise it.

This won't change. You can already add labels to the workflow, or do it via CronWorkflow or event triggering, and then you'll have something to join to your pod. We're not collecting step-level information, to avoid cardinality explosions.

argo_workflows_total_count won't answer the questions you're after at workflow or step granularity, because it's really only a counter replacement for argo_workflows_count, which is actually a gauge.

Tracing

My intention is that once this is in, I'll add OpenTelemetry tracing for workflows. Workflow runs will be a trace, and the individual parts of the workflow will be spans within it.

Using span-to-metrics conversion you can then generate your own metrics from this information. This should be able to collect the detail you're after. I'll do some example span-to-metrics conversions for some of these things, because they're questions I'd like to be able to answer too.
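
As a sketch of the intended shape only (this is explicitly not part of this proposal, and the helper and attribute names below are made up), a workflow run would become one trace with a child span per node, which span-to-metrics processing in a collector could then turn into per-step duration metrics:

```go
// Sketch: one trace per workflow run, one child span per node.
// traceWorkflow and its attribute names are hypothetical.
package tracingdemo

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

func traceWorkflow(ctx context.Context, workflowName string, nodeNames []string, runNode func(context.Context, string)) {
	tracer := otel.Tracer("argo-workflows")

	// The whole workflow run is one trace, rooted in this span.
	ctx, wfSpan := tracer.Start(ctx, "workflow",
		trace.WithAttributes(attribute.String("workflow.name", workflowName)))
	defer wfSpan.End()

	// Each node (step/task) becomes a child span; its duration is the span
	// length, which span-to-metrics processing can turn into a histogram.
	for _, node := range nodeNames {
		nodeCtx, nodeSpan := tracer.Start(ctx, node)
		runNode(nodeCtx, node)
		nodeSpan.End()
	}
}
```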

@bok11

bok11 commented May 13, 2024

I have a few use cases I would like to see supported:

  1. Seeing the status of the individual DAG flows
    Say I have a workflow defined as:
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: dummy
spec:
  serviceAccountName: pod-sa
  templates:
    - name: extract
      inputs: {}
      outputs: {}
      dag:
        tasks:
          - name: etl-source1
            arguments:
              parameters: []
            templateRef:
              name: work
              template: work
          - name: etl-source2
            arguments:
              parameters: []
            templateRef:
              name: work
              template: work
  entrypoint: extract
  outputs: {}
  metrics:
      prometheus:
            - name: job_last_success_timestamp
              labels:
                - key: job
                  value: dummy
                - key: job_namespace
                  value: '{{workflow.namespace}}'
                - key: job_UID
                  value: '{{workflow.uid}}'
              help: Unix timestamp of last successful job completion
              when: '{{status}} == Succeeded'
              gauge:
                value: '{{=workflow.creationTimestamp.s}}'
            - name: job_last_failure_timestamp
              labels:
                - key: job
                  value: dummy
                - key: job_namespace
                  value: '{{workflow.namespace}}'
                - key: job_UID
                  value: '{{workflow.uid}}'
              help: Unix timestamp of last failed job
              when: '{{status}} == Error'
              gauge:
                value: '{{=workflow.creationTimestamp.s}}'

Here I can see the timestamp of the last failed or succeeded run, but would like to also see which steps in the DAG failed.
My goal is: in the event some steps of the DAG failed, I would like to still know when the last successful run of each etl-source was.

  2. Custom metrics
    It would be cool to get information from the individual steps somehow.
    My suggestion is to allow parsing of JSON objects.
    Let us say I have a container that outputs to stdout:
{
    "entitiesProcessed": 2000,
    "entitiesFailed": 0,
    "entitiesSkipped": 0
}

Now I would like to use these like this:

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: dummy
spec:
  serviceAccountName: pod-sa
  templates:
    - name: extract
      inputs: {}
      outputs: {}
      dag:
        tasks:
          - name: etl-source1
            arguments:
              parameters: []
            templateRef:
              name: work
              template: work
  metrics:
    prometheus:
      - name: entities_processed
        labels:
          - key: job
            value: dummy
          - key: job_namespace
            value: '{{workflow.namespace}}'
          - key: job_UID
            value: '{{workflow.uid}}'
        help: Number of entities processed by the last run
        when: '{{status}} == Succeeded'
        gauge:
          value: '{{=workflow.DAG.etl-source1.output.entitiesProcessed}}'

@menzenski
Contributor

menzenski commented May 21, 2024

Adding another vote of support for this proposal - we mostly run Argo Workflows via CronWorkflow resources that invoke WorkflowTemplate resources. The proposed new metrics argo_workflows_cronworkflows_triggered_total, argo_workflows_workflowtemplate_triggered_total, and argo_workflows_workflowtemplate_runtime will be extremely valuable for us.

If we have the originName argument sorted out (separate proposal)

I couldn't find this originName proposal, but this also sounds like it's something we'd be very interested in. (I did find #10745 ).

@agilgur5 agilgur5 removed the prioritized-review For members of the Sustainability Effort label Jun 16, 2024
@fstaudt

fstaudt commented Jun 17, 2024

Hello @Joibel,

We would also be very interested in this feature.

When reading the proposal, I still wonder which aggregation temporality will be used for OpenTelemetry metrics (sums or histograms): cumulative or delta?

Some backends (e.g. Prometheus) only support cumulative temporality, while others (e.g. Dynatrace) only support delta metrics.
It is possible to configure an OpenTelemetry collector between the Argo workflow controller and the backend to convert cumulative metrics into delta metrics, but this has some limits (e.g. when the controller is restarted or scaled up).

Which temporality did you have in mind for this proposal?
Could you consider supporting both temporalities and allowing users to select the one they need via controller configuration?

@Joibel
Member Author

Joibel commented Jun 17, 2024

@fstaudt - I hadn't considered that anyone would want anything other than the default in the otel SDK, which is cumulative.

The intended usage is with a collector between the controller and the data sink.

I may implement it as part of the initial commit, but I am wary of feature creep, so may wait to do it in a later PR. It should be relatively easy to implement either way.
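
For reference, a sketch of how delta temporality could be selected on the OTLP exporter using the otel Go SDK's standard options (this is not Argo's configuration surface; the helper function below is hypothetical); the Prometheus path stays cumulative regardless:

```go
// Sketch: always-delta selector for the OTLP exporter. A production selector
// would likely keep cumulative temporality for up/down counters;
// newDeltaOTLPReader is a hypothetical helper, not Argo configuration.
package metricsdemo

import (
	"context"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/metric/metricdata"
)

func newDeltaOTLPReader(ctx context.Context) (sdkmetric.Reader, error) {
	exp, err := otlpmetricgrpc.New(ctx,
		otlpmetricgrpc.WithTemporalitySelector(
			func(sdkmetric.InstrumentKind) metricdata.Temporality {
				return metricdata.DeltaTemporality
			}),
	)
	if err != nil {
		return nil, err
	}
	// The Prometheus reader is unaffected: Prometheus is cumulative by definition.
	return sdkmetric.NewPeriodicReader(exp), nil
}
```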

Joibel added a commit that referenced this issue Jun 28, 2024
Led by #12589.

As discussed in #12589 OpenTelemetry is the future of
observability. This PR changes the underlying codebase to collect
metrics using the otel libraries, whilst attempting to retain
compatibility with prometheus scraping where this makes sense.

It helps lay the groundwork for adding workflow tracing using otel.

This PR amends and extends the built-in metrics to be more useful and
correctly named.

This PR does not attempt to do anything else with otel, nor does it
attempt to change custom metrics.

Mostly removed prometheus libraries, and replaced with otel libraries.

Allows use of both prometheus `/metrics` scraping and opentelemetry
protocol transmission.

Extends the workqueue metrics to include all of them exposed by
[client-go/workqueue](https://github.com/kubernetes/client-go/blob/babfe96631905312244f34500d2fd1a8a10b186c/util/workqueue/metrics.go#L173)

Removed `argo_workflows_workflows_processed_count` as a
non-functioning metric.

`originName` enhancements proposed in #12589 were not added, nor
`argo_workflows_workflowtemplate_used_total`

Note to reviewers: this is part of a stack of reviews for metrics
changes. Please don't merge until the rest of the stack is also ready.

Signed-off-by: Alan Clucas <[email protected]>
Joibel added a commit that referenced this issue Jun 28, 2024
Implemented because of #12589 (comment)

Some backends (e.g. prometheus) only support cumulative temporality
while others (e.g. dynatrace) only support delta metrics.

It is possible to configure an opentelemetry collector between argo
workflow controller and the backend to convert cumulative metrics into
delta metrics but it has some limits (e.g. when controller is
restarted or when it is scaled up).

This commit enables the choice of temporality for OpenTelemetry

Temporality is always cumulative by definition in Prometheus

Note to reviewers: this is part of a stack of reviews for metrics
changes. Please don't merge until the rest of the stack is also ready.

Signed-off-by: Alan Clucas <[email protected]>
Joibel added a commit that referenced this issue Jun 28, 2024
One of the metrics #12589.

Add new, static metric which contains labels from the product version
fields.

This is common good practice, and records which version of the controller
was running at a given point in time.

This can be useful when diagnosing issues that occurred, and also
shows the progress of a version rollout when that happens.

Note to reviewers: this is part of a stack of reviews for metrics
changes. Please don't merge until the rest of the stack is also ready.

Signed-off-by: Alan Clucas <[email protected]>
Joibel added a commit that referenced this issue Jun 28, 2024
From #12589

This is a new metric counting how many pods have gone into each pod
phase as observed by the controller.

This is like pods_gauge, but as a counter rather than a gauge.

The gauge is useful for telling you what is happening right now in
the cluster, but is not useful for long-term statistics such as "How
many pods have workflows run?", because it may never report some pods at
all. This counter can answer that question.

Note to reviewers: this is part of a stack of reviews for metrics
changes. Please don't merge until the rest of the stack is also ready.

Signed-off-by: Alan Clucas <[email protected]>
Joibel added a commit that referenced this issue Aug 14, 2024
From #12589.

New metric `total_count` which is like the old `count` metric and the
new `gauge` metric, but a counter, not a gauge. The gauge shows a
snapshot of what is happening right now in the cluster; the counter
can answer questions like how many `Failed` workflows there have been
in the last 24 hours.

Two further metrics for counting uses of WorkflowTemplates via
workflowTemplateRef only. These store the name of the WorkflowTemplate
or ClusterWorkflowTemplate if the `cluster_scope` label is true, and
the namespace where it is used. `workflowtemplate_triggered_total`
counts the number of uses. `workflowtemplate_runtime` records how long
each phase the workflow running the template spent in seconds.

Note to reviewers: this is part of a stack of reviews for metrics
changes. Please don't merge until the rest of the stack is also ready.

Signed-off-by: Alan Clucas <[email protected]>
Joibel added a commit that referenced this issue Aug 15, 2024
From #12589.

A new metric which counts how many times each cron workflow has
triggered. A simple enough counter which can be checked against
expectations for the cron.

Note to reviewers: this is part of a stack of reviews for metrics
changes. Please don't merge until the rest of the stack is also ready.

Signed-off-by: Alan Clucas <[email protected]>
@agilgur5 agilgur5 added this to the v3.6.0 milestone Aug 19, 2024
@Joibel
Member Author

Joibel commented Sep 12, 2024

This is merged as originally proposed.
