proposal: metrics overhaul including opentelemetry #12589
Comments
For reference, we do have a separate feature request for that: #12077
I am all for following standards and existing conventions. So long as we're explicit in the release notes, let's break those to fix them. As their current names are confusing / incorrect, I agree that there isn't much purpose in leaving them as-is for backward-compat. The faster we break broken things (to fix them), the better, as users have had less time to rely on or get used to them.
I was wondering why there were two different endpoints; were these just the same data but differently formatted?
Potentially related (I haven't root caused it yet): #12206 had remaining issues with metrics as well. We've fixed the root cause of the issue, but the metrics are still confusing. If we can roll those fixes into this, that'd be great, but they might require separate fixes.
Chatted about this during today's Contributor Meeting.
These changes look good. They will simplify my approach to metrics and reduce the reliance on custom template metrics, although there may be some gaps in what's possible. To clarify, the main information I want to build a dashboard on is:
I think some of this is possible with the proposal; please correct me where I am wrong or where it is just a matter of query writing. For example, regarding workflow durations at the template/step level as well as the workflow level: I wonder whether this is something that could be answered by crafting a query with the right labels. In any case, these changes as described in the proposal will certainly be an improvement.
Responding to @tvandinther: Metric time resolution will depend upon collection frequency:
This is collectable (for templates run via workflowTemplateRef) from
Attempting to collect step/task level data isn't possible here. See below on tracing.
This won't change. You can already label the workflow, or do it via cronworkflow or event triggering, and then you'll have something to join to your pod. We're not collecting step-level information to avoid cardinality explosions.
Tracing: My intention is that once this is in, I'll add opentelemetry tracing for workflows. Workflow runs will be a trace, and the individual parts of the workflow will be spans within it. Using span-to-metrics conversion you can then generate your own metrics from this information. This should be able to collect the detail you're after. I'll do some example span-to-metrics conversions for some of these things, because they're questions I'd like to be able to answer also.
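To make the span-to-metrics idea above more concrete, here is a rough, illustrative sketch of an OpenTelemetry Collector configuration using the collector-contrib spanmetrics connector to derive duration metrics from workflow spans. Nothing here is part of the proposal itself; the assumption that the controller exports spans over OTLP, and all endpoints and ports, are placeholders.
# Illustrative sketch only: derive metrics from workflow spans with the
# collector-contrib `spanmetrics` connector. Endpoints/ports are placeholders.
receivers:
  otlp:
    protocols:
      grpc: {}              # the controller would export spans here (assumption)
connectors:
  spanmetrics: {}           # turns spans into call counts and duration histograms
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"   # derived metrics become scrapable on this port
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics]
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheus]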
I have a few use cases I'd like to see supported:
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
name: dummy
spec:
serviceAccountName: pod-sa
templates:
- name: extract
inputs: {}
outputs: {}
dag:
tasks:
- name: etl-source1
arguments:
parameters: []
templateRef:
name: work
template: work
- name: etl-source2
arguments:
parameters: []
templateRef:
name: work
template: work
entrypoint: extract
outputs: {}
metrics:
prometheus:
- name: job_last_success_timestamp
labels:
- key: job
value: dummy
- key: job_namespace
value: '{{workflow.namespace}}'
- key: job_UID
value: '{{workflow.uid}}'
help: Unix timestamp of last successful job completion
when: '{{status}} == Succeeded'
gauge:
value: '{{=workflow.creationTimestamp.s}}'
- name: job_last_failure_timestamp
labels:
- key: job
value: dummy
- key: job_namespace
value: '{{workflow.namespace}}'
- key: job_UID
value: '{{workflow.uid}}'
help: Unix timestamp of last failed job
when: '{{status}} == Error'
gauge:
value: '{{=workflow.creationTimestamp.s}}'
Here I can see the timestamp of the last failed or succeeded run, but would like to also see which steps in the DAG failed.
The work template produces outputs like:
{
entitiesProcessed: 2000,
entitiesFailed: 0,
entitiesSkipped: 0,
}
Now I'd like to use these like this:
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
name: dummy
spec:
serviceAccountName: pod-sa
templates:
- name: extract
inputs: {}
outputs: {}
dag:
tasks:
- name: etl-source1
arguments:
parameters: []
templateRef:
name: work
template: work
metrics:
prometheus:
- name: entities_processed
labels:
- key: job
value: dummy
- key: job_namespace
value: '{{workflow.namespace}}'
- key: job_UID
value: '{{workflow.uid}}'
help: Unix timestamp of last failed job
when: '{{status}} == Error'
gauge:
value: '{{=workflow.DAG.etl-source1.output.entitiesProcessed}}'
Adding another vote of support for this proposal - we mostly run Argo Workflows via CronWorkflow resources that invoke WorkflowTemplate resources. The proposed new metrics for CronWorkflow and WorkflowTemplate triggering look like a good fit for that setup.
I couldn't find this originName proposal, but this also sounds like it's something we'd be very interested in. (I did find #10745.)
Hello @Joibel, we would also be very interested in this feature. When reading the proposal, I still wonder which aggregation temporality will be used for opentelemetry metrics (sums or histograms): cumulative or delta? Some backends (e.g. Prometheus) only support cumulative temporality while others (e.g. Dynatrace) only support delta metrics. Which temporality did you have in mind for this proposal?
@fstaudt - I hadn't considered anyone would want anything other than the default in the otel SDK, which is cumulative. The intended usage is with a collector between the controller and the data sink. I may implement it as part of the initial commit, but I am wary of feature creep, so I may wait to do it in a later PR. It should be relatively easy to implement either way.
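For anyone needing delta temporality before it is natively configurable, a collector between the controller and the backend can do the conversion (with the restart and scale-up caveats mentioned later in this thread). The following is only a rough sketch using the collector-contrib cumulativetodelta processor; the endpoints are placeholders, not anything Argo ships.
# Illustrative sketch only: convert cumulative OTLP metrics to delta in a
# collector sitting between the controller and a delta-only backend.
receivers:
  otlp:
    protocols:
      grpc: {}
processors:
  cumulativetodelta: {}       # collector-contrib processor; converts cumulative sums to delta
exporters:
  otlphttp:
    endpoint: "https://delta-only-backend.example.com"   # placeholder backend URL
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [cumulativetodelta]
      exporters: [otlphttp]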
Led by #12589. As discussed in #12589, OpenTelemetry is the future of observability. This PR changes the underlying codebase to collect metrics using the otel libraries, whilst attempting to retain compatibility with prometheus scraping where this makes sense. It helps lay the groundwork for adding workflow tracing using otel. This PR amends and extends the built-in metrics to be more useful and correctly named. This PR does not attempt to do anything else with otel, nor does it attempt to change custom metrics. Mostly removed prometheus libraries and replaced them with otel libraries. Allows use of both prometheus `/metrics` scraping and opentelemetry protocol transmission. Extends the workqueue metrics to include all of those exposed by [client-go/workqueue](https://github.com/kubernetes/client-go/blob/babfe96631905312244f34500d2fd1a8a10b186c/util/workqueue/metrics.go#L173). Removed `argo_workflows_workflows_processed_count` as a non-functioning metric. `originName` enhancements proposed in #12589 were not added, nor was `argo_workflows_workflowtemplate_used_total`. Note to reviewers: this is part of a stack of reviews for metrics changes. Please don't merge until the rest of the stack is also ready. Signed-off-by: Alan Clucas <[email protected]>
Implemented because of #12589 (comment). Some backends (e.g. Prometheus) only support cumulative temporality while others (e.g. Dynatrace) only support delta metrics. It is possible to configure an opentelemetry collector between the Argo Workflows controller and the backend to convert cumulative metrics into delta metrics, but this has some limits (e.g. when the controller is restarted or when it is scaled up). This commit enables the choice of temporality for OpenTelemetry. Temporality is always cumulative by definition in Prometheus. Note to reviewers: this is part of a stack of reviews for metrics changes. Please don't merge until the rest of the stack is also ready. Signed-off-by: Alan Clucas <[email protected]>
One of the metrics from #12589. Add a new, static metric which carries labels from the product version fields. This is common good practice, and records which version of the controller was running at a given point in time. This can be useful when diagnosing issues that occurred, and also shows the progress of a version rollout when that happens. Note to reviewers: this is part of a stack of reviews for metrics changes. Please don't merge until the rest of the stack is also ready. Signed-off-by: Alan Clucas <[email protected]>
From #12589. This is a new metric counting how many pods have gone into each pod phase as observed by the controller. This is like pods_gauge, but as a counter rather than a gauge. The gauge is useful for telling you what is happening right now in the cluster, but is not useful for long-term statistics such as "How many pods have workflows run?" because it may never report some pods at all. This counter can answer that question. Note to reviewers: this is part of a stack of reviews for metrics changes. Please don't merge until the rest of the stack is also ready. Signed-off-by: Alan Clucas <[email protected]>
From #12589. New metric `total_count`, which is like the old `count` metric and the new `gauge` metric, but a counter, not a gauge. The gauge shows a snapshot of what is happening right now in the cluster; the counter can answer questions like how many `Failed` workflows there have been in the last 24 hours. Two further metrics count uses of WorkflowTemplates via workflowTemplateRef only. These store the name of the WorkflowTemplate, or of the ClusterWorkflowTemplate if the `cluster_scope` label is true, and the namespace where it is used. `workflowtemplate_triggered_total` counts the number of uses. `workflowtemplate_runtime` records how long, in seconds, the workflow running the template spent in each phase. Note to reviewers: this is part of a stack of reviews for metrics changes. Please don't merge until the rest of the stack is also ready. Signed-off-by: Alan Clucas <[email protected]>
From #12589. A new metric which counts how many times each cron workflow has triggered. A simple enough counter which can be checked against expectations for the cron. Note to reviewers: this is part of a stack of reviews for metrics changes. Please don't merge until the rest of the stack is also ready. Signed-off-by: Alan Clucas <[email protected]>
This is merged as originally proposed.
Proposal for improving workflows metrics
This proposal is an overhaul of workflow metrics to bring them up to date, correct them, and make sure they are useful. It is intended to be a breaking change requiring users to update their use of metrics.
Opentelemetry
I'd like to move to using opentelemetry (otel) as the libraries for collecting metrics. They are mature and the design of the interfaces encourages good practices.
We'd still be able to expose the resulting metrics as a prometheus-compatible /metrics endpoint, scrapable using any prometheus-compatible scraper. It would also allow us to emit the same metrics using the opentelemetry protocol, with the benefits this provides, and would let us emit tracing using the same protocol and libraries. I do not intend to implement tracing as part of this proposal.
Each transmission protocol would be separately configured. Data collection would be identical in the code no matter which protocols are in use.
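As a purely hypothetical sketch of what that separate configuration could look like for the OTLP side (the prometheus /metrics endpoint would stay as-is), the standard otel SDK environment variables could point the controller at a collector. Whether the final implementation honours these exact variables is up to that PR; the values below are placeholders.
# Hypothetical sketch: standard otel SDK environment variables on the
# workflow-controller Deployment. Support depends on the eventual
# implementation; values are placeholders.
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT      # standard otel SDK variable
    value: http://otel-collector.monitoring.svc:4317
  - name: OTEL_METRIC_EXPORT_INTERVAL      # standard otel SDK variable, in milliseconds
    value: "10000"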
Changes to the metrics
Some words have specific meanings in the world of observability. The Opentelemetry metrics documentation explains these words; make particular note of counter and gauge, which are currently used incorrectly within workflows. Counters MUST only ever go up.
A counter is something that ever-increases (unless the controller is restarted, in which case it resets to 0, which the querying tool will understand and handle correctly). You can monitor changes to a counter using "increase" functions over your choice of time period. Gauges can be constructed from counters given both a count of incoming and outgoing events.
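As an illustration of deriving a gauge from counters, the following Prometheus recording rule sketches a "currently running workflows" estimate from the proposed phase-entry counter. The metric name comes from this proposal, but the status label values are assumptions, and counter resets on controller restart would skew the estimate.
# Illustration only: estimate a gauge from phase-entry counters.
# Label names/values are assumptions; counter resets will skew the result.
groups:
  - name: argo-workflows-derived
    rules:
      - record: argo:workflows_running:estimate
        # workflows that entered Running minus those that reached a terminal phase
        expr: >
          sum(argo_workflows_total_count{status="Running"})
          - sum(argo_workflows_total_count{status=~"Succeeded|Failed|Errored"})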
This is being done with the mindset of, in future, running multiple active controllers (horizontal controller scaling; see the Horizontally Scalable Controller proposal). The existing metrics are hard to use in this context.
Existing metrics
- argo_workflows_count: rename to argo_workflows_gauge; add argo_workflows_total_count counting workflows entering each phase (pending, running, errored, succeeded, failed, removed) as a true counter
- argo_workflows_error_count
- argo_workflows_k8s_request_total
- argo_workflows_operation_duration_seconds
- argo_workflows_pods_count: rename to argo_workflows_pods_gauge; add argo_workflows_pods_total_count as a true counter, similar to argo_workflows_total_count
- argo_workflows_queue_adds_count
- argo_workflows_queue_depth_count: rename to argo_workflows_queue_depth_gauge
- argo_workflows_queue_latency
- argo_workflows_workers_busy
- argo_workflows_workflow_condition
- argo_workflows_workflows_processed_count: remove, as it is non-functioning
- log_messages: rename to argo_workflows_log_messages

argo_workflows_total_count details
This would allow users to answer questions such as:
For a given time period, how many workflows were executed? How many passed? How many failed? How many errored?
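As a sketch of how those questions could be answered, the following Prometheus recording rules use increase() over the proposed counter; the status label values are assumptions based on the phases listed above, not a published schema.
# Illustration only: workflows completed per status over the last 24 hours.
# Status label values are assumptions based on the phases in this proposal.
groups:
  - name: argo-workflows-daily
    rules:
      - record: argo:workflows_succeeded:increase24h
        expr: increase(argo_workflows_total_count{status="Succeeded"}[24h])
      - record: argo:workflows_failed:increase24h
        expr: increase(argo_workflows_total_count{status="Failed"}[24h])
      - record: argo:workflows_errored:increase24h
        expr: increase(argo_workflows_total_count{status="Errored"}[24h])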
New metrics
I know some of the items above are new, but they are related to existing metrics so are documented alongside them.
- argo_workflows_controller_build_info
- argo_workflows_cronworkflows_triggered_total
- argo_workflows_workflowtemplate_triggered_total
- argo_workflows_clusterworkflowtemplate_triggered_total
- argo_workflows_workflowtemplate_runtime
- argo_workflows_workflowtemplate_used_total as a new metric: I'd like to count usage (which would not include status) of each template, wherever it is used. I'm not sure that there aren't hurdles to doing this correctly, so I'll put this as a maybe.
originName enhancements
If we have the originName argument sorted out (separate proposal) we could also have metrics similar to the workflowtemplate ones for originName (total by status and a runtime histogram).
Dashboard
The grafana dashboard will be updated to use these new metrics.
Couldn't we migrate to otel separately from changing metrics?
We could, but I think whilst we're in there it's better to fix both sides of this at the same time as there will be a fair amount of churn when doing either thing.
This is a breaking change
This change will allow you to continue to collect metrics using prometheus scraping, and mostly things will be similar.
The metrics which are currently wrongly named are being deliberately broken and will no longer be emitted under their old names. They will still be available under new names, so with minimal updating effort you can just change any graphs to use the new name. This isn't the recommended course of action, though: using the new metrics should give a better picture of the state of your cluster and the workflows it contains.
Leaving them with their old names means extra configuration flags and code complication for quite minimal benefit.
There will also no longer be a separate prometheus scrape endpoint for metrics and telemetry; this split doesn't make sense anyway. The config for telemetry will be removed.