
MetricsError: unable to compute workflow.status in real-time metrics block with 'when' clause #12693

Open
3 of 4 tasks
jacek-jablonski opened this issue Feb 23, 2024 · 15 comments
Assignees
Labels
area/metrics · area/templating · P3 · solution/workaround · type/bug

Comments

@jacek-jablonski

jacek-jablonski commented Feb 23, 2024

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

Using the workflow.status variable in the Prometheus metrics definition results in an error:

MetricsError: unable to compute 'when' clause for metric 'lovely-whale': Invalid 'when' expression ' == Running': Cannot transition token types from UNKNOWN [<nil>] to COMPARATOR [==], unable to compute 'when' clause for metric 'lovely-whale': Invalid 'when' expression ' == Running': Cannot transition token types from UNKNOWN [<nil>] to COMPARATOR [==], unable to compute 'when' clause for metric 'lovely-whale': Invalid 'when' expression ' == Running': Cannot transition token types from UNKNOWN [<nil>] to COMPARATOR [==]

Version

v3.5.4

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: lovely-whale
  namespace: default
  annotations:
    workflows.argoproj.io/pod-name-format: v2
spec:
  templates:
    - name: argosay
      inputs:
        parameters:
          - name: message
            value: '{{workflow.parameters.message}}'
      outputs: {}
      metadata: {}
      container:
        name: main
        image: argoproj/argosay:v2
        command:
          - /argosay
        args:
          - echo
          - '{{inputs.parameters.message}}'
        resources: {}
  entrypoint: argosay
  arguments:
    parameters:
      - name: message
        value: hello argo
  serviceAccountName: default
  ttlStrategy:
    secondsAfterCompletion: 300
  podGC:
    strategy: OnPodCompletion
  metrics:
    prometheus:
      - name: job_duration
        labels:
          - key: name
            value: lovely-whale
        help: Duration gauge by workflow name
        when: '{{workflow.status}} == Running'
        gauge:
          value: '{{workflow.duration}}'
          realtime: true

Logs from the workflow controller

{"Phase":"","ResourceVersion":"1396711791","level":"info","msg":"Processing workflow","namespace":"default","time":"2024-02-23T12:15:58.455Z","workflow":"lovely-whale"}
{"level":"info","msg":"Task-result reconciliation","namespace":"default","numObjs":0,"time":"2024-02-23T12:15:58.458Z","workflow":"lovely-whale"}
{"level":"error","msg":"unable to compute 'when' clause for metric 'lovely-whale': Invalid 'when' expression ' == Running': Cannot transition token types from UNKNOWN [\u003cnil\u003e] to COMPARATOR [==]","namespace":"default","time":"2024-02-23T12:15:58.459Z","workflow":"lovely-whale"}
{"level":"info","msg":"Updated phase  -\u003e Running","namespace":"default","time":"2024-02-23T12:15:58.459Z","workflow":"lovely-whale"}
{"level":"warning","msg":"Node was nil, will be initialized as type Skipped","namespace":"default","time":"2024-02-23T12:15:58.459Z","workflow":"lovely-whale"}
{"level":"info","msg":"was unable to obtain node for , letting display name to be nodeName","namespace":"default","time":"2024-02-23T12:15:58.459Z","workflow":"lovely-whale"}
{"level":"info","msg":"Pod node lovely-whale initialized Pending","namespace":"default","time":"2024-02-23T12:15:58.459Z","workflow":"lovely-whale"}
{"level":"info","msg":"Created pod: lovely-whale (lovely-whale)","namespace":"default","time":"2024-02-23T12:15:58.482Z","workflow":"lovely-whale"}
{"level":"info","msg":"TaskSet Reconciliation","namespace":"default","time":"2024-02-23T12:15:58.482Z","workflow":"lovely-whale"}
{"level":"info","msg":"reconcileAgentPod","namespace":"default","time":"2024-02-23T12:15:58.482Z","workflow":"lovely-whale"}
{"level":"info","msg":"Workflow update successful","namespace":"default","phase":"Running","resourceVersion":"1396711796","time":"2024-02-23T12:15:58.492Z","workflow":"lovely-whale"}
{"Phase":"Running","ResourceVersion":"1396711796","level":"info","msg":"Processing workflow","namespace":"default","time":"2024-02-23T12:16:08.483Z","workflow":"lovely-whale"}
{"level":"info","msg":"Task-result reconciliation","namespace":"default","numObjs":1,"time":"2024-02-23T12:16:08.483Z","workflow":"lovely-whale"}
{"level":"info","msg":"task-result changed","namespace":"default","nodeID":"lovely-whale","time":"2024-02-23T12:16:08.483Z","workflow":"lovely-whale"}
{"level":"error","msg":"unable to compute 'when' clause for metric 'lovely-whale': Invalid 'when' expression ' == Running': Cannot transition token types from UNKNOWN [\u003cnil\u003e] to COMPARATOR [==]","namespace":"default","time":"2024-02-23T12:16:08.483Z","workflow":"lovely-whale"}
{"level":"info","msg":"node changed","namespace":"default","new.message":"","new.phase":"Succeeded","new.progress":"0/1","nodeID":"lovely-whale","old.message":"","old.phase":"Pending","old.progress":"0/1","time":"2024-02-23T12:16:08.483Z","workflow":"lovely-whale"}
{"level":"info","msg":"TaskSet Reconciliation","namespace":"default","time":"2024-02-23T12:16:08.483Z","workflow":"lovely-whale"}
{"level":"info","msg":"reconcileAgentPod","namespace":"default","time":"2024-02-23T12:16:08.483Z","workflow":"lovely-whale"}
{"level":"info","msg":"Updated phase Running -\u003e Succeeded","namespace":"default","time":"2024-02-23T12:16:08.483Z","workflow":"lovely-whale"}
{"level":"info","msg":"Marking workflow completed","namespace":"default","time":"2024-02-23T12:16:08.483Z","workflow":"lovely-whale"}
{"level":"info","msg":"Marking workflow as pending archiving","namespace":"default","time":"2024-02-23T12:16:08.483Z","workflow":"lovely-whale"}
{"level":"error","msg":"unable to compute 'when' clause for metric 'lovely-whale': Invalid 'when' expression ' == Running': Cannot transition token types from UNKNOWN [\u003cnil\u003e] to COMPARATOR [==]","namespace":"default","time":"2024-02-23T12:16:08.484Z","workflow":"lovely-whale"}
{"action":"deletePod","key":"default/lovely-whale-1340600742-agent/deletePod","level":"info","msg":"cleaning up pod","time":"2024-02-23T12:16:08.489Z"}
{"level":"info","msg":"Workflow update successful","namespace":"default","phase":"Succeeded","resourceVersion":"1396711974","time":"2024-02-23T12:16:08.493Z","workflow":"lovely-whale"}
{"level":"info","msg":"archiving workflow","namespace":"default","time":"2024-02-23T12:16:08.500Z","uid":"2bc6d6f9-2ab3-4a0e-9051-f353bdc39f7f","workflow":"lovely-whale"}
{"level":"info","msg":"Queueing Succeeded workflow default/lovely-whale for delete in 5m0s due to TTL","time":"2024-02-23T12:16:08.526Z"}
{"action":"deletePod","key":"default/lovely-whale/deletePod","level":"info","msg":"cleaning up pod","time":"2024-02-23T12:16:13.501Z"}
{"level":"info","msg":"Queueing Succeeded workflow default/lovely-whale for delete due to max rention(10 workflows)","time":"2024-02-23T12:19:20.461Z"}
{"level":"info","msg":"Deleting garbage collected workflow 'default/lovely-whale'","time":"2024-02-23T12:19:20.461Z"}
{"level":"info","msg":"Successfully request 'default/lovely-whale' to be deleted","time":"2024-02-23T12:19:20.467Z"}
{"level":"info","msg":"Deleting garbage collected workflow 'default/lovely-whale'","time":"2024-02-23T12:21:09.000Z"}
{"level":"info","msg":"Workflow already deleted 'default/lovely-whale'","time":"2024-02-23T12:21:09.004Z"}
@agilgur5

Looks like a follow-up to #9424 (comment)

@agilgur5 agilgur5 changed the title MetricsError: unable to compute workflow.status in metrics block with 'when' clause MetricsError: unable to compute workflow.status in metrics block with 'when' clause Feb 25, 2024
@agilgur5 agilgur5 added the area/templating label Feb 25, 2024
@agilgur5 agilgur5 changed the title MetricsError: unable to compute workflow.status in metrics block with 'when' clause 3.5 MetricsError: unable to compute workflow.status in metrics block with 'when' clause Feb 25, 2024
@agilgur5 agilgur5 added the type/regression and P2 labels Feb 25, 2024
@agilgur5

agilgur5 commented Feb 25, 2024

Hmm this seems to have managed to regress twice despite the tests in #8939 🤔 Those might need adjusting.

The code from that PR still exists as-is, so something else is causing it to regress

@eduardodbr eduardodbr self-assigned this Feb 25, 2024
@jacek-jablonski
Author

This is still relevant in v3.5.5.

@eduardodbr
Member

When a metric is processed, the controller substitutes the variables in name, help, labels, and when with the available parameters. If you configure a metric to be realtime, it will replace the variables while the workflow is still executing. When the when condition {{workflow.status}} == Running is evaluated for the first time, the workflow is still in an unknown phase (which is the empty string ""), so the when condition becomes == Running, which fails to resolve and throws the error Invalid 'when' expression ' == Running'. As you can see here, there are only two variables supported by realtime metrics: workflow.duration and duration, and they are only supported as real-time in the value field.

If you try an example with realtime: false, everything works.

However, in your example you don't need the when condition to create a metric with the duration of the workflow; the gauge simply stops increasing when the workflow finishes.
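
For illustration, the metric from the reproduction could be adjusted in either of these two ways (this goes under spec, as in the reproduction; the second metric's name and its Succeeded condition are made up for the example, not an official recommendation):

  metrics:
    prometheus:
      # Real-time duration gauge: drop the 'when' clause entirely;
      # the gauge stops increasing once the workflow finishes.
      - name: job_duration
        labels:
          - key: name
            value: lovely-whale
        help: Duration gauge by workflow name
        gauge:
          value: '{{workflow.duration}}'
          realtime: true
      # Non-real-time variant: a 'when' clause is fine here, but it is only
      # evaluated after the workflow completes, so compare against a terminal
      # phase rather than Running.
      - name: job_duration_completed
        labels:
          - key: name
            value: lovely-whale
        help: Duration of completed workflows by name
        when: '{{workflow.status}} == Succeeded'
        gauge:
          value: '{{workflow.duration}}'
          realtime: false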

@eduardodbr
Member

eduardodbr commented Mar 2, 2024

I'm removing the regression label as this has always been the controller's behavior for real-time metrics.

Right now, variables used in name, help, labels, and when are substituted as soon as the metric is processed for the first time. This works fine for non-real-time metrics because they are only processed after the Workflow/step/task finishes, but it may lead to errors like this if the metric is real-time. @agilgur5 do you think it is worth rethinking how real-time metrics are processed so that variables in these fields can be updated during execution?

@eduardodbr eduardodbr removed the type/regression label Mar 2, 2024
@Joibel
Member

Joibel commented Mar 2, 2024

As part of implementing #12589, these need to meet the requirements imposed by OpenTelemetry as well.

I'll try and remember to comment properly when I am not on mobile.

@agilgur5

agilgur5 commented Mar 2, 2024

Great root cause analysis here! ❤️

If you try an example with realtime: false everything works.

I'm removing the regression label as this has always been the controller's behavior for real-time metrics.

I didn't even notice the difference between #9424 and this that it's real-time, good catch!

@agilgur5 do you think it is worth rethinking how real-time metrics are processed so that variables in these fields can be updated during execution?

Honestly, I haven't worked with them enough to have a solid opinion. The time of processing is relatively well defined in the docs. Maybe Alan has put some more thought into this?

but may lead to errors like this if the metric is real-time

Since this behavior is well defined right now, at the very least I would think that it would throw a validation error on submission, since workflow.status is not supposed to be available with real-time metrics.

Putting it a different way, if there were a validation error, this wouldn't have popped up as a bug. Instead it may have popped up as, for instance, a feature to add workflow.status to real-time variables.


Some other scattered, preliminary thoughts below:
  • For such a feature (not saying we should implement it per se if there aren't requests for it), I would think status could be supported and that Unknown == Running would then evaluate to false.
  • Or maybe one could specify that certain fields of a metric are real-time while others are not -- although I think this is already achievable by combining metrics with other features like hooks (e.g. move the when clause into a hook which outputs the metric; see the sketch below).
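
To make the hook idea concrete, a very rough sketch (entirely hypothetical: the record-duration template name is made up, and whether workflow.status actually resolves in a template-level when at hook time is exactly the kind of thing that would need verifying):

spec:
  entrypoint: argosay
  hooks:
    exit:
      template: record-duration   # hypothetical exit-hook template
  templates:
    - name: record-duration
      container:
        name: main
        image: argoproj/argosay:v2
        command: [/argosay]
      metrics:
        prometheus:
          - name: job_duration
            help: Duration gauge by workflow name
            labels:
              - key: name
                value: lovely-whale
            # non-real-time: evaluated when this exit-hook template finishes
            when: '{{workflow.status}} == Succeeded'
            gauge:
              value: '{{workflow.duration}}'
              realtime: false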

However, in your example, you don't need the when condition to create a metric with the duration of the workflow, it will stop increasing the gauge when the workflow finishes.

Removing the when entirely, as you suggested, is the simplest option in this case, but there are perhaps other use-cases that might make sense; I'm not sure yet, haven't spent enough cycles on it.

@agilgur5

agilgur5 commented Mar 2, 2024

Since this behavior is well defined right now, at the very least I would think that it would throw a validation error on submission, since workflow.status is not supposed to be available with real-time metrics.

Oh, I just realized the expression here is using the old govaluate syntax, which complicates things a bit. I (and others before me) have long been planning to deprecate and remove it per #9529. The evaluation time was also a goal of that issue.
I had started some work on #7576 (comment) to soft-deprecate govaluate by removing all its references in docs & examples (and replacing them with expr), and was planning to then truly deprecate it by logging warnings in 3.7 in preparation for a 3.8 removal (or something like that timeline-wise).
@eduardodbr if you're interested in that too (templating related), feel free to take some of it on as well; I can help coordinate. I haven't had enough time to dig into templating as deeply as I'd want to -- as I see you've been noticing in the issues, there are lots of little related bugs / features / mismatches in it that would be great to solve if 1 or 2 people could dive in for a bit 🙂

@agilgur5 agilgur5 changed the title 3.5 MetricsError: unable to compute workflow.status in metrics block with 'when' clause MetricsError: unable to compute workflow.status in real-time metrics block with 'when' clause Mar 2, 2024
@agilgur5 agilgur5 added the P3 and solution/workaround labels and removed the P2 label Mar 2, 2024
@Joibel
Member

Joibel commented Mar 4, 2024

Realtime metrics with a when clause don't make a huge amount of sense. You can't expect to sensibly track a number when it is sometimes not present. It would be possible to implement it, but I think it's more sensible to ban it.

The when statement should only really apply to a subset of prometheus metric capabilities, and this is not one of them.

name and help should be fixed:
  • If the name changes, this is weird behaviour.
  • If the help changes, Prometheus will tell you off (the current code), and the OpenTelemetry API fundamentally doesn't provide a way to do this for a given name: you register a name with its help text and it's fixed; you can't re-register. As help isn't stored, there is no way to look at historical help. In short, the name:help binding really should be fixed for life.

The labels and their values can change; this is fine.

Increasing the number of things you can do in real time metrics is really a question of looking at how variable scopes are handled. Getting other kinds of variables from the workflow (imagine that the controller has just restarted) is problematic.

Side note:
I've not looked into it yet, but a lot of invalid Workflows/WorkflowTemplates are accepted by Kubernetes, even really obviously invalid ones. This is really a topic for another thread.
We could ban when if the type is realtime using XValidation (https://book.kubebuilder.io/reference/markers/crd-validation.html), which will only actually work in k8s >=1.27, but as 1.26 is now out of support, everyone has upgraded, right? Right 😆
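
As a rough illustration only (a hypothetical rule; the real field path in the generated CRD schema would need checking), the CEL rule such a marker generates could look something like:

  # hypothetical x-kubernetes-validations entry on the prometheus metric schema
  x-kubernetes-validations:
    - rule: "!(has(self.gauge) && has(self.gauge.realtime) && self.gauge.realtime && has(self.when))"
      message: "'when' is not supported on real-time metrics"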

@agilgur5

agilgur5 commented Mar 6, 2024

I've not looked into it yet, but a lot of invalid Workflows/WorkflowTemplates are accepted by Kubernetes, even really obviously invalid ones. This is really a topic for another thread.

I'm relatively sure this is because most installation paths (including the Helm chart) only use the minimal CRDs and not the full CRDs. And I believe the minimal ones are used because the full ones are too big (see also #11266).

If we don't already have a validating admission webhook, that's something we could use for Argo to deny creation of invalid resources. That can have any custom logic in it, but it is latency sensitive. Notably though, webhook configs may not be installed by users, and so the Controller and Server still have to handle invalid resources.
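
For illustration only (Argo doesn't currently ship one of these; the webhook name, service, and path below are invented), the wiring would look roughly like:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: argo-workflows-validation      # hypothetical name
webhooks:
  - name: workflows.argoproj.io        # hypothetical webhook name
    rules:
      - apiGroups: ["argoproj.io"]
        apiVersions: ["v1alpha1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["workflows", "workflowtemplates", "cronworkflows"]
    clientConfig:
      service:
        name: argo-server              # hypothetical: a service exposing a validation endpoint
        namespace: argo
        path: /api/v1/validate         # hypothetical path
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore              # invalid resources can still slip through if the webhook is down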

But yes definitely a separate thread 😅

@eduardodbr
Member

@eduardodbr if you're interested in that too (templating related), feel free to take some of it on as well; I can help coordinate. I haven't had enough time to dig into templating as deeply as I'd want to -- as I see you've been noticing in the issues, there are lots of little related bugs / features / mismatches in it that would be great to solve if 1 or 2 people could dive in for a bit 🙂

Feel free to tag me whenever you see fit

@eduardodbr
Member

Should we close this?

@agilgur5

agilgur5 commented Mar 8, 2024

As I wrote above, I would think this would throw a validation error, as it uses an unsupported variable, rather than attempting to process it and then failing with a confusing error.

@eduardodbr
Member

I think the hardest part is knowing the list of unsupported variables. The user might want to use workflow.name in a real-time metric, and that seems to be fine because it never changes. However, variables like workflow.status or retries, which keep changing over time, can't be allowed. Is there any other way of validating than having a "list" of unsupported variables? That is hard to maintain.

@agilgur5

agilgur5 commented Mar 8, 2024

Is there any other way of validation other than having a "list" of unsupported variables? that is hard to maintain

Having lists of supported and unsupported variables might actually be useful for validation purposes in general; very declarative & deterministic.

But I was thinking the inverse -- if a variable is not in the expression scope, it can throw a validation error. This issue sounds like the expression scope is broader than it should be for a realtime metric, no?
