
[bug] {{workflow.status}} and {{workflow.failures}} Behavior in KFP v2 dsl.ExitHandler versus v1 #10917

Open
tom-pavz opened this issue Jun 17, 2024 · 14 comments

@tom-pavz

tom-pavz commented Jun 17, 2024

Environment

  • How do you deploy Kubeflow Pipelines (KFP)?
    deployKF
  • KFP version:
    2.1.0
  • KFP SDK version:
    2.7.0

Steps to reproduce

We use dsl.ExitHandler plus Argo Workflows' workflow variables to automatically report failed pipeline runs to Slack. This was working well in KFP v1, but no longer works in v2.

I have a simple hello-world pipeline with the following fail component wrapped in a dsl.ExitHandler whose exit_task is post_msg_to_slack_on_pipeline_fail.

from kfp.dsl import ExitHandler, component, pipeline

@component
def hello_world():
    print("hello world")

@component
def fail():
    import sys

    sys.exit(1)

@component
def post_msg_to_slack_on_pipeline_fail():
    import json

    # In KFP v1, Argo substituted these placeholders at runtime.
    if "{{workflow.status}}" == "Failed":
        json_object = json.loads("{{workflow.failures}}"[1:-1])
        json_formatted_str = json.dumps(json_object, indent=2)
        # send alert to slack

@pipeline
def my_pipeline():
    post_msg_to_slack_on_pipeline_fail_task = post_msg_to_slack_on_pipeline_fail()

    with ExitHandler(exit_task=post_msg_to_slack_on_pipeline_fail_task):
        hello_task = hello_world()
        fail_task = fail()
        fail_task.after(hello_task)
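
For completeness, the run is submitted with the standard v2 client (the host below is an illustrative placeholder, not my actual endpoint):

from kfp.client import Client

client = Client(host="http://localhost:8080")  # illustrative endpoint
client.create_run_from_pipeline_func(my_pipeline, arguments={})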

Starting in KFP v2, at exit_task runtime {{workflow.status}} resolves to Running, and {{workflow.failures}} is not substituted at all (the literal string {{workflow.failures}} is printed instead). It seems that in KFP v2 the failure of the task does not propagate up to the pipeline itself by the time the exit_task runs. In addition, I notice that the pipeline itself shows an "Executed Successfully" status in the UI (see the screenshots of the DAG and sub-DAG below), even though one of its tasks failed, which does not seem correct. I also notice that in the UI the exit-handler-1 sub-DAG is stuck in "Running" status, which also seems incorrect.

Expected result

The expected result, and the behavior we got with KFP v1, was that at exit_task runtime {{workflow.status}} resolved to Failed and {{workflow.failures}} resolved to a JSON document with information about the pipeline's failed tasks, which we could then send to Slack.

Materials and reference

Screenshots of the pipeline and its DAGs in the UI (two screenshots, omitted here).

I also attempted to use dsl.PipelineTaskFinalStatus in a dsl.ExitHandler as detailed here, but when trying to run the pipeline I get the following error:

{"error":"Internal error: compiling DAG \"root\": failed to add implicit deps: task final status not supported yet\nFailed to compile job

I saw this issue about this, but it was never really addressed; I suppose because it was created in the wrong repo.


Impacted by this bug? Give it a 👍.

@rimolive
Member

This version of KFP came from a fork of the distribution, so I'm not sure how much it was in sync with the actual KFP 2.1.0. We also released KFP 2.2.0, which is where we upgraded to Argo 3.4.x (which, in part, is where the issue in question should be resolved). Please test with the latest KFP community release and let us know if the issue is fixed.

@thesuperzapper
Member

@rimolive for reference, the fork is functionally identical to Kubeflow Pipelines 2.1.0; the only changes are about object store authentication for KFP v1.

The first issue raised by @tom-pavz may have been fixed in Kubeflow Pipelines 2.2.0, but it seems unlikely, given that I don't see any changes to the ExitHandler in 2.2.0.

Because @tom-pavz gave a very simple example, it should be straightforward to check on a vanilla Kubeflow Pipelines 2.2.0; perhaps @tom-pavz could even do that.


However, the second issue (using dsl.PipelineTaskFinalStatus in a dsl.ExitHandler) is definitely still not supported in 2.2.0; I can see the exact error on this line.

@tom-pavz
Author

@rimolive @thesuperzapper Thanks for the responses! I am not super confident about how to test an upgrade to 2.2.0 on my cluster without potentially introducing other unrelated issues. However, if no changes were made to the ExitHandler between 2.1.0 and 2.2.0, can't we assume that the strange behavior I reported is still happening?

As for dsl.PipelineTaskFinalStatus, I'm pretty surprised that this is broken in current KFP versions, given that it has been part of the KFP v2 documentation for so long. How has no one else noticed that it is simply not supported at all?

@thesuperzapper
Member

@tom-pavz, regarding testing, I meant deploying just the raw manifests (not deployKF) on a testing cluster (even a local one) to confirm whether the issue is still present.

That said, Kubeflow 1.9.0 is not technically out yet, so there is no way to use KFP 2.2.0 from the manifests (except with the release candidates).

But yes, it is very likely that both issues still exist in KFP 2.2.0.

@roytman
Contributor

roytman commented Jun 19, 2024

Hi @rimolive, we use the latest KFP version (2.2.0) and the latest SDK (2.7.0), and I can confirm that the issue exists. The ExitHandler somehow masks failures, and the KFP UI reports SUCCESS for failed Runs. Therefore, users have to check the internal execution status of each Run.

@tom-pavz
Author

Is the KFP community planning to fix this in a future release, now that it has been confirmed in a couple of ways that it is indeed an issue? It seems like a regression in the ExitHandler functionality from v1 to v2.

Also, I think kubeflow/kubeflow#7422 is definitely a real issue, albeit created in the wrong place. Should another issue be created in this repo for it, or should we use this issue to track both?

@HumairAK
Collaborator

HumairAK commented Aug 1, 2024

/assign HumairAK

@HumairAK
Collaborator

HumairAK commented Aug 14, 2024

Okay, dug into this a bit.

tl;dr: we should try to use Argo Lifecycle Hooks here instead of having the exit_task be another sibling task to the ExitHandler.

Terminology:

  • ExitHandler: consists of a DAG containing a group of tasks (the task group), plus an exit_task
  • exit_task: the task that is invoked after the group of other tasks exits

There seem to be 2 user concerns here:

  1. Incorrect run status reporting for Runs using ExitHandler: the pipeline's overall run status inaccurately reports Succeeded when using an ExitHandler that includes a failing task but a succeeding exit_task.
  2. Unable to detect ExitHandler status in exit_task: there is no way to access the status of the failed/errored tasks in the exit handler's task group. This should be satisfied by PipelineTaskFinalStatus, but that currently does not work in KFP.

I think it makes sense to keep both in one issue, however, as they are very much related.

1. Incorrect run status reporting for Runs using ExitHandler

In KFP v2 we do not use Argo's onExit for the KFP ExitHandler. Based on the OP, it seems we used to in KFP v1. During Argo compilation in KFP v2, using an ExitHandler results in another DAG, with the exit_task as a dependent task of this ExitHandler DAG.

The following illustrates what the exit_task's driver task's depends field looks like:

...
- arguments:
    parameters: ...
  depends: exit-handler-1.Succeeded || exit-handler-1.Skipped || exit-handler-1.Failed || exit-handler-1.Errored
  name: post-msg-driver # exit_task
  template: system-container-driver
...

As you can see, the depends field results in this task executing as soon as the exit-handler DAG finishes, regardless of how it finished. In Argo today, this results in the workflow succeeding if the remaining downstream tasks (post-msg in this case) succeed.

In the future this may be configurable.

There is a workaround suggested in the linked thread; however, it requires us to match the exit status of the exit_task with that of the tasks in the exit handler's task group. I don't think we want to go this route.

Suggested Solutions:

I can see 2 ways we could resolve this.

One is to go back to using onExit. Based on today's KFP community meeting call, @chensun suggested that one of the reasons we did not go with Argo's onExit in v2 was that Argo didn't support multiple exit handlers. I suspect another reason might have been that it also didn't support passing parameters. This has since changed, as Argo has introduced a more robust exit handler mechanism known as a Lifecycle Hook. Hooks should resolve most, if not all, of these concerns, so my suggestion is that we first attempt to utilize this method. Note that hooks can be implemented at the Argo template level (sketch below).
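
For illustration, a hand-written sketch (not actual KFP compiler output; task and template names mirror the depends example above) of what an exit hook attached to the exit-handler task could look like:

- name: root
  dag:
    tasks:
      - name: exit-handler-1
        template: exit-handler-1
        hooks:
          exit: # fires when exit-handler-1 completes, success or failure
            template: post-msg
            arguments:
              parameters:
                - name: status
                  value: "{{tasks.exit-handler-1.status}}"

Because the hook is not a sibling task in the DAG, a succeeding exit_task would no longer mask the workflow's failed phase.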

If we can't use hooks, another suggestion is to adjust the pipeline server logic to be more "intelligent" in how we evaluate the Pipeline Run's condition; currently we just look at the Argo Workflow's phase field (i.e. here).

2. Unable to detect ExitHandler status in exit_task:

The variables {{workflow.status}} and {{workflow.failures}} are specific to Argo exit handlers.

We no longer want Argo implementation details bleeding into pipelines, as pipelines ought to be engine-agnostic, so we should provide better alternatives to the Argo exit handler variables. We should instead work on supporting dsl.PipelineTaskFinalStatus.

Suggested Solutions:
Once again, we might be able to leverage Argo lifecycle hooks as the ExitHandler; we could then populate PipelineTaskFinalStatus internally with some combination of {{workflow.status}}, {{workflow.failures}}, and task output parameters, without requiring the user to do this at the SDK stage.

If we don't go with hooks, then we could perhaps leverage tasks.&lt;name&gt;.status and some combination of task input/output parameters to fill out the PipelineTaskFinalStatus fields.

My suggested path forward is to first attempt to utilize the lifecycle hook here instead of treating the exit_task as yet another task within the parent DAG. This should allow us to resolve both concerns.

@chensun / @zijianjoy let me know if anything obvious stands out to you as a reason we should not try to use hooks here.

@gregsheremeta
Contributor

Link to KFP community meeting where this issue was discussed

@gregsheremeta
Contributor

tl;dr: we should try to use Argo Lifecycle Hooks here instead of having the exit_task be another sibling task to the ExitHandler.

Thanks for the great writeup above. This approach makes sense to me.

@HumairAK HumairAK added this to the KFP 2.4.0 milestone Oct 8, 2024
@MarkTNO

MarkTNO commented Nov 4, 2024

We encountered the same issue here, and it is blocking our move from KFP v1 to v2.

KFP version: 2.2.0
Python packages:

  • kfp 2.8.0
  • kfp-kubernetes 1.2.0
  • kfp-pipeline-spec 0.3.0
  • kfp-server-api 2.0.5

It would be great if this were solved!
I see that this issue was added to the KFP 2.4.0 milestone, which is due by January 15, 2025. When is that release expected to be available?

@vishal-MLE

@MarkTNO how can we find out whether a KFP pipeline run was successful in v1?

@MarkTNO

MarkTNO commented Nov 18, 2024

@vishal-MLE you mean via the Python kfp SDK?
Like so: if "{{workflow.status}}" == "Succeeded":
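
For example, a minimal v1-style component body (illustrative; Argo substitutes the placeholder at runtime when this runs as an exit handler):

def report_status():
    # Argo replaces {{workflow.status}} with the workflow phase,
    # e.g. "Succeeded" or "Failed".
    status = "{{workflow.status}}"
    if status == "Succeeded":
        print("run succeeded")
    else:
        print("run ended with status: " + status)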

@DharmitD
Contributor

There seem to be 2 user concerns here:

  1. Incorrect run status reporting for Runs using ExitHandler: the pipeline's overall run status inaccurately reports Succeeded when using an ExitHandler that includes a failing task but a succeeding exit_task.

Created a separate issue for this concern: #11405

mprahl added a commit to mprahl/pipelines that referenced this issue Dec 17, 2024
As described in kubeflow#10917, exit handlers were implemented as dependent
tasks that always ran within an Argo Workflow. The issue is that this
caused the pipeline to have a succeeded status regardless of if the
tasks within the exit handlers all succeeded.

This commit changes exit handlers to be exit lifecycle hooks on an
Argo Workflow so that the overall pipeline status is not impacted.

Signed-off-by: mprahl <[email protected]>
@HumairAK HumairAK modified the milestones: KFP 2.4.0, KFP 2.5.0 Jan 9, 2025
google-oss-prow bot pushed a commit that referenced this issue Jan 10, 2025
…ers (#11470)

Resolves:
#11405

Signed-off-by: mprahl <[email protected]>