[bug] {{workflow.status}} and {{workflow.failures}} Behavior in KFP v2 dsl.ExitHandler versus v1 #10917
Comments
This version of KFP came from a fork of the distribution, so I'm not sure how much it was in sync with the actual KFP 2.1.0. We also released KFP 2.2.0, which is where we upgraded to Argo 3.4.x (which should, in part, resolve the issue in question). Please test with the latest KFP community release and let us know if the issue is fixed.
@rimolive for reference, the fork is functionally identical to Kubeflow Pipelines 2.1.0; the only changes are about object store authentication for KFP v1. The first issue raised by @tom-pavz may have been fixed in Kubeflow Pipelines 2.2.0, but it seems unlikely given I don't see any relevant changes. Because @tom-pavz gave a very simple example, it should be straightforward to check on a vanilla Kubeflow Pipelines 2.2.0, perhaps even @tom-pavz could do that. However, the second issue (using `dsl.PipelineTaskFinalStatus`) is likely still present as well.
@rimolive @thesuperzapper Thanks for the responses! I am not super confident on how to test upgrading to KFP 2.2.0 with our deployKF installation.
@tom-pavz, regarding testing, I just meant deploying the raw manifests (not deployKF) on a testing cluster (even a local one) to confirm whether the issue is still present. Although, Kubeflow 1.9.0 is not technically out yet, so there is no way to use KFP 2.2.0 from the manifests (except with the release candidates). But yes, it is very likely that both issues still exist in KFP 2.2.0.
Hi @rimolive, we use the latest KFP version (2.2.0) and the latest SDK (2.7.0), and I can confirm that the issue exists. The ExitHandler somehow masks failures, and the KFP UI reports SUCCESS for failed Runs. Therefore users have to check the execution status of each Run's tasks themselves.
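As a concrete illustration of that manual check, here is a minimal sketch using the KFP 2.x Python client; the host, run ID, and the exact v2beta1 response field names are assumptions to verify against your deployment:

```python
import kfp

# Hypothetical host and run ID -- replace with values for your deployment.
client = kfp.Client(host="http://ml-pipeline-ui.kubeflow.svc.cluster.local")
run_id = "<run-id>"

run = client.get_run(run_id)
# The run-level state may report SUCCEEDED even though a wrapped task failed.
print("Run state reported by KFP:", run.state)

# Walk the per-task details instead (v2beta1 field names -- verify for your version).
task_details = (run.run_details.task_details or []) if run.run_details else []
failed_tasks = [t.display_name for t in task_details if t.state == "FAILED"]

if failed_tasks:
    print("Run actually contains failed tasks:", failed_tasks)
```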
Is the KFP community planning on fixing this in a future release, now that it has been confirmed a couple of ways that it is indeed an issue? It seems like a regression from the v1 behavior. Also, I think kubeflow/kubeflow#7422 is definitely a real issue, albeit created in the wrong place. Should another issue be created in this repo for it, or should we use this issue for both?
/assign HumairAK |
Okay, dug into this a bit. tl;dr: we should try to use Argo Lifecycle Hooks here instead of having the exit task compiled as a regular dependent task.

There seem to be 2 user concerns here:

1. Incorrect run status reporting for Runs using `ExitHandler`.
2. Unable to detect the final workflow status (e.g. `{{workflow.status}}` / `{{workflow.failures}}`) from within the exit task.

I think it makes sense to keep both in one issue, however, as they are very much related.

1. Incorrect run status reporting for Runs using `ExitHandler`

In KFP V2 we do not use Argo's exit lifecycle hooks; the `exit_task` is instead compiled as an ordinary task that depends on the `exit-handler-1` DAG. The following illustrates what the compiled Argo Workflow looks like:
```yaml
...
- arguments:
    parameters: ...
  depends: exit-handler-1.Succeeded || exit-handler-1.Skipped || exit-handler-1.Failed || exit-handler-1.Errored
  name: post-msg-driver  # exit_task
  template: system-container-driver
...
```
As you can see, the `exit_task` is just another task with a `depends` expression on the `exit-handler-1` DAG, so as long as the exit task itself succeeds, the Workflow (and therefore the KFP Run) is reported as succeeded. In the future this may be configurable. There is a workaround suggested in the linked thread, however it requires us to match the exit status of the exit task to the outcome of the wrapped tasks.

Suggested solutions: I can see 2 ways we could resolve this. One is to go back to using Argo's exit lifecycle hooks for the exit task. If we can't use hooks, another suggestion is to adjust the pipeline server logic to be more "intelligent" in how we evaluate the Pipeline Run's condition; currently we just look at the Argo Workflow's phase field (i.e. here).

2. Unable to detect the final workflow status from within the exit task
Link to the KFP community meeting where this issue was discussed.
Thanks for the great writeup above. This approach makes sense to me.
We encounter the same issue here, which is blocking our move from KFP v1 to v2. KFP version: 2.2.0. It would be great if this were solved!
@MarkTNO how can we find out if a KFP pipeline run was successful in v1?
@vishal-MLE you mean via the Python kfp SDK?
Created a separate issue for this concern: #11405
…ers (#11470): As described in #10917, exit handlers were implemented as dependent tasks that always ran within an Argo Workflow. The issue is that this caused the pipeline to have a succeeded status regardless of whether the tasks within the exit handler all succeeded. This commit changes exit handlers to be exit lifecycle hooks on an Argo Workflow so that the overall pipeline status is not impacted. Resolves: #11405. Signed-off-by: mprahl <[email protected]>
Environment
- How did you deploy Kubeflow Pipelines (KFP)? deployKF
- KFP version: 2.1.0
- KFP SDK version: 2.7.0
Steps to reproduce
We use `dsl.ExitHandler` plus Argo Workflow's workflow variables to automatically report failed pipeline runs to Slack. This was working well in KFP v1, but no longer works in v2.

I have a simple hello-world pipeline that has a `fail` component wrapped in a `dsl.ExitHandler` with an `exit_task` of `post_msg_to_slack_on_pipeline_fail`.

Starting in KFP v2, at the time the `exit_task` runs, `{{workflow.status}}` gets a value of `Running`, and `{{workflow.failures}}` does not get any value and just prints out the literal string `{{workflow.failures}}`. It seems that, for some reason, in KFP v2 the failure of the task does not propagate up to the pipeline itself by the time the `exit_task` is running. In addition, I notice that the pipeline itself has an "Executed Successfully" status in the UI (see the screenshots of the DAG and sub-DAG below), even though one of its tasks failed, which does not seem correct to me. I also notice that in the UI the `exit-handler-1` sub-DAG is stuck in "Running" status, which also seems incorrect.
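For reference, here is a minimal sketch of the pipeline shape described above; the component bodies and names are illustrative stand-ins, not the reporter's actual code, and whether the Argo placeholders resolve at runtime is exactly what this issue is about:

```python
from kfp import dsl


@dsl.component
def fail():
    import sys
    sys.exit(1)  # deliberately fail so the exit handler fires


@dsl.component
def post_msg_to_slack_on_pipeline_fail(status: str, failures: str):
    # In v1 these resolved to "Failed" and a JSON list of failed tasks;
    # in v2 they arrive as "Running" and the literal placeholder string.
    print("workflow status:", status)
    print("workflow failures:", failures)


@dsl.pipeline(name="exit-handler-repro")
def pipeline():
    exit_task = post_msg_to_slack_on_pipeline_fail(
        status="{{workflow.status}}",      # Argo workflow variable
        failures="{{workflow.failures}}",  # Argo workflow variable
    )
    with dsl.ExitHandler(exit_task):
        fail()
```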
Expected result
The expected result, and the behavior we were getting with v1 of KFP, was that at the time the `exit_task` ran, `{{workflow.status}}` got a value of `Failed` and `{{workflow.failures}}` got a JSON document with information about the failed tasks of the pipeline, which we could then send to Slack.
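To make that expectation concrete, here is a hedged sketch of the kind of v1-era exit-task logic being described: it parses the `{{workflow.failures}}` JSON and posts a summary to a Slack incoming webhook. The webhook URL is a placeholder, and the failure field names follow Argo's documentation of `workflow.failures` (verify for your Argo version).

```python
import json
import urllib.request


def post_failures_to_slack(status: str, failures_json: str) -> None:
    """Summarize failed tasks and post them to a Slack incoming webhook."""
    webhook_url = "https://hooks.slack.com/services/<your-webhook>"  # placeholder

    if status != "Failed":
        return

    # workflow.failures is a JSON list of objects; Argo documents fields
    # such as displayName and message (verify for your version).
    failures = json.loads(failures_json)
    lines = [f"- {f.get('displayName')}: {f.get('message')}" for f in failures]
    payload = {"text": "Pipeline failed:\n" + "\n".join(lines)}

    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```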
Materials and reference
Screenshots of the pipeline and its DAGs in the UI:
![Screenshot 2024-06-17 at 12 40 38 PM](https://private-user-images.githubusercontent.com/145377244/340381954-eec711c7-de2e-4e06-ab60-3447309a04af.png)
![Screenshot 2024-06-17 at 12 40 51 PM](https://private-user-images.githubusercontent.com/145377244/340381978-cf5d59bc-42d2-4c07-89a7-9940efbbf0e8.png)
I also attempted to use `dsl.PipelineTaskFinalStatus` in a `dsl.ExitHandler` as detailed here, but when trying to run the pipeline I get the following error. I saw this issue about it, but it was never really addressed, as I suppose it was created in the wrong repo.
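For context, this is the documented `dsl.PipelineTaskFinalStatus` pattern being referred to (a sketch; the component names are illustrative). The status parameter is not passed by the caller: KFP is supposed to inject the final status of the wrapped group into the exit task at runtime.

```python
from kfp import dsl
from kfp.dsl import PipelineTaskFinalStatus


@dsl.component
def fail():
    import sys
    sys.exit(1)


@dsl.component
def exit_op(status: PipelineTaskFinalStatus):
    # KFP injects the final status of the tasks wrapped by the ExitHandler.
    print("state:", status.state)                # e.g. FAILED
    print("job:", status.pipeline_job_resource_name)
    print("error code:", status.error_code)
    print("error message:", status.error_message)


@dsl.pipeline(name="final-status-repro")
def pipeline():
    exit_task = exit_op()
    with dsl.ExitHandler(exit_task):
        fail()
```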
Impacted by this bug? Give it a 👍.