Modify KPO to log container log periodically #37279
Conversation
LGTM. CI is reporting 1 failure for mypy.
The only failed test in the previous CI run has passed now: https://github.com/apache/airflow/actions/runs/7868578876/job/21466117939#step:24:1. I think we're good to merge @pankajastro
```python
        )
        if self.do_xcom_push:
            return result

    def execute_complete(self, context: Context, event: dict, **kwargs):
```
How is this method used, and where?
@pankajastro can you please take a look? With the use of the new method `trigger_reentry`, if `execute_complete` is no longer getting used anywhere, perhaps we could remove the method and clean this up?
I wonder if it was kept for backward compatibility.
I prefer not to include this change in the current providers wave, if possible, so we have time to test it. WDYT?
Thanks @hussein-awala
Actually, it would be nice for us to include this in the providers wave, as we are depending on the change 🙂.
We tested this in our internal tests and it worked fine. How about we test this with the RC? Would that time be sufficient if we include this? We can work on providing fixes with the utmost priority if a bug is identified. Or do you see some other risks?
Thanks for the feedback. Yes, this method is public, so we can't remove it. Similarly, a couple of methods in the trigger are unused now; I kept them as well because they might be used by someone.
We have conducted some testing on our end. Once the RC is available, we will perform additional tests. Hopefully, some community members will also test the RC, so we should be able to find and fix anything unusual before releasing it.
@pankajastro, @pankajkoti - this is causing failures for deferrable tasks that should complete. I reverted this commit and things work normally after. Hoping you can help figure out the issue or revert this before the providers push.

Sample DAG:

```python
from airflow import DAG
from airflow.providers.google.cloud.operators.kubernetes_engine import (
    GKEStartPodOperator,
)

DEFAULT_TASK_ARGS = {
    "owner": "gcp-data-platform",
    "start_date": "2021-04-20",
    "retries": 0,
    "retry_delay": 60,
}

with DAG(
    dag_id="test_gke_op",
    schedule_interval="@daily",
    max_active_runs=1,
    max_active_tasks=5,
    catchup=False,
    default_args=DEFAULT_TASK_ARGS,
) as dag:
    _ = GKEStartPodOperator(
        task_id="whoami",
        name="whoami",
        cmds=["gcloud"],
        arguments=["auth", "list"],
        image="gcr.io/google.com/cloudsdktool/cloud-sdk:slim",
        project_id="redacted-project-id",
        namespace="airflow-default",
        location="us-central1",
        cluster_name="airflow-gke-cluster",
        service_account_name="default",
        deferrable=True,
        # do_xcom_push=True,
    )

    _ = GKEStartPodOperator(
        task_id="fail",
        name="fail",
        cmds=["bash"],
        arguments=["-xc", "sleep 2 && exit 1"],
        image="gcr.io/google.com/cloudsdktool/cloud-sdk:slim",
        project_id="redacted-project-id",
        namespace="airflow-default",
        location="us-central1",
        cluster_name="airflow-gke-cluster",
        service_account_name="default",
        deferrable=True,
        # do_xcom_push=True,
    )
```

Expected result (obtained after reverting): (screenshot omitted)

Unexpected result (after rebasing; whoami should succeed): (screenshot omitted)

whoami logs: (omitted)
@vchiapaikeo Would you please offer more help in understanding the issue and its reproducibility? If I am understanding correctly, is it that deferrable tasks which should complete are now failing?
Subclasses of KPO still call the execute_complete method - https://github.com/apache/airflow/blob/main/airflow/providers/google/cloud/operators/kubernetes_engine.py#L584-L593. Is this expected?
Yes, I think that should not break. The existing public methods should continue to work as they were. I think you've caught the cause of the issue in #37279 (comment). Thanks for the help in testing and identifying the issue. We will fix this. cc: @pankajastro
Hmm, it looks like the "name" => "pod_name" key rename is the issue, then.
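To illustrate the failure mode, here is a minimal hypothetical subclass (not the actual provider code; the import path and event keys are assumptions based on the discussion above) that reads the trigger event payload in `execute_complete` and breaks as soon as the key it expects is renamed:

```python
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator


class MyPodOperator(KubernetesPodOperator):
    """Hypothetical subclass that inspects the trigger event in execute_complete."""

    def execute_complete(self, context, event, **kwargs):
        # If the trigger used to emit {"status": ..., "name": ...} and now emits
        # {"status": ..., "pod_name": ...}, this lookup raises KeyError and the
        # task fails even though the pod itself may have succeeded.
        pod_name = event["name"]
        self.log.info("Pod %s finished with status %s", pod_name, event.get("status"))
        return super().execute_complete(context, event, **kwargs)
```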
Hey @vchiapaikeo, would it be possible for you to test #37363 with your use case?
…run method (#37454) In #37279 I introduced periodic logging of the container. During the process, I also changed a few event dict key names, which is problematic for anyone extending the KPO trigger. Also, the current execute_complete method was unused in the KPO operator and was problematic for anyone using it in an extended class, since the trigger can now also emit an event while the pod is still in an intermediate state. One reported issue: #37279 (comment). In this PR I'm restoring the trigger event dict structure and deprecating the execute_complete method.
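As a rough sketch of what "deprecating execute_complete while keeping it working for subclasses" could look like (the `@deprecated` decorator, the stub body, and the class name are assumptions for illustration, not necessarily the provider's exact implementation):

```python
from deprecated import deprecated

from airflow.models import BaseOperator


class MyKubernetesPodOperator(BaseOperator):
    """Simplified sketch; the real operator has many more members."""

    def trigger_reentry(self, context, event):
        # New single entry point for handling trigger events (simplified stub).
        self.log.info("Handling trigger event: %s", event)

    @deprecated(reason="use trigger_reentry() instead")
    def execute_complete(self, context, event, **kwargs):
        # Keep the public method alive for existing subclasses, but route it
        # through the new entry point so there is a single code path.
        return self.trigger_reentry(context=context, event=event)
```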
…#37514) I am observing an issue with respect to the recent deferrable KPO changes in PR #37279 and #37454: when the pod fails to start within a specified timeout value, the KPO task hangs forever, whereas it is expected to fail after the timeout. This PR fixes the issue by correcting a logical error in detecting whether the elapsed timeout has occurred, so that the timeout trigger event is raised.
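For illustration, a hedged sketch of the kind of elapsed-timeout check a deferrable trigger loop needs; the function names and event payloads below are made up and do not mirror the provider's actual trigger code:

```python
import asyncio
from datetime import datetime, timedelta, timezone


async def pod_is_running() -> bool:
    """Hypothetical placeholder for a real 'is the pod running yet?' check."""
    return False


async def wait_for_pod_start(startup_timeout: int, poll_interval: int = 5) -> dict:
    """Return a 'running' event, or a 'timeout' event once startup_timeout elapses."""
    deadline = datetime.now(timezone.utc) + timedelta(seconds=startup_timeout)
    while True:
        if await pod_is_running():
            return {"status": "running"}
        # The bug class described above: if this elapsed-time comparison is wrong,
        # the loop never emits the timeout event and the deferred task hangs forever.
        if datetime.now(timezone.utc) >= deadline:
            return {"status": "timeout", "message": "Pod took too long to start"}
        await asyncio.sleep(poll_interval)


if __name__ == "__main__":
    print(asyncio.run(wait_for_pod_start(startup_timeout=10, poll_interval=2)))
```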
Should this already be compatible with the `GKEStartPodOperator`? I am on …, and I have … configured.
I think it should work
Hmm, this will require small changes: PR #39348.
The current state of the KubernetesPodOperator (KPO) only prints container logs at the end of task execution. While this is sufficient for short-running tasks, it becomes less user-friendly when the container runs for an extended period. This PR enhances the KPO by modifying the trigger and operator to fetch container logs periodically, making it possible to monitor the task's progress in the Airflow task UI. A new parameter has been introduced to the operator: the maximum time, in seconds, that the task should remain deferred before resuming to fetch the latest logs.
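For reference, a hedged usage sketch of the feature described above; the parameter name `logging_interval` is assumed from the KubernetesPodOperator's deferrable options and may differ in your provider version, so check the provider docs for the exact name and semantics:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="kpo_periodic_logs_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    long_running = KubernetesPodOperator(
        task_id="long_running",
        name="long-running",
        namespace="default",
        image="busybox",
        cmds=["sh", "-xc"],
        arguments=["for i in $(seq 1 60); do echo step $i; sleep 10; done"],
        deferrable=True,
        get_logs=True,
        # Assumed parameter: resume from the deferred state every 120 seconds
        # to fetch and print the latest container logs in the task UI.
        logging_interval=120,
    )
```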