Confusing log for long running tasks: "dependency 'Task Instance Not Running' FAILED: Task is in the running state" #16163
I bumped into this issue and I misunderstood until now 😅
After some more investigation, it's very likely we see this log appearing an hour after a long-running task started because of the default Celery `visibility_timeout`. @david30907d maybe try changing `visibility_timeout` in the `celery_broker_transport_options` config section. The default is set in Airflow's Celery configuration like this:

```python
import logging

from airflow.configuration import conf


def _broker_supports_visibility_timeout(url):
    return url.startswith("redis://") or url.startswith("sqs://")


log = logging.getLogger(__name__)

broker_url = conf.get('celery', 'BROKER_URL')

broker_transport_options = conf.getsection('celery_broker_transport_options') or {}
if 'visibility_timeout' not in broker_transport_options:
    if _broker_supports_visibility_timeout(broker_url):
        broker_transport_options['visibility_timeout'] = 21600
```
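A minimal sketch, assuming the same `conf` keys used in the snippet above, of how one might check the effective `visibility_timeout` a deployment ends up with:

```python
# Sketch only: print the visibility_timeout that the default Celery config
# above would hand to the broker, falling back to the 21600 s (6 h) default
# used for Redis and SQS brokers when no override is configured.
from airflow.configuration import conf

options = conf.getsection("celery_broker_transport_options") or {}
timeout = options.get("visibility_timeout", 21600)
print(f"Effective Celery visibility_timeout: {timeout} seconds")
```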
@yuqian90 thanks, very useful information. I'll give it a try.
Moved it to 2.2 now - I guess we have not addressed it yet in 2.1.1
In the case where the visibility timeout is reached, it's confusing that there is no clear log line saying the task has been killed for taking too long to complete (if that's indeed what is happening). @potiuk is it the case that the Celery task is killed, or is it simply no longer streaming logs into Airflow at that point?
Not sure. Needs investigation.
@malthe That's not quite what is happening. My understanding is this: the first attempt is still running, and this message is from a second concurrent worker (but for the same try_number) saying "I can't run this task, it's already running."
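A purely illustrative sketch (not Airflow's actual implementation) of the dependency check behind the quoted line: a task instance that is already in the `running` state fails the "Task Instance Not Running" dependency, so the second worker logs the message and backs off instead of starting a duplicate run.

```python
# Illustrative only: simplified model of the "Task Instance Not Running"
# dependency check; names and structure are assumptions, not Airflow code.
from enum import Enum
from typing import Tuple


class TIState(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"


def task_not_running_dep(ti_state: TIState) -> Tuple[bool, str]:
    """Return (passed, reason) for the 'Task Instance Not Running' dependency."""
    if ti_state == TIState.RUNNING:
        return False, ("dependency 'Task Instance Not Running' FAILED: "
                       "Task is in the running state")
    return True, "Task is not running"


# The second worker evaluates the dep against the still-running first attempt:
passed, reason = task_not_running_dep(TIState.RUNNING)
if not passed:
    print(f"Dependencies not met, {reason}")
```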
@ashb so regardless of whether the first worker is still running or defunct/dead, shouldn't the second worker be able to "take over" somehow? Otherwise, what's the point in trying?
This is largely Airflow and Celery "fighting" over behaviour.
Hi!
From what I've observed, there's no point for the second Airflow worker to try running it. We should just silence the error.
Hi @yuqian90!
@yuqian90 How can the worker tell it is the "second" time it's running?
I have been dealing with this issue since March 01, 2023. Airflow version: 2.2.2, and we are using MWAA. Do you have any solution for it? When I run the tasks I get this error and the subtask gets stuck in the queued or running state. Here are the logs: INFO - Dependencies not met for <TaskInstance: ...>, dependency 'Task Instance Not Running' FAILED: Task is in the running state
As a first assumption, it looks like an environment issue; as a temporary fix, increase retries to the maximum extent.
I also did the same and opened a ticket with AWS.
Please also let us know if you get any response from AWS with a solution, as we have also opened a ticket with Google Cloud.
I've been facing this problem too; in my case my task doesn't even run. It is a task that runs a job in Databricks with DatabricksDeferrableOperator, and its upstreams last for approximately 2 hours, but in the log of the tasks that fail, the execution date is earlier than the end date of the upstreams. So it looks like it ran without the upstreams having successfully completed. When analyzing the scheduler log I found this:
One more observation: because of the error below, the job used to either fail or ignore the warning. Now the job is getting stuck; it is not even failing and moving forward. Has anyone got a solution? Please help.

```
{taskinstance.py:999} INFO - Dependencies not met for <TaskInstance: file_data_transfer.taskid_data_transfer 2023-02-19T00:01:00+00:00 [running]>, dependency 'Task Instance Not Running' FAILED: Task is in the running state
{taskinstance.py:999} INFO - Dependencies not met for <TaskInstance: file_data_transfer.taskid_data_transfer 2023-02-19T00:01:00+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
{local_task_job.py:99} INFO - Task is not able to be run
```
Does anyone have a solution to fix this?
...
Yes I have, and that's not the solution; the task keeps being "externally set to failed".
What action have you applied? And is the task at least going to complete temporarily?
I can give you the full logs right now. My task is a PythonOperator that inserts huge CSV files into Postgres (concretely, I pass a ...). My task is always externally failed after 6 hours :/
Same issue. A task that used to take ~4 hours is now taking ~8 hours since I switched to Celery. It seems that maybe on the retry it finishes?
In our context (GCP, Airflow 2.2.5, #16163 (comment)) the problem was correlated with Kubernetes pods being evicted due to lack of resources. After scaling up our Kubernetes cluster and updating to Airflow 2.4.3 (plus google-providers 8.8.0, due to https://stackoverflow.com/questions/74730471/gcstobigqueryoperator-not-working-in-composer-2-1-0-airflow-2-3-4/75319966#75319966), we have not seen this problem in the month since upgrading.
Hello, is there any new evidence on this bug?
@ashb @potiuk and others watching: if you experience this issue, it's just that your task is taking more than 6 hours. In most cases your task will continue running, but you can't see its logs until the task finishes or fails. As a temporary fix, you can increase the time it takes for this to happen by increasing the value of Celery's `visibility_timeout`.

I believe I have uncovered the cause of this issue and would appreciate feedback. The question is, why are we losing the logs after 6 hours? The Celery `visibility_timeout` (21600 seconds by default for Redis and SQS brokers, as set in the snippet quoted earlier) is how long the broker waits for a task's message to be acknowledged before assuming the worker was lost and delivering it to another worker. When this happens, the new instance of this task will realize the task is still running on the old worker (it's not failed, and is even heart-beating the TaskInstance), and will correctly wait (because a task cannot be "running" in two places). But now the Airflow UI only shows the logs for this new "phantom task", which will always be the "dependency 'Task Instance Not Running' FAILED: Task is in the running state" line from this issue's title.
Effectively, this issue is the result of Celery taking matters into its own hands (outside of the control of the Airflow scheduler) and telling a new worker to start the task which is still running on the old worker. Setting our `visibility_timeout` to a very large value effectively prevents this redelivery. Celery's purpose for `visibility_timeout` is to recover messages whose worker died before acknowledging them, but Airflow's scheduler already detects and handles stalled tasks on its own, so setting it very high mainly gives up a Celery-level recovery we don't rely on. Finally, I want to highlight that Airflow's docs are confusing about what the default `visibility_timeout` actually is (the code quoted earlier sets it to 21600 seconds, i.e. 6 hours, for Redis and SQS brokers).
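As a rough illustration of the workaround above, a minimal sketch assuming Airflow's standard `AIRFLOW__<SECTION>__<KEY>` environment-variable convention for the `celery_broker_transport_options` section; the 7-day value is an arbitrary example, not a recommendation from this thread:

```python
# Sketch: raise Celery's visibility_timeout so the broker stops re-delivering
# still-running tasks after 6 hours. The variable must be present in the
# environment of every Airflow component (scheduler and workers) at startup.
import os

SEVEN_DAYS = 7 * 24 * 3600  # 604800 seconds, arbitrary example value

os.environ["AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__VISIBILITY_TIMEOUT"] = str(SEVEN_DAYS)

# Equivalent airflow.cfg entry:
#
#   [celery_broker_transport_options]
#   visibility_timeout = 604800
```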
Care to submit a PR, @thesuperzapper? That would be a small thing to give back for the years of business you have built on top of Airflow.
According to the Celery docs:
It seems that it's a way to ensure at-least-once processing rather than at-most-once. I would say that since Airflow does retrying of its own accord, we want the latter, which is why it should probably remain disabled.
@thesuperzapper @potiuk, we are also facing the same issue; however, in our case the previous set of logs is getting trimmed and the logs start with:
Logs of the previously running task are appended after these lines. Airflow version: 2.6.0
I am seeing this as well with Airflow v2.5.1 (not using Kubernetes) with the Celery executor. We also use S3 remote logging, and observe the regular log for the first hour, then the "phantom" dependency message. Once the task completes, the "real" log for the original task is appended below it.
Apache Airflow version: 1.10.* / 2.0.* / 2.1.*
Kubernetes version (if you are using kubernetes) (use `kubectl version`): Any
Environment:
- Kernel (e.g. `uname -a`): Any
What happened:
This line in the TaskInstance log is very misleading. It seems to happen for tasks that take longer than one hour. When users are waiting for tasks to finish and see this in the log, they often get confused. They may think something is wrong with their task or with Airflow. In fact, this line is harmless. It's simply saying "the TaskInstance is already running so it cannot be run again".
What you expected to happen:
The confusion is unnecessary. This line should be silenced in the log. Or it should log something clearer.
How to reproduce it:
Any task that takes more than an hour to run has this line in the log.
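To make the reproduction concrete, a minimal sketch of such a DAG, assuming the Celery executor; the DAG and task ids are illustrative, and the sleep only needs to exceed the broker's visibility timeout (one hour in the setup described here, six hours with the default shown earlier):

```python
# Reproduction sketch (illustrative ids, not from the original report): a
# single task that sleeps past the Celery visibility timeout, after which the
# "Task is in the running state" line appears in the task log.
import time
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def sleep_past_visibility_timeout(seconds: int = 2 * 3600) -> None:
    # Two hours comfortably exceeds a one-hour visibility timeout.
    time.sleep(seconds)


with DAG(
    dag_id="long_running_task_repro",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(
        task_id="long_sleep",
        python_callable=sleep_past_visibility_timeout,
    )
```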