Executor reports task instance (...) finished (failed) although the task says it's queued #39717
I'm not sure there's an Airflow issue here. My initial thought is that you are experiencing issues related to your workers, and perhaps they are falling over due to resource constraints, e.g. disk or RAM? I can see that you are using dynamic task mapping which, depending on what you are asking the workers to do, how many parallel tasks there are and the number of workers you have, could be overloading your capacity.
Not sure... it seems related to Redis? I have seen other people report similar issues:
Also, a lot of DAGs are failing with the same reason, so it's not entirely tied to Task Mapping at all. Some tasks fail very early. Also, this server has a lot of RAM, of which I've granted ~12 GB to each worker, and the tasks are very simple, just HTTP requests; all of them run in less than 2 minutes when they don't fail.
I think the log you shared (source) erroneously replaced the "stuck in queued" log somehow. Can you check your scheduler logs for "stuck in queued"?
@RNHTTR there's nothing stating "stuck in queued" in the scheduler logs.
Same issue here.
I had the same issue when running hundreds of sensors in reschedule mode - a lot of the time they got stuck in the queued status and raised the same error that you posted. It turned out that our redis pod used by Celery restarted quite often and lost the info about queued tasks. Adding persistence to redis seems to have helped. Do you have persistence enabled?
Can you help me with how to add this persistence?
Hi @nghilethanh-atherlabs, I've been experimenting with these configs as well:

```ini
# airflow.cfg
# https://airflow.apache.org/docs/apache-airflow-providers-celery/stable/configurations-ref.html#task-acks-late
# https://github.com/apache/airflow/issues/16163#issuecomment-1563704852
task_acks_late = False

# https://github.com/apache/airflow/blob/2b6f8ffc69b5f34a1c4ab7463418b91becc61957/airflow/providers/celery/executors/default_celery.py#L52
# https://github.com/celery/celery/discussions/7276#discussioncomment-8720263
# https://github.com/celery/celery/issues/4627#issuecomment-396907957
[celery_broker_transport_options]
visibility_timeout = 300
max_retries = 120
interval_start = 0
interval_step = 0.2
interval_max = 0.5
# sentinel_kwargs = {}
```

For the redis persistence, you can refer to their config file to enable it. Not sure it will sort things out, but let's keep trying, folks.
```yaml
# docker-compose.yml
redis:
  image: bitnami/redis:7.2.5
  container_name: redis
  environment:
    - REDIS_DISABLE_COMMANDS=CONFIG
    # The password will come from the config file, but we need to bypass the validation
    - ALLOW_EMPTY_PASSWORD=yes
  ports:
    - 6379:6379
  # command: /opt/bitnami/scripts/redis/run.sh --maxmemory 2gb
  command: /opt/bitnami/scripts/redis/run.sh
  volumes:
    - ./redis/redis.conf:/opt/bitnami/redis/mounted-etc/redis.conf
    - redis:/bitnami/redis/data
  restart: always
  healthcheck:
    test:
      - CMD
      - redis-cli
      - ping
    interval: 5s
    timeout: 30s
    retries: 10
```
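As a reference for the mounted redis.conf above, a minimal persistence setup could look roughly like this sketch (values are illustrative; see the Redis persistence docs for details):

```conf
# redis/redis.conf -- minimal persistence sketch, illustrative values
# Append-only file (AOF): log writes so queued Celery tasks survive a restart
appendonly yes
appendfsync everysec
# RDB snapshots as a second layer: snapshot if N changes happened within M seconds
save 900 1
save 300 10
save 60 10000
# Keep data on the volume mounted in the compose file above
dir /bitnami/redis/data
```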
Seeing this issue on 2.9.1 as well, also only with sensors. We've found that the DAG is timing out trying to fill up the DagBag on the worker. Even with debug logs enabled I don't have a hint about where in the import it's hanging.
On the scheduler the DAG imports in less than a second. And not all the tasks from this DAG fail to import; many import just fine, at the same time on the same Celery worker. Below is the same DAG file as above, importing fine:
One caveat/note is that it looks like it's the 2nd run/retry of each sensor that runs just fine. We've also confirmed this behavior was not present on Airflow 2.7.3, and only started occurring since upgrading to 2.9.1.
@andreyvital thank you so much for your response. I have set it up and it works really great :)
I was working on the issue with @seanmuth, and increasing the DAG parsing timeout solved the issue.
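For anyone else hitting the worker-side DagBag timeout, the relevant parsing timeouts look roughly like this; the exact knob and values used above aren't specified, so treat this sketch as an assumption with illustrative numbers:

```ini
# airflow.cfg -- illustrative values, not a recommendation
[core]
# How long (seconds) importing a DAG file may take before it is timed out;
# workers re-parse the DAG file when they run a task.
dagbag_import_timeout = 120
# Upper bound (seconds) for processing a single DAG file as a whole.
dag_file_processor_timeout = 180
```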
Hello everyone, I'm currently investigating this issue, but I haven't been able to replicate it yet. Could you please try the setting referenced here?
airflow/airflow/providers/celery/executors/celery_executor_utils.py, lines 187 to 188 in 2d53c10
Spotted the same problem with Airflow 2.9.1 - the problem didn't occur earlier, so it's strictly related to this version. It happens randomly on random task executions. Restarting the scheduler and triggerer helps - but that is only our temporary workaround.
We've released apache-airflow-providers-celery 3.7.2 with enhanced logging. Could you please update the provider version and check the debug log for any clues? Additionally, what I mentioned in #39717 (comment) might give us some clue as well. Thanks!
Can you try to set https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#schedule-after-task-execution to False and see if it helps @trlopes1974?
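For reference, that option lives in the [scheduler] section; a sketch of the suggested change (setting the env var AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION=False is equivalent):

```ini
# airflow.cfg -- sketch of the suggested change
[scheduler]
# Controls whether the task supervisor runs a "mini scheduler" right after a
# task finishes to schedule more tasks of the same DAG run.
schedule_after_task_execution = False
```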
I see the same issue, with dynamic task mapping in multiple instances of a DAG. All the pods have enough CPU/memory. Executor: CeleryKubernetes. When I re-run the failed tasks with this error, they go through and finish successfully.
@vizeit and anyone looking here and tempted to report "I have the same issue": PLEASE, before doing it, upgrade to 2.9.2 and the latest celery provider. And when you do, report here whether things are fixed, and if not, add logs from the celery executor. If you actually look at the discussion - some of the related issues were fixed in 2.9.2, and Celery logging has been improved in the latest provider to add more information. So the best thing you can do is not really to post "I have the same issue" but to upgrade and let us know if it helped, and the second best thing is to upgrade the celery provider and post relevant logs. Just posting "I have the same issue in 2.9.1" is not moving the needle when it comes to investigating and fixing such a problem.
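A sketch of the suggested upgrade for a pip-based deployment (pin versions and use your usual constraints mechanism as appropriate for your environment):

```bash
# Upgrade Airflow core and the celery provider; adjust to your deployment method.
pip install --upgrade "apache-airflow==2.9.2" "apache-airflow-providers-celery>=3.7.2"
```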
Sure, I can upgrade and check. I believe others here have already tested on 2.9.2 and reported the same issue.
@NBardelot But going back to the point, we saw that some task errors occurred while a secret retrieval was requested. Further investigation led us to proxy-related DNS issues when accessing Vault that were causing the failures.
Maybe this is related to the SSH/SFTP operators? I did find a similar issue referring to paramiko (used in SSH/SFTP). For instance, one of the failing tasks:
Quite likely.
Any thoughts @ashb, @ephraimbuddy - since you were involved in similar cases? I think this is a small fix (and I'm happy to submit it - it's merely catching and logging all exceptions and skipping the mini-scheduler when that happens) - but I am not 100% sure if that is a good idea.
It's a good idea and would solve at least one of the issues that can lead to that log message. I'm okay with the solution. Other issues can also lead to the scheduler sending this log.
Yeah - but one less is good :)
I have been following this thread recently since we also experienced this issue on Airflow. I see two issues being discussed in this thread:
Based on my observation of the logs when the issue happened the other day, these two are not the same issue. Issue 2 happens frequently; I can see about 1600 messages of such errors on a daily basis, and the number of errors I observe every day is stable. Thanks @potiuk for providing a fix. https://github.com/apache/airflow/pull/41260/files could address issue 2, but issue 1 should be something else, because on the day the incident happened on our platform I saw a burst of messages like this. By looking at the scheduler log when the issue happened, I noticed this pattern being repeated for the same task multiple times for a given dag:
The same line is repeated for each task in that dag hundreds of times, which seems abnormal. It looks like the scheduler's DAG processor runs into some issue and something fails during the scheduling phase. When this happens, all workers are still online and redis (the celery broker) is also healthy, but all workers stopped picking up tasks.
This is not a recurring issue; I have only observed it once after running on 2.8.4 for months.
I tend to agree with @scaoupgrade.
@scaoupgrade - @trlopes1974. Yes. We actually discussed it a few comments above, in case you missed it:
And yes - as long as we have more details that we can diagnose, we might also address other similar issues in the future - one thing at a time. I am actually proposing to close this issue, and if someone can open another similar issue with details that explain other issues of the same type as "related to THIS ISSUE" - it would be great. It's extremely hard to discuss and reason when multiple different issues are mixed in a single huge thread. Closing this one and opening a new one, seeing that it still happens after fixing part of the issue, seems like the best thing that we can do. Generally speaking - unless we see enough evidence that points to some issue that can be diagnosed and/or reproduced, there is not much anyone can do about it. With the stack trace from @trlopes1974, it was quite easy to figure out. PROPOSAL: maybe anyone who experiences this one applies the patch from my PR and then sees if they still hit the issue; if they do - they open a new issue - hopefully with some details that will allow someone to diagnose/reproduce it, once it's known that at least this one is already patched. The patch should be easy to apply on any version of Airflow.
I can easily close the issue and add simple instructions on what anyone who sees a similar issue should do (apply the patch and, if they still see a similar issue, report all the details there).
I'm a noob, but I can follow instructions 😂
Unfortunately I only have tomorrow to make it happen as I'm going on vacation! 💪💪💪
Thanks @trlopes1974. You have provided a lot of useful information on this issue. I was following your logs and trying to understand what exactly happened. This is not a reproducible issue on my side; there needs to be a combination of different things happening together to lead to this bug. The main components involved could be:
Scheduler side: mainly the interaction with the executor to check task status and re-queue tasks:
1. scheduler heartbeats the executor to process events: https://github.com/apache/airflow/blame/main/airflow/jobs/scheduler_job_runner.py#L879-L904
2. scheduler timer to fail tasks which have been queued long enough: https://github.com/apache/airflow/blob/main/airflow/jobs/scheduler_job_runner.py#L1091-L1094
Worker side: I don't notice anything abnormal on the worker side when the scheduler error happens, for now.
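For anyone poking at the second item, the scheduler-side timeout that fails long-queued tasks is configurable; a sketch with an illustrative value (not a recommendation):

```ini
# airflow.cfg -- illustrative value, not a recommendation
[scheduler]
# Seconds a task may sit in "queued" before the scheduler marks it failed;
# raising it can buy time to inspect tasks that appear stuck.
task_queued_timeout = 1200
```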
We did see some cases where the airflow task was marked as failed, you could see an external_executor_id in the task details, but the celery task never appeared in Flower.
Since we adjusted some DAG processing configurations (I'm on the phone now, can't remember which), it seems to have disappeared... I think it was related to DAG reloading/timeouts ("dag bag"??) but it made no sense why it would fail randomly on different DAGs...
When the mini-scheduler raises an exception, it has a somewhat weird side effect - the task succeeds, but it is seen as failed and the scheduler gets confused. Also, the flower celery worker in this case shows an error. This happens, for example, when the DAG contains non-serializable tasks. This PR swallows any exceptions raised in the mini-scheduler and simply logs them as errors rather than failing the process. The mini-scheduler is generally optional and we already sometimes skip it, so occasional skipping is not a big problem. Fixes: apache#39717
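In spirit, the change boils down to a broad try/except around the mini-scheduler call, roughly like the sketch below (the function and method names here are illustrative, not the exact code from the PR):

```python
# Sketch of the approach described above -- not the actual patch.
import logging

log = logging.getLogger(__name__)


def run_mini_scheduler_safely(task_instance, session) -> None:
    """Run the optional mini-scheduler; never let it fail the task.

    `schedule_downstream_tasks` stands in for the mini-scheduler entry point;
    the point is that any error is logged and swallowed instead of propagated.
    """
    try:
        task_instance.schedule_downstream_tasks(session=session)
    except Exception:
        # The mini-scheduler is only an optimization: skipping it is safe,
        # because the main scheduler loop will schedule the tasks anyway.
        log.exception("Error scheduling downstream tasks; skipping mini-scheduler")
```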
#41276 to take care of the "2" scenario, as "1" has a solution now.
@potiuk, if you guide me, I can deploy the fix to our production env and see if "1" goes away.
@trlopes1974 -> I see you opened a new issue (cool) - for testing, just apply the patch from #41260 to your installation - that might mean building your own image with the change applied (git patch might be useful to generate a patch that can be applied) or just manually modifying the code in the running venv/container.
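One possible way to do the image route, sketched under the assumption that the change has been exported as a git-format patch file named airflow-41260.patch (the file name and base image tag are illustrative):

```dockerfile
# Sketch: build a custom image with the patch applied on top of the
# installed airflow package. Names and versions are illustrative.
FROM apache/airflow:2.9.2

# The slim base image may not ship the `patch` utility, so install it.
USER root
RUN apt-get update \
    && apt-get install -y --no-install-recommends patch \
    && rm -rf /var/lib/apt/lists/*
USER airflow

COPY airflow-41260.patch /tmp/airflow-41260.patch
# Resolve site-packages at build time and apply the repo-style (a/ b/) diff.
RUN cd "$(python -c 'import airflow, pathlib; print(pathlib.Path(airflow.__file__).parent.parent)')" \
    && patch -p1 < /tmp/airflow-41260.patch
```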
Will have to postpone that... vacation mode is on now!
I tried to change taskinstance.py but got an error regarding fab_auth?? (or something) and had to revert.
Apache Airflow version
2.9.1
If "Other Airflow 2 version" selected, which one?
No response
What happened?
What you think should happen instead?
No response
How to reproduce
I am not sure, unfortunately. But every day I see my tasks being killed randomly, without a good reason for why they got killed/failed.
Operating System
Ubuntu 22.04.4 LTS
Versions of Apache Airflow Providers
Deployment
Docker-Compose
Deployment details
Anything else?
No response
Are you willing to submit PR?
Code of Conduct