ERROR - Unknown error in KubernetesJobWatcher. Failing #12229

Closed
gakhrejah opened this issue Nov 10, 2020 · 14 comments
Labels
kind:bug This is a clearly a bug provider:cncf-kubernetes Kubernetes provider related issues

Comments

@gakhrejah

Hi Team,

We are getting the error logs below while running Apache Airflow on AWS EKS.
All the pods (tasks) are in the Completed state but are not removed by Airflow. I have to restart the scheduler manually, after which everything works for 2-3 days. Then all the tasks get stuck again.

ERROR LOGS
[2020-11-10 07:00:07,752] {{kubernetes_executor.py:447}} ERROR - Error while health checking kube watcher process. Process died for unknown reasons
[2020-11-10 07:00:07,765] {{kubernetes_executor.py:351}} INFO - Event: and now my watch begins starting at resource_version: 107544455
[2020-11-10 07:00:07,782] {{kubernetes_executor.py:342}} ERROR - Unknown error in KubernetesJobWatcher. Failing
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 340, in run
    self.worker_uuid, self.kube_config)
  File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 364, in _run
    **kwargs):
  File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 177, in stream
    status=obj['code'], reason=reason)
kubernetes.client.exceptions.ApiException: (410)
Reason: Gone: too old resource version: 107544455 (108550177)

Process KubernetesJobWatcher-135237:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 340, in run
    self.worker_uuid, self.kube_config)
  File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 364, in _run
    **kwargs):
  File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 177, in stream
    status=obj['code'], reason=reason)
kubernetes.client.exceptions.ApiException: (410)
Reason: Gone: too old resource version: 107544455 (108550177)

AIRFLOW_VERSION=1.10.9
ENVIRONMENT: QA | PROD
Docker Image: python:3.7-slim-buster

Please let us know if you require any more information and how we can resolve this issue. We have also tried upgrading Airflow to 1.10.10, but no luck.
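
For readers hitting the same trace: the 410 response means the resource version the watcher tried to resume from (107544455 here) is older than what the API server still retains, so the watch cannot continue from that point. Below is a minimal sketch of one way a watch loop could recover, by dropping the stale version and starting a fresh watch. It is not Airflow's implementation. `GoneError`, `watch_with_reset`, `stream_factory` and `fake_stream` are hypothetical names used for illustration; the only thing assumed about the real client is that the raised exception carries a `status` attribute, as the `status=obj['code']` frame in the traceback suggests.

```python
from typing import Callable, Iterable, Iterator, Optional


class GoneError(Exception):
    """Hypothetical stand-in for an API error that exposes a `status` attribute."""

    def __init__(self, status: int, reason: str = "") -> None:
        super().__init__(f"({status}) {reason}")
        self.status = status
        self.reason = reason


def watch_with_reset(
    stream_factory: Callable[[Optional[str]], Iterable[dict]],
    start_version: Optional[str] = None,
    max_restarts: int = 3,
) -> Iterator[dict]:
    """Yield events; on a 410 response, forget the stored version and watch again."""
    resource_version = start_version
    restarts = 0
    while True:
        try:
            for event in stream_factory(resource_version):
                yield event
            return  # the stream ended without an error
        except Exception as exc:
            # Only a 410 is treated as recoverable here; anything else propagates.
            if getattr(exc, "status", None) != 410 or restarts >= max_restarts:
                raise
            restarts += 1
            resource_version = None  # start a fresh watch instead of resuming


def fake_stream(resource_version: Optional[str]) -> Iterator[dict]:
    """Simulated API: rejects the stale version seen in the log above."""
    if resource_version == "107544455":
        raise GoneError(410, "Gone: too old resource version: 107544455 (108550177)")
    yield {"type": "ADDED", "name": "pod-a"}
    yield {"type": "MODIFIED", "name": "pod-a"}


events = list(watch_with_reset(fake_stream, start_version="107544455"))
print([e["type"] for e in events])  # ['ADDED', 'MODIFIED']
```

With the real Kubernetes client, `stream_factory` would be a thin wrapper around the watch call; the restart cap is there so a persistent 410 still surfaces as an error instead of looping forever.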

@gakhrejah gakhrejah added the kind:bug This is a clearly a bug label Nov 10, 2020
@boring-cyborg

boring-cyborg bot commented Nov 10, 2020

Thanks for opening your first issue here! Be sure to follow the issue template!

@garacio

garacio commented Nov 12, 2020

Same on a bare-metal k8s installation:

[airflow-6fb4b8f58c-2jszc airflow] [2020-11-05 08:13:34,833] {kubernetes_executor.py:293} ERROR - Unknown error in KubernetesJobWatcher. Failing 
[airflow-6fb4b8f58c-2jszc airflow] Traceback (most recent call last): 
[airflow-6fb4b8f58c-2jszc airflow]   File "/usr/local/lib/python3.7/site-packages/airflow/executors/kubernetes_executor.py", line 287, in run 
[airflow-6fb4b8f58c-2jszc airflow]     self.worker_uuid, self.kube_config) 
[airflow-6fb4b8f58c-2jszc airflow]   File "/usr/local/lib/python3.7/site-packages/airflow/executors/kubernetes_executor.py", line 323, in _run 
[airflow-6fb4b8f58c-2jszc airflow]     for event in list_worker_pods(): 
[airflow-6fb4b8f58c-2jszc airflow]   File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 177, in stream 
[airflow-6fb4b8f58c-2jszc airflow]     status=obj['code'], reason=reason) 
[airflow-6fb4b8f58c-2jszc airflow] kubernetes.client.exceptions.ApiException: (410) 
[airflow-6fb4b8f58c-2jszc airflow] Reason: Expired: too old resource version: 42945421 (43412510) 
[airflow-6fb4b8f58c-2jszc airflow]  
[airflow-6fb4b8f58c-2jszc airflow] Process KubernetesJobWatcher-66040: 
[airflow-6fb4b8f58c-2jszc airflow] Traceback (most recent call last): 
[airflow-6fb4b8f58c-2jszc airflow]   File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap 
[airflow-6fb4b8f58c-2jszc airflow]     self.run() 
[airflow-6fb4b8f58c-2jszc airflow]   File "/usr/local/lib/python3.7/site-packages/airflow/executors/kubernetes_executor.py", line 287, in run 
[airflow-6fb4b8f58c-2jszc airflow]     self.worker_uuid, self.kube_config) 
[airflow-6fb4b8f58c-2jszc airflow]   File "/usr/local/lib/python3.7/site-packages/airflow/executors/kubernetes_executor.py", line 323, in _run 
[airflow-6fb4b8f58c-2jszc airflow]     for event in list_worker_pods(): 
[airflow-6fb4b8f58c-2jszc airflow]   File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 177, in stream 
[airflow-6fb4b8f58c-2jszc airflow]     status=obj['code'], reason=reason) 
[airflow-6fb4b8f58c-2jszc airflow] kubernetes.client.exceptions.ApiException: (410) 
[airflow-6fb4b8f58c-2jszc airflow] Reason: Expired: too old resource version: 42945421 (43412510) 

Airflow version 1.10.12
k8s version v1.17.5

@gdtroszak

We're observing the same thing.

Airflow version 1.10.9
k8s API server version v1.15.12

@bhavaniravi
Contributor

Same issue with Airflow 1.10.10.

@gakhrejah
Author

Hi All,

Can anybody let me know how we can resolve this issue? It seems like this is still an open issue with Airflow.

@gdtroszak

This issue seems to outline a workaround. It essentially amounts to downgrading the k8s client to v11.0.0.

@potiuk
Member

potiuk commented Nov 17, 2020

@kaxil @ashb -> looks like we should limit the k8s client to <12.0.0 IMHO. WDYT ?
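
The cap suggested above can be read as a simple comparison on the installed client version. A rough sketch, assuming plain numeric `X.Y.Z` version strings with no pre-release suffixes; `parse_version` and `within_cap` are hypothetical helper names, and the `12.0.0` default is taken from the comment above:

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '11.0.0' into (11, 0, 0). Assumes purely numeric components."""
    return tuple(int(part) for part in version.split("."))


def within_cap(version: str, cap: str = "12.0.0") -> bool:
    """True when `version` is strictly below `cap`."""
    return parse_version(version) < parse_version(cap)


print(within_cap("11.0.0"))  # True
print(within_cap("12.0.0"))  # False
```

In a requirements file the same cap would be written as `kubernetes<12.0.0`.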

@kaxil kaxil added the provider:cncf-kubernetes Kubernetes provider related issues label Nov 17, 2020
@kaxil
Member

kaxil commented Nov 17, 2020

Yup, this will be fixed in 1.10.13. Already fixed in Master by #11974

@edikmkoyan

it isn't fixed in 1.10.13

@kaxil
Member

kaxil commented Dec 24, 2020

> it isn't fixed in 1.10.13

https://github.com/apache/airflow/blob/1.10.13/setup.py#L313

It is fixed, check the link I posted

@kaxil kaxil closed this as completed Dec 24, 2020
@edikmkoyan

> > it isn't fixed in 1.10.13
>
> https://github.com/apache/airflow/blob/1.10.13/setup.py#L313
>
> It is fixed, check the link I posted

I installed Airflow with the Helm chart (helm install) and didn't use setup.py at all, so I guess the Docker image it uses has the wrong version of the k8s client. I have this issue with Airflow 2.0.1.

@kaxil
Member

kaxil commented Feb 17, 2021

> > > it isn't fixed in 1.10.13
> >
> > https://github.com/apache/airflow/blob/1.10.13/setup.py#L313
> > It is fixed, check the link I posted
>
> I installed Airflow with the Helm chart (helm install) and didn't use setup.py at all, so I guess the Docker image it uses has the wrong version of the k8s client. I have this issue with Airflow 2.0.1.

setup.py is used when you, or the tool you use, run pip install apache-airflow. The Docker image seems to have the correct version; check below:

❯ docker run  -it apache/airflow:2.0.1-python3.6 bash
airflow@646279d8d88a:/opt/airflow$ pip freeze | grep kubernetes
apache-airflow-providers-cncf-kubernetes==1.0.0
kubernetes==11.0.0
airflow@646279d8d88a:/opt/airflow$
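
Note that `grep kubernetes` matches both the provider package and the client itself, so it can help to look the client up by its exact name. A small sketch that parses `pip freeze`-style output; `parse_freeze` is a hypothetical helper, and the sample text is copied from the output above:

```python
from typing import Dict


def parse_freeze(output: str) -> Dict[str, str]:
    """Map package name -> pinned version for every 'name==version' line."""
    pins: Dict[str, str] = {}
    for line in output.splitlines():
        name, sep, version = line.strip().partition("==")
        if sep:  # skip lines that are not simple 'name==version' pins
            pins[name.lower()] = version
    return pins


sample = """apache-airflow-providers-cncf-kubernetes==1.0.0
kubernetes==11.0.0"""

pins = parse_freeze(sample)
print(pins["kubernetes"])  # 11.0.0
```

Looking up the exact key `kubernetes` avoids confusing the client version with the provider version, which are numbered independently.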

@itayB
Contributor

itayB commented Jan 16, 2022

@kaxil and all other friends - this is something that is still happening in v2.2.3.
The installed version is:

$ pip freeze | grep kubernetes
apache-airflow-providers-cncf-kubernetes==3.0.1
kubernetes==21.7.0

Do I still need to downgrade that much?
I see that this limitation was removed recently - will that solve the issue in the upcoming Airflow version?

@potiuk
Member

potiuk commented Jan 16, 2022

> @kaxil and all other friends - this is something that is still happening in v2.2.3.

This issue has long been closed. If you see a similar issue (I assume with "resource version too old") and have some logs, please open a new issue with all the details, because it's very likely that this is a completely unrelated issue.

By only saying "this is something that is still happening in v2.2.3" you do not tell us what happens, what the logs show, how often it occurs, whether it is an intermittent issue, etc. There is no way we can even attempt to answer your question without knowing all the details.

So if you have a similar issue, please open a new issue and provide all the details - or better, if you are not sure whether this is an Airflow issue at all, open a GitHub Discussion instead (still provide all the details there) - maybe this is a K8S deployment issue that someone can help you solve there.
