Scheduler pods memory leak on airflow 2.3.2 #27589
Comments
One more thing that was observed: on checking the memory utilisation of the processes running on the pod using the top command, I saw that no more than 0.5-2% of memory was utilised. I have a screenshot but don't see any option to attach it here.
Hey @bharatk-meesho, are you sure that the "memory leaking" is not related to one of these issues/PRs?
Thanks @Taragolis, I will take a look at these links. Below is a Grafana graph of scheduler pod memory increasing, although all DAGs are paused and there have been no changes in the system whatsoever. Another weird thing I noticed is that the memory utilisation of all processes combined from top doesn't match what I get from running the kubectl command. Attaching both screenshots below. The HPA command output shows about 55% memory utilisation, and it keeps increasing as far as I have observed, while via top it doesn't seem to be more than 2% after SSHing into the scheduler pod.
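For reference, a rough way to compare those two views looks like the sketch below. The namespace, release and container names follow the official chart's defaults and are assumptions, as is the cgroup v1 path; this is not taken from the reporter's setup.

```bash
# What the HPA / metrics-server sees for the scheduler pod (working set):
kubectl top pod -n airflow -l component=scheduler --containers

# Per-process resident memory inside the scheduler container
# (needs ps/procps to be present in the image):
kubectl exec -n airflow deploy/airflow-scheduler -c scheduler -- \
  ps -eo pid,rss,comm --sort=-rss

# What the container's cgroup accounts for (rss vs cache), cgroup v1 path:
kubectl exec -n airflow deploy/airflow-scheduler -c scheduler -- \
  cat /sys/fs/cgroup/memory/memory.stat
```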
Also, it would be nice to know what type of memory is being "leaked"; it might be some caches (you can find info in the other issues). And does it cause OOM?
I think it will cause OOM, but I also think this is caused by one of your modifications - Airflow itself does not seem to use more memory; it looks like your modifications do. I recommend you install completely "vanilla" Airflow - Helm chart and image - and see if you see the same growth. If not (which I expect), you can apply your modifications one by one and see which one causes the problem. This is the best advice I can give on diagnosing these kinds of issues; it's extremely hard to know what it is without applying such a technique. It can be anything - scripts running in the background, for example. The fact that you do not see it in Airflow's container would suggest that it might be another container - an init container, or maybe even a liveness probe running and leaving something behind.
I checked the memory stats on vanilla Airflow (directly used the image provided by Airflow for 2.3.2 without any modifications) and still saw memory increasing. Going to try 2.4.2 to see if this is fixed there.
Do you know which process eats memory there? Are you using a completely standard deployment with the completely standard Helm chart, or do you have some modifications of your own?
@potiuk how do I know which process eats memory? I am changing two things from the official deployment process; below is my deployment process.
I am using some libraries for extra functionality and for troubleshooting (since things like top are also not installed in the pod; see the illustrative sketch after this comment).
Please let me know if there are any other details I should share. I feel I am not modifying much from the official Helm chart/image, so these memory issues shouldn't be occurring.
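The reporter's actual build steps are not shown above; purely as a hypothetical illustration of that kind of customisation, extending the official image with debugging tools might look roughly like this (image tag and package list are made up, not taken from this thread):

```bash
# Hypothetical custom image: official Airflow base plus a couple of debug tools.
cat > Dockerfile <<'EOF'
FROM apache/airflow:2.3.2
USER root
RUN apt-get update \
    && apt-get install -y --no-install-recommends procps htop \
    && rm -rf /var/lib/apt/lists/*
USER airflow
EOF
docker build -t my-registry/airflow:2.3.2-debug .
```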
Do you see the same leaks WITHOUT modifying anything from basic Airflow? I do not know if your modifications changed it, but just comparing it with a "baseline" might give a hint. The whole point about debugging such problems is that they might be caused by changes that do not look suspicious, and the BEST way of debugging them is to do something called bisecting, which is a valid debugging technique: start from the vanilla chart and image as a baseline, then re-apply your changes one at a time.
Now you are iterating over the changes you made until you find the one single change that causes the leak. This is usually the fastest and most effective way of finding the root cause. If you have no obvious suspect, this is the ONLY way. Even if it feels like "not much", I've seen totally unexpected things happen with an "innocuous" change. A rough sketch of that bisection workflow is shown below.
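(Release name, namespace and override file names below are placeholders, not taken from this thread; it assumes the apache-airflow Helm repo is already added.)

```bash
# Hypothetical bisection loop: deploy the vanilla chart first, watch the
# scheduler's memory for a while, then re-apply overrides one at a time.
helm upgrade --install airflow apache-airflow/airflow -n airflow --create-namespace
kubectl top pod -n airflow -l component=scheduler   # watch the baseline for a few hours

helm upgrade airflow apache-airflow/airflow -n airflow -f override-1.yaml
kubectl top pod -n airflow -l component=scheduler   # did the growth come back?

helm upgrade airflow apache-airflow/airflow -n airflow -f override-1.yaml -f override-2.yaml
# ...continue until the single change that reintroduces the growth is identified
```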
If I only knew a straightforward answer, I would give it to you. I usually use top, or better htop, to observe what's going on and then try to dig deeper if I see anything suspicious. But this has its limits due to the complex nature of memory usage on Linux. I am afraid I am not able to give a simple answer to "how to check memory" - depending on which memory you observe leaking there might be several different places to look, and none of them have simple recipes, simply because memory in Linux is an extremely complex subject - much more complex than you might think. There are various ways applications and the kernel use memory, and a simple "do that" solution does not exist. If you search for how to do it you will find plenty of "how you can approach it" guides - the first example I found is https://www.linuxfoundation.org/blog/blog/classic-sysadmin-linux-101-5-commands-for-checking-memory-usage-in-linux - you will not get direct answers, but you will get a few tools you can try, to see if any of them gives you some answers.

In many cases the kernel memory used will grow but it won't be attributed to any single process (even if it originated from one). At other times the memory used by different processes will be partially (seemingly) duplicated, because they share "copy on write" memory from when the processes forked and most of it is still shared. Observing the various memory values in htop (suggested over top) should give you some clues. But it can be the kernel that is leaking memory, and you will not see that there - https://unix.stackexchange.com/questions/97261/how-much-ram-does-the-kernel-use has some other debugging techniques for it. If you do not see any process leaking memory, then most likely the kernel is leaking it - which might mean many things, including, for example, your K8S instance having a shared volume with a buggy library (which is nothing Airflow would be aware of). Or even the monitoring software (i.e. a Grafana agent) might cause it. Hard to say, and I am afraid I cannot help more than "try to pin-point the root cause". This is also why pin-pointing is very important and often the fastest way to debug such things. No one will be able to "guess" what it is just by looking at the modifications, but getting it down to the single change causing the leak might help in getting closer to where to look.
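For concreteness, a few of the standard tools such guides point at look like this (whether they are installed in the Airflow image by default is a separate question):

```bash
# Run inside the container and/or on the node:
free -m                                                 # used vs buff/cache vs available
cat /proc/meminfo                                       # kernel-level breakdown (Slab, PageTables, ...)
ps -eo pid,ppid,rss,vsz,comm --sort=-rss | head -n 20   # per-process resident memory
cat /sys/fs/cgroup/memory/memory.stat                   # cgroup v1: rss vs cache vs inactive_file
slabtop -o | head -n 20                                 # kernel slab caches (on the node, needs root)
```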
Another option for pin-pointing is to selectively disable certain processes and compare the usage before/after. For example, if you see a pod running with multiple processes in it, you can remove certain processes in some containers - changing an entrypoint command to run "sleep 3600" replaces whatever would normally run with something that for sure does not take memory - and then you can see which process caused it. On top of that, switching Airflow back to the original configuration and "vanilla" state might tell you, for example, that your configuration is the problem, and re-applying configuration step by step (including logging handlers, default values for the hostname check and many others) might help with pin-pointing.

It's almost certain that Airflow in the vanilla state has no leak - it would be far too easy to see - so it must be something on your side. The growth you show is pretty catastrophic, and it would force most Airflow installations to restart the scheduler every day or so - which does not happen. I also suggest (if you get to vanilla and the memory is still growing) testing different Airflow versions - maybe what you see is a mistake - and trying various versions might simply give more answers. And finally, if you see it in several Airflow versions, I would run other experiments - replacing the scheduler with other components, etc. Unfortunately I cannot have access to your system to play with it, but if I were you, this is what I'd do.
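A sketch of that "replace the process with something inert" idea, assuming the official chart's default names (an airflow-scheduler deployment with a scheduler container in the airflow namespace):

```bash
# Hypothetical: make the scheduler container just sleep, then see whether the
# pod's memory still grows (which would point away from the Airflow process).
# Note: the chart's scheduler livenessProbe may restart the container, so it
# may need to be disabled for this experiment as well.
kubectl patch deployment airflow-scheduler -n airflow --type=json -p='[
  {"op": "add", "path": "/spec/template/spec/containers/0/command", "value": ["sleep", "3600"]},
  {"op": "add", "path": "/spec/template/spec/containers/0/args", "value": []}
]'
kubectl top pod -n airflow -l component=scheduler --containers

# Roll back afterwards:
kubectl rollout undo deployment/airflow-scheduler -n airflow
```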
I am still not sure what was wrong, but trying Airflow 2.4.2 solved my issue. @BobDu maybe you can also try 2.4.2.
Yes. I suggest upgrading to the latest version and seeing if it still happens. I also think looking at the overall memory usage alone is not enough. I read a bit about the subject and the problem is far from simple. Kubernetes not only runs your application but also runs monitoring and tweaks memory use for pods via the kubelet; additionally, when you run monitoring/Prometheus, the agent inside every pod might impact the memory used by caching some stuff. You can read a lot about it, for example, in kubernetes-monitoring/kubernetes-mixin#227. Even if you use Grafana, Grafana itself can cause the increase, depending on the version. I think there is not much we can do in Airflow with WSS reporting showing those numbers unless someone can dig deeper and pin-point the memory usage to Airflow, and not to other components (especially monitoring impacting the memory usage). I suggest upgrading to the latest versions of everything you have (k8s, Grafana, Prometheus, Airflow) and trying again.
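One way to see what the reported numbers are built from, assuming jq is available and the scheduler pod name starts with airflow-scheduler (both are assumptions, and the node name is a placeholder):

```bash
# The kubelet Summary API reports both workingSetBytes (what kubectl top and a
# memory-based HPA see, and which includes reclaimable page cache) and rssBytes:
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" \
  | jq '.pods[] | select(.podRef.name | startswith("airflow-scheduler"))
        | {name: .podRef.name, memory: .memory}'
```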
RSS memory continues to increase, the same as before. I will continue to investigate this issue, happy to share any progress.
This issue has been automatically marked as stale because it has been open for 30 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author. |
This issue has been closed because it has not received response from the issue author. |
@BobDu any luck with the investigation?
Apache Airflow version
Other Airflow 2 version (please specify below)
What happened
I am using Airflow 2.3.2 on EKS 1.22; the Airflow service on EKS was launched by making minor modifications to the official Helm chart regarding replicas and resources. It is observed that the memory utilisation of the scheduler pod keeps increasing as the age of the scheduler pod increases. This is observed even when all DAGs are paused and nothing is running on Airflow. Different versions (config changes) of the official Helm chart were used to spin up different Airflow services in the EKS cluster where this issue was observed.
What you think should happen instead
This shouldn't have happened; the memory should have remained almost the same. The increase in memory caused the replicas to increase over a few days, as replication was set up based on memory utilisation, even though there were no changes in the env itself and it was not serving any traffic.
How to reproduce
Should be reproducible by using the official Airflow Helm chart with 2.3.2 on AWS EKS 1.22.
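A minimal reproduction sketch, assuming the official chart's airflowVersion/defaultAirflowTag values and placeholder release/namespace names (not the reporter's actual install command):

```bash
helm repo add apache-airflow https://airflow.apache.org
helm repo update
helm install airflow apache-airflow/airflow \
  --namespace airflow --create-namespace \
  --set airflowVersion=2.3.2 \
  --set defaultAirflowTag=2.3.2
# Leave all DAGs paused and watch scheduler memory over several hours:
watch -n 300 kubectl top pod -n airflow -l component=scheduler
```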
Operating System
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)" NAME="Debian GNU/Linux" VERSION_ID="11" VERSION="11 (bullseye)" VERSION_CODENAME=bullseye ID=debian
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
No response
Anything else
No response
Are you willing to submit PR?
Code of Conduct