Scheduler Memory Leak in Airflow 2.0.1 #14924
Comments
Thanks for opening your first issue here! Be sure to follow the issue template! |
I have the same problem. I tried every channel and method I could find but could not solve it! |
This issue also exists in version 1.10.*; however, in version 2.0.* it is more severe, and we no longer have the run_duration option, so we have to deploy our own cron jobs to restart the scheduler regularly. |
Until there is a fix or you find a specific reason, you could handle the OOM -> kubernetes/kubernetes#40157 There are a few initiatives 👍 |
Thanks for your comments. |
It seems like this is specific to the Kubernetes executor? It’d be awesome if you can confirm. |
Not completely. I use |
One observation I have is that the rate of memory leak increases with number of dags (irrespective of whether they are being run). It definitely has something to do with the dag parsing process. |
Yep. In my production environment, no problems have been found so far when running a small number of jobs. |
Please keep this thread on topic with the scheduler memory issue. For usage questions, please open threads in Discussions instead. |
In my production environment I use about 3 workers, 40 dags, and 1000 tasks. Many tasks keep |
I also got a similar issue with Airflow 2.0.1 when using the Kubernetes executor. Is there any update or timeline for this issue? |
We are also facing this issue right now. Any news? |
Hi, |
That is a cool finding. |
A few questions. Do you know which processes/containers hold the memory? Is it the scheduler (and which container)? Maybe you can also see the breakdown per process? I understand this is whole-cluster memory, and I am trying to wrap my head around it and see where it can come from, because getting memory back after deleting files is super weird behaviour (?). Do you simply run "rm *" in "/opt/airflow/logs/scheduler" and it drops immediately after? Or is there some delay involved? Do you do anything else than ...? Maybe you can also see how many airflow-related processes you have when the scheduler runs? And maybe their number grows and then drops when you delete the logs? |
I did nothing but an rm and it dropped almost immediately (sorry, the memory is reported by prometheus, so there is a delay, but what I can tell you is that it dropped within 15s after I did the rm) |
fun fact |
Ah right. The last line you wrote (container_memory_cache) is GOLD. That probably would explain it and it's NOT AN ISSUE.

When you open many files, Linux will basically use as much memory as it can for file caches. Whenever you read or write a file, the disk blocks are also kept in memory in case the file needs to be accessed again by any process. The kernel also marks blocks dirty when they change and writes them back to disk, after which they can be evicted. And when some process needs more memory than is available, the kernel evicts unused pages to free memory. So for any system that continuously writes log files which are not modified later, the cache memory will grow CONTINUOUSLY until the limit set by the kernel configuration. Depending on your kernel configuration (basically the kernel of your Kubernetes virtual machines under the hood), you will see the metrics growing continuously (up to the kernel limit).

You can limit the memory available to your Scheduler container to cap it "per container" (by giving it fewer memory resources), but however much memory you give to the scheduler container, it will be used for cache after some time (and will not be explicitly freed - but that's not a problem because the memory is effectively "free" - it's just used for cache and it can be freed immediately when needed). That would PERFECTLY explain why the memory drops immediately after the files are deleted - once those files are deleted, the cache for them should also be dropped by the system immediately.

Instead of looking at total memory used, you should look at the container_memory_working_set_bytes metric. It reflects the "actively used" memory. You can read more here: https://blog.freshtracks.io/a-deep-dive-into-kubernetes-metrics-part-3-container-resource-metrics-361c5ee46e66 You can also test it by running (from https://linuxhint.com/clear_cache_linux/):
In the container. This should drop your caches immediately without deleting the files. |
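To make the difference between total usage and the working set concrete, here is a minimal Python sketch. It assumes the commonly described definition of the working set as cgroup memory usage minus inactive file cache; the function names are hypothetical, and where the usage value and the memory.stat text come from (cgroup v1 or v2 files inside the container) is left to the reader.

```python
from typing import Dict


def parse_memory_stat(text: str) -> Dict[str, int]:
    """Parse 'key value' lines, as found in a cgroup memory.stat file."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def working_set_bytes(usage_bytes: int, stats: Dict[str, int]) -> int:
    """Approximate the working set as usage minus inactive file cache.

    Assumption: cgroup v1 reports "total_inactive_file" and cgroup v2
    reports "inactive_file"; a missing key is treated as 0.
    """
    inactive_file = stats.get("total_inactive_file", stats.get("inactive_file", 0))
    return max(usage_bytes - inactive_file, 0)
```

If total usage keeps growing while the value computed this way stays roughly flat, the growth is most likely page cache rather than memory held by the scheduler process.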
Actually, one thing that might even help to keep the "cache" memory down (though it has barely any consequences): do you happen to run any kind of automated log rotation? We have a "clean-logs.sh" script in the official image that can be run to clean the logs. This will have the side effect of freeing the Page Cache memory used by those files: https://github.com/apache/airflow/blob/main/scripts/in_container/prod/clean-logs.sh |
I can easily launch that with a Cron Job, yes, but even if I understand the cache thing, I don't get why it would cache files it doesn't even need to look at (when I create a dummy folder in the logs folder) |
Ah cool. So at least we figured that one out. Then it should be no problem whatsoever. One thing we COULD do is add a hint to the kernel to not keep the log files in the cache, if this is the Page Cache. There is no harm in general in letting this cache grow, but adding the hint might actually save us (and our users!) from diagnosing and investigating issues like this ;) |
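For readers who prefer to see the idea in code, here is a rough Python sketch of log cleanup by age. It is not the clean-logs.sh script linked above; the function name, the "*.log" pattern, and the retention parameter are all assumptions made for illustration.

```python
import time
from pathlib import Path
from typing import List, Optional


def remove_old_logs(log_dir: str, retention_days: int,
                    now: Optional[float] = None) -> List[str]:
    """Delete *.log files under log_dir whose mtime is older than retention_days."""
    current = time.time() if now is None else now
    cutoff = current - retention_days * 86400
    removed: List[str] = []
    # Materialise the listing first so deletions do not disturb the directory walk.
    for path in list(Path(log_dir).rglob("*.log")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(str(path))
    return removed
```

Run periodically (for example from a cron job), something like this keeps the log directory bounded, and the page cache held for the deleted files is released along with them.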
|
sync; echo 1 > /proc/sys/vm/drop_caches -> It's down 40m, and there's more than 200
Still - you can see whether it's process or cache memory that grows. For example, here you can see how to check the different types of memory used: https://phoenixnap.com/kb/linux-commands-check-memory-usage Could you check what kind of memory is growing? |
I ran ps auxww | grep airflow at different times and found that the memory increased from 100 MB to 220 MB. |
|
Can you please dump a few pmap outputs at different times and share them in a .tar.gz or something similar, @lixiaoyong12? Without grep, so that we can see everything. Ideally over a timespan of a few hours, so that we can see the trend and confirm this is not a "temporary" fluctuation. |
Just to explain, @lixiaoyong12 -> when you have a number of different dags and schedules, I think that - depending on frequency etc. - it is perfectly normal for the scheduler to use more memory over time initially. Generally speaking, it should stabilize after some time and then fluctuate up and down depending on what is happening. That's why I want to make sure this is not such a fluctuation. It would also help if you could run the cache cleanup periodically and see whether the memory returns to more or less the same value after some time. That would be most helpful! |
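One way to collect such a trend, sketched here in Python under the assumption of a Linux host with /proc available: read the resident set size (VmRSS) of a process from /proc/<pid>/status at intervals. The helper names are hypothetical, and this is only a complement to the pmap dumps requested above, not a replacement for them.

```python
import re
from typing import Optional


def rss_kib_from_status(status_text: str) -> Optional[int]:
    """Extract VmRSS (in kB) from the text of /proc/<pid>/status, if present."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid: int) -> Optional[int]:
    """Read the current resident set size of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return rss_kib_from_status(handle.read())
```

Calling read_rss_kib for the scheduler PID every few minutes over several hours gives a series that shows whether process memory keeps climbing or levels off. It excludes page cache, which is accounted to the container rather than to the process.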
Extends the standard python logging.FileHandler with advice to the Kernel to not cache the file in PageCache when it is written. While there is nothing wrong with such cache (it will be cleaned when memory is needed), it causes ever-growing memory usage when scheduler is running as it keeps on writing new log files and the files are not rotated later on. This might lead to confusion for our users, who are monitoring memory usage of Scheduler - without realising that it is harmless and expected in this case. Adding the advice to Kernel might help with not generating the cache memory growth in the first place. Closes: apache#14924
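As an illustration of the approach described in this change, here is a minimal Python sketch of a FileHandler subclass that passes POSIX_FADV_DONTNEED to the kernel when the log file is opened. The class name is hypothetical and this is not necessarily the code that was merged; os.posix_fadvise is only available on some Unix platforms, so the sketch falls back to plain FileHandler behaviour when the call is missing or fails.

```python
import logging
import os


class NoPageCacheFileHandler(logging.FileHandler):
    """FileHandler that hints the kernel not to keep the log file in page cache."""

    def _open(self):
        stream = super()._open()
        try:
            # offset=0 and len=0 mean "the whole file"; this is advice only,
            # the kernel is free to ignore it.
            os.posix_fadvise(stream.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
        except (AttributeError, OSError):
            # posix_fadvise is unavailable or the call failed: keep logging anyway.
            pass
        return stream
```

The handler is a drop-in replacement wherever logging.FileHandler is used. Because the call is only a hint, actual cache usage should still be checked with the metrics discussed earlier in the thread.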
@potiuk |
@potiuk |
🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 |
Thanks a lot ! That might really help with user confusion! |
* Advises the kernel to not cache log files generated by Airflow Extends the standard python logging.FileHandler with advice to the Kernel to not cache the file in PageCache when it is written. While there is nothing wrong with such cache (it will be cleaned when memory is needed), it causes ever-growing memory usage when scheduler is running as it keeps on writing new log files and the files are not rotated later on. This might lead to confusion for our users, who are monitoring memory usage of Scheduler - without realising that it is harmless and expected in this case. Adding the advice to Kernel might help with not generating the cache memory growth in the first place. Closes: #14924
* Advises the kernel to not cache log files generated by Airflow. Closes: #14924 (cherry picked from commit 43f595f)
I use helm 1.6.0 and airflow 2.2.5. Why does memory increase continuously? It affects both the scheduler and the triggerer, but not the webserver. |
What kind of memory is it? See the whole thread. There are different kinds of memory, and you might be observing cache memory growth for whatever reason. Depending on the type of memory, it might or might not be a problem. But you need to investigate it in detail. No one is able to diagnose it without you investigating, based on the thread. The thread has all the relevant information. You need to see which process is leaking - whether it is airflow, the system, or some other process. BTW, I suggest you open a new discussion with all the details. There is little value in commenting on a closed issue. Remember also that this is a free forum where people help when they can, and their help is much more efficient if you give all the information and show that you've done your part. There are also companies offering help for Airflow for money, and they can likely do the investigation for you. |
Apache Airflow version: 2.0.1
Kubernetes version (if you are using kubernetes) (use kubectl version): v1.17.4
Environment: Dev
What happened:
After running fine for some time my airflow tasks got stuck in scheduled state with below error in Task Instance Details:
"All dependencies are met but the task instance is not running. In most cases this just means that the task will probably be scheduled soon unless: - The scheduler is down or under heavy load If this task instance does not start soon please contact your Airflow administrator for assistance."
What you expected to happen:
I restarted the scheduler and then it started working fine. When I checked my metrics, I realized the scheduler has a memory leak: over the past 4 days it has reached up to 6GB of memory utilization.
In version >2.0 we don't even have the run_duration config option to restart scheduler periodically to avoid this issue until it is resolved.
How to reproduce it:
I saw this issue in multiple dev instances of mine, all running Airflow 2.0.1 on kubernetes with KubernetesExecutor.
Below are the configs that I changed from the default config.
max_active_dag_runs_per_dag=32
parallelism=64
dag_concurrency=32
sql_alchemy_pool_size=50
sql_alchemy_max_overflow=30
Anything else we need to know:
The scheduler memory leak occurs consistently in all instances I have been running. The scheduler's memory utilization keeps growing.