Log files are still being cached causing ever-growing memory usage when scheduler is running #27065
Comments
The RotatingFileHandler is used when you enable it via `CONFIG_PROCESSOR_MANAGER_LOGGER=True`, and it exhibits the same behaviour the FileHandler had when it comes to caching the file at the kernel level. While this is harmless (the cache will be freed when needed), it is misleading for those who are trying to understand memory usage by Airflow. The fix is to add a custom non-caching RotatingFileHandler, similarly to #18054. Note that it will require manually modifying local settings if the settings were created before this change. Fixes: #27065
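For readers who want a picture of what such a handler could look like, here is a minimal sketch. The class name and details are illustrative, not the exact code merged in #18054, and it assumes a POSIX system where `os.posix_fadvise` is available:

```python
import logging.handlers
import os


class NonCachingRotatingFileHandler(logging.handlers.RotatingFileHandler):
    """Illustrative rotating handler that asks the kernel not to cache the log file."""

    def emit(self, record):
        super().emit(record)
        try:
            # Advise the kernel that the pages just written will not be read again,
            # so they can be evicted from the page cache. This is only a hint and is
            # safe to skip where it is unsupported (e.g. on macOS).
            if self.stream is not None:
                os.posix_fadvise(self.stream.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
        except (AttributeError, OSError):
            pass
```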
@potiuk the cache memory is still growing 😿
Maybe the rotating file handler has another place where it copies files and leaves them behind. Not the end of the world (as you know, this is no harm at all and perfectly normal to happen). Maybe I will take a look soon (or maybe you can, @zahchliu - you could see how I've done it, iterate on it, verify it in your test system, and make a PR after you test it? How about that?). Also, there are ways you can check whether this might be the cause: just delete the rotated files and see if that causes a drop in cache memory used.
Would be a great contribution back :)
You can always drop the whole cache to verify what causes it: https://linuxhint.com/clear_cache_linux/ Also, you can do some trial and error to see which files are in the cache, as explained in this answer: it seems it is not easy to get a list of the files which contribute to the cache, but if you have some guesses you might try to find out by using fntools.
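As a hedged illustration of the "drop the whole cache" check described above (it must run as root, and it is a one-off diagnostic, not a fix):

```python
import subprocess

# Flush dirty pages to disk first, then ask the kernel to drop the clean page
# cache. Writing "1" drops the page cache only, "2" drops dentries/inodes,
# "3" drops both.
subprocess.run(["sync"], check=True)
with open("/proc/sys/vm/drop_caches", "w") as f:
    f.write("1\n")
```

If the cached-memory counter falls sharply right after this (or right after deleting the rotated log files), that points at which files were being attributed to the cache.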
Very much so. This is the choice of using NFS to store logs :)
silly rename 🤣
Nothing we can do about it :). But I am not sure if those are the culprits - according to the descriptions, those should be removed when Airflow stops keeping the file open, unless the client crashes.
I also mount default
i'm also using AWS EFS 🤝 i think i'll try 1 (2 seems redundant if we're moving it out of NFS), they seem to be the easiest except 5, which involves educating all current/future maintainers to understand memory nuances 😅
BTW, I've heard VERY bad things about EFS when it is used to share DAGs. It has a profound impact on the stability and performance of Airflow if you have a big number of DAGs, unless you pay big bucks for IOPS. I've heard that from many people. This is the moment when I usually STRONGLY recommend GitSync instead: https://medium.com/apache-airflow/shared-volumes-in-airflow-the-good-the-bad-and-the-ugly-22e9f681afca
As counterintuitive as it is, I know what you are talking about :)
It always depends on configuration and monitoring. I personally had this issue, possibly in Airflow 2.1.x, and I do not know whether it was actually related to Airflow itself or to something else. Working with EFS definitely takes more effort than GitSync. Just for someone who might find this thread in the future, these things might help with EFS performance degradation:
- Disable saving Python bytecode inside the NFS (AWS EFS) mount.
- Throughput in Bursting mode at first looks like a miracle, but when all the bursting capacity goes to zero it can turn your life into hell. Each newly created EFS share has about 2.1 TB of BurstingCreditBalance. What could be done here:
This is very close to what I've heard! Good one @Taragolis! And yeah, PYTHONDONTWRITEBYTECODE is also my typical recommendation.
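For completeness, a tiny hedged sketch of what that recommendation looks like in practice (how you set it depends entirely on your deployment; the entrypoint variant is just one option):

```python
# Option 1: export PYTHONDONTWRITEBYTECODE=1 in the container entrypoint so the
# interpreter never writes .pyc files into the EFS-mounted folders.
#
# Option 2: the per-interpreter equivalent, set before any DAG modules are
# imported (for example from an illustrative sitecustomize.py on the PYTHONPATH):
import sys

sys.dont_write_bytecode = True  # same effect as PYTHONDONTWRITEBYTECODE=1
```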
So I guess the quest should continue :). But let me repeat again - if you have anything that creates files continuously, those files are not deleted, and the kernel is not advised to skip caching them, they will increase the cache memory used. You will likely never get it to zero growth over time. Even just SSH-ing to your system and saving your shell history when you type a command will increase the cache used. And this is normal. So you are likely chasing a red herring.
worse than a red herring, this is a mirage 😆

```python
LOGGING_CONFIG["handlers"]["processor_manager"].update(
    {
        "maxBytes": 10485760,  # 10 MB
        "backupCount": 3,
    }
)
```

this makes the cache memory usage cap at 40~50 MB
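For anyone wondering where an override like this typically lives: a hedged sketch of one common way to wire it up, assuming `CONFIG_PROCESSOR_MANAGER_LOGGER=True` is set (so the `processor_manager` handler exists) and using an illustrative module path:

```python
# config/custom_log_config.py  -- illustrative module path
from copy import deepcopy

from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG

LOGGING_CONFIG = deepcopy(DEFAULT_LOGGING_CONFIG)

# Cap the dag_processor_manager log so the page cache it occupies stays bounded.
LOGGING_CONFIG["handlers"]["processor_manager"].update(
    {
        "maxBytes": 10485760,  # 10 MB per file
        "backupCount": 3,      # keep at most 3 rotated files
    }
)
```

and point Airflow at it, for example via `AIRFLOW__LOGGING__LOGGING_CONFIG_CLASS=config.custom_log_config.LOGGING_CONFIG`.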
The RotatingFileHandler is used when you enable it via `CONFIG_PROCESSOR_MANAGER_LOGGER=True`, and it exhibits the same behaviour the FileHandler had when it comes to caching the file at the kernel level. While this is harmless (the cache will be freed when needed), it is misleading for those who are trying to understand memory usage by Airflow. The fix is to add a custom non-caching RotatingFileHandler, similarly to #18054. Note that it will require manually modifying local settings if the settings were created before this change. Fixes: #27065 (cherry picked from commit 126b7b8)
Apache Airflow version
2.4.1
What happened
My Airflow scheduler memory usage started to grow after I turned on the `dag_processor_manager` log by doing `export CONFIG_PROCESSOR_MANAGER_LOGGER=True`, see the red arrow below.
By looking closely at the memory usage as mentioned in #16737 (comment), I discovered that it was the cache memory that keeps growing:
Then I turned off the `dag_processor_manager` log, and memory usage returned to normal (not growing anymore, steady at ~400 MB).
This issue is similar to #14924 and #16737. This time the culprit is the rotating logs under `~/logs/dag_processor_manager/dag_processor_manager.log*`.
What you think should happen instead
Cache memory shouldn't keep growing like this
How to reproduce
Turn on the `dag_processor_manager` log by doing `export CONFIG_PROCESSOR_MANAGER_LOGGER=True` in the `entrypoint.sh` and monitor the scheduler memory usage.
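One hedged way to watch the cache portion of the container's memory over time (the file paths are assumptions and differ between cgroup v1 and v2 deployments):

```python
import time


def cached_bytes() -> int:
    """Return the page-cache usage reported by the container's cgroup, in bytes."""
    candidates = (
        ("/sys/fs/cgroup/memory.stat", "file"),          # cgroup v2
        ("/sys/fs/cgroup/memory/memory.stat", "cache"),  # cgroup v1
    )
    for path, key in candidates:
        try:
            with open(path) as stat_file:
                for line in stat_file:
                    name, value = line.split()
                    if name == key:
                        return int(value)
        except FileNotFoundError:
            continue
    raise RuntimeError("no known cgroup memory.stat location found")


if __name__ == "__main__":
    while True:
        print(f"page cache: {cached_bytes() / (1024 * 1024):.1f} MiB")
        time.sleep(60)
```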
Operating System
Debian GNU/Linux 10 (buster)
Versions of Apache Airflow Providers
No response
Deployment
Other Docker-based deployment
Deployment details
k8s
Anything else
I'm not sure why the previous fix #18054 has stopped working 🤔
Are you willing to submit PR?
Code of Conduct