Expose malloc statistics #1275
Hey Antoine! I don't know either but it looks kinda too low-levelish. Do you have a use case?

My use case was debugging memory use here: https://bugs.python.org/issue33444
Not having used `mallinfo()` myself, just to understand the use case a bit more: are you hoping to call it directly?

What I mean is that a framework (like Dask) which already exposes memory usage statistics (such as RSS) could expose additional useful information thanks to `mallinfo()`.
I've been trying to track down memory leaks when using Pandas in parallel in situations that look similar to the bug report pointed to by @pitrou. I agree that increased visibility would be of value.

To rephrase my question in the Dask context, does this make sense to call from the nanny process or does it only make sense in the worker process?
It only makes sense in the worker process, IMO.
If I'm understanding this right, …
As for the usefulness of this, I'm skeptical. We are already able to determine memory leaks by using RSS/USS.
See https://bugs.python.org/issue33444. A higher RSS doesn't necessarily mean a genuine leak; it can also be heap fragmentation.
I see, perhaps …
You are right, RSS is an approximation. In fact I bumped into false positives for a long time before introducing USS, which apparently either solved or mitigated the issue (not sure): see psutil/tests/test_memory_leaks.py, lines 175 to 182 @ 2ffb0ec.

I say "not sure" because psutil's memory leak test script allows some tolerance: see psutil/tests/test_memory_leaks.py, lines 135 to 136 @ 2ffb0ec.
Would `mallinfo()` be any better? With that said, I'm not against exposing it.
In Python's case USS is not better than RSS for finding out memory fragmentation.

If we figure out what …
To push this even further: test_memory_leaks.py tries hard to detect a function's memory leaks by: … Since that's not straightforward, maybe we can have a utility function like this:

```python
>>> # signature
>>> test_leak(callable, times=1000, warmup_times=10, tolerance=4096)
>>>
>>> # success (returns None)
>>> test_leak(fun)
>>>
>>> # failure
>>> test_leak(fun)
AssertionError("46523 extra process memory after 1000 calls")
```

Depending on how reliable such a function turns out to be, it can live either in …
The Windows counterpart appears to be called …

List of useful links/info: …

See comment about …
Preamble
========

We have a [memory leak test suite](https://github.com/giampaolo/psutil/blob/e1ea2bccf8aea404dca0f79398f36f37217c45f6/psutil/tests/__init__.py#L897), which calls a function many times and fails if the process memory increased. We do this in order to detect missing `free()` or `Py_DECREF` calls in the C modules. When we do, we have a memory leak.

The problem
===========

A problem we've been having for probably over 10 years is false positives, because the memory fluctuates. Sometimes it may increase (or even decrease!) due to how the OS handles memory, Python's garbage collector, the fact that RSS is an approximation, and who knows what else. So thus far we tried to compensate for that with the following logic:

- warmup (call fun 10 times)
- call the function many times (1000)
- if memory increased before/after calling the function 1000 times, then keep calling it for another 3 secs
- if it still increased at all (> 0) then fail

This logic didn't really solve the problem, as we still had occasional false positives, especially lately on FreeBSD.

The solution
============

This PR changes the internal algorithm so that in case of failure (mem > 0 after calling fun() N times) we retry the test for up to 5 times, increasing N (repetitions) each time, and consider it a failure only if the memory **keeps increasing** between runs. So for instance, here's a legitimate failure:

```
psutil.tests.test_memory_leaks.TestModuleFunctionsLeaks.test_disk_partitions ...
Run #1: extra-mem=696.0K, per-call=3.5K, calls=200
Run #2: extra-mem=1.4M, per-call=3.5K, calls=400
Run #3: extra-mem=2.1M, per-call=3.5K, calls=600
Run #4: extra-mem=2.7M, per-call=3.5K, calls=800
Run #5: extra-mem=3.4M, per-call=3.5K, calls=1000
FAIL
```

If, on the other hand, the memory increased on one run (say 200 calls) but decreased on the next run (say 400 calls), then it clearly is a false positive: memory consumption may be > 0 on the second run, but if it's lower than the previous run with fewer repetitions, then it cannot possibly represent a leak (just a fluctuation):

```
psutil.tests.test_memory_leaks.TestModuleFunctionsLeaks.test_net_connections ...
Run #1: extra-mem=568.0K, per-call=2.8K, calls=200
Run #2: extra-mem=24.0K, per-call=61.4B, calls=400
OK
```

Note about mallinfo()
=====================

Aka #1275. `mallinfo()` on Linux is supposed to provide memory metrics about how many bytes get allocated on the heap by `malloc()`, so it's supposed to be way more precise than RSS and also [USS](http://grodola.blogspot.com/2016/02/psutil-4-real-process-memory-and-environ.html). In another branch where I exposed it, I verified that fluctuations still occur even when using `mallinfo()`, though less often. So even `mallinfo()` would not grant 100% stability.
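The retry strategy described above can be sketched like this. This is a simplification, not psutil's actual code; `measure` stands in for whatever memory metric is used (e.g. USS), and `leak_check` is a hypothetical name:

```python
import gc

def leak_check(fun, measure, base_calls=200, max_runs=5):
    # Re-run the test with an increasing number of repetitions and
    # fail only if the extra memory keeps increasing run after run.
    prev_extra = None
    for run in range(1, max_runs + 1):
        calls = base_calls * run  # 200, 400, 600, 800, 1000
        gc.collect()
        before = measure()
        for _ in range(calls):
            fun()
        gc.collect()
        extra = measure() - before
        if extra <= 0:
            return  # no growth at all: OK
        if prev_extra is not None and extra < prev_extra:
            # Grew less than a shorter previous run: a fluctuation,
            # not a leak (e.g. 568K over 200 calls, then 24K over 400).
            return
        prev_extra = extra
    raise AssertionError(
        "memory kept increasing across %d runs" % max_runs)
```

The key design point is the early-exit on `extra < prev_extra`: a genuine per-call leak must grow roughly linearly with the number of calls, so any run that accumulates less extra memory than a shorter run rules a leak out.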
I have a real-life use case in dask.distributed. The package really struggles right now to tell apart genuine memory leaks from freed memory that hasn't been returned to the OS yet. This measure is used in heuristics for memory rebalancing and OOM safety-net systems.

Using the demo workbook attached to dask/distributed#4774 I can reliably produce such a "leak", where I allocate a bunch of large-ish numpy arrays (160 KiB each) and then free them after a few seconds. After that operation, on my GUI I read: RSS: 1244 MiB.

full_memory_info says nothing interesting; note how uss and rss are almost the same:

```
pfullmem(rss=1304870912, vms=2051538944, shared=31059968, text=2240512, lib=0, data=1326317568, dirty=0, uss=1274175488, pss=1275800576, swap=0)
```

However, if I run this on the process:

```python
import ctypes

libc = ctypes.CDLL("libc.so.6")
libc.malloc_stats()  # prints allocator statistics to stderr
```

I read: … which is exactly the information I need. When I run my rebalancing and anti-OOM algorithms, if I had this information I could consider 823 MiB instead of 1244 MiB, knowing that the rest will be reused at the next malloc.

MacOSX Big Sur has exactly the same problem. I don't know where to get the same information there, though.
On my Linux system I get this: …

Which one of these values should we expose, in your opinion? Which one is useful to detect a memory leak?
For the record, there's an old experimental branch where I exposed `mallinfo()`.
Looking at `malloc_info()`: …

The data returned, though, is undocumented and completely different than …

In summary we have: …

These must be the worst designed APIs in Linux.
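For reference, `malloc_info(3)` can be captured from Python via ctypes by pointing its XML report at an in-memory stream. This is a glibc-only sketch: `open_memstream(3)` and the XML format are glibc specifics, and the meaning of the XML fields is, as noted, undocumented:

```python
import ctypes

libc = ctypes.CDLL("libc.so.6")

# open_memstream(3) gives us a FILE* whose contents land in a
# malloc'ed buffer once the stream is flushed/closed.
buf = ctypes.c_char_p()
size = ctypes.c_size_t()
libc.open_memstream.restype = ctypes.c_void_p
stream = libc.open_memstream(ctypes.byref(buf), ctypes.byref(size))

# malloc_info(options, stream): options must be 0; writes XML.
libc.malloc_info(0, ctypes.c_void_p(stream))
libc.fclose(ctypes.c_void_p(stream))

xml_report = buf.value.decode()  # e.g. '<malloc version="1">...'
libc.free(buf)                   # the buffer is ours to free
```

The report nests per-heap `<heap>` elements with `<size>`, `<total>` and `<system>` entries, so extracting a single number still requires parsing and interpreting undocumented fields.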
The difference between these two is the amount of memory that will get released if you invoke `malloc_trim(0)`, or that will likely be reused if you invoke `malloc()`: …

I'm unsure what "system bytes" means. Note that it's slightly lower than USS.
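The trim itself can also be triggered from Python via ctypes (glibc-only; per the man page, `malloc_trim(3)` returns 1 if some memory was actually released back to the system, 0 otherwise):

```python
import ctypes

libc = ctypes.CDLL("libc.so.6")

# Allocate and free some memory so the heap has something to give back.
garbage = [bytearray(2 ** 20) for _ in range(50)]
del garbage

# malloc_trim(pad): ask glibc to return free heap memory to the OS,
# keeping at most `pad` bytes of slack at the top of the heap.
released = libc.malloc_trim(0)
print("memory released to the OS:", bool(released))
```

This is exactly the operation whose effect the "system bytes" minus "in use bytes" gap predicts: after a successful trim, RSS should drop toward the in-use figure.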
@giampaolo glibc 2.33 (released 2021-02-01) adds `mallinfo2()`, which replaces `mallinfo()` and supports values > 2 GiB.
Notes: …

I'm opening bug reports for both issues. [EDIT] …
Sweet! I missed that. It's so recent we can't assume we can rely on it, though. When …
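A sketch of how a caller (or psutil itself) could prefer `mallinfo2()` and fall back to the legacy, int-based `mallinfo()` on older glibc. Field names are from the man pages; this is glibc-only and not psutil code:

```python
import ctypes

libc = ctypes.CDLL("libc.so.6")

FIELDS = ("arena", "ordblks", "smblks", "hblks", "hblkhd",
          "usmblks", "fsmblks", "uordblks", "fordblks", "keepcost")

class Mallinfo2(ctypes.Structure):
    # glibc >= 2.33: all counters are size_t, so values > 2 GiB work.
    _fields_ = [(name, ctypes.c_size_t) for name in FIELDS]

class Mallinfo(ctypes.Structure):
    # Legacy variant: plain ints, which overflow past 2 GiB.
    _fields_ = [(name, ctypes.c_int) for name in FIELDS]

try:
    libc.mallinfo2.restype = Mallinfo2
    info = libc.mallinfo2()
except AttributeError:  # symbol missing: glibc < 2.33
    libc.mallinfo.restype = Mallinfo
    info = libc.mallinfo()

print("allocated (uordblks):", info.uordblks)
print("free on heap (fordblks):", info.fordblks)
```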
Does anybody have a clue about where to get the same information on Mac? The problem there seems to be even more pronounced, e.g. RSS hardly ever deflates.

Ouch! I just saw your edit. It seems all APIs are bugged one way or another. :-\

To my knowledge malloc_info isn't buggy...
I don't know if it's exactly in the scope for psutil, but just in case: it could be useful to expose per-platform `malloc()` statistics, for example using `mallinfo()` on GNU/Linux: http://man7.org/linux/man-pages/man3/mallinfo.3.html
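A minimal ctypes sketch of what calling `mallinfo()` from Python looks like on glibc (field layout from the man page; note the plain `int` fields overflow past 2 GiB, which is what the later `mallinfo2()` fixes):

```python
import ctypes

class Mallinfo(ctypes.Structure):
    # struct mallinfo from <malloc.h> (glibc): all plain ints.
    _fields_ = [(name, ctypes.c_int) for name in (
        "arena", "ordblks", "smblks", "hblks", "hblkhd",
        "usmblks", "fsmblks", "uordblks", "fordblks", "keepcost")]

libc = ctypes.CDLL("libc.so.6")
libc.mallinfo.restype = Mallinfo

m = libc.mallinfo()
print("heap size (arena):", m.arena)
print("allocated (uordblks):", m.uordblks)
print("free, reusable by malloc (fordblks):", m.fordblks)
```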