
Large file/low memory mode #586

Open · sevaa opened this issue Dec 7, 2024 · 13 comments

@sevaa (Contributor) commented Dec 7, 2024

This won't be the first time users have called for a low-memory mode for handling DWARF in extra-large binaries. See also #540, #576, and #479; #481 for a slurp alternative.

I think the low-memory mode should be explicitly opt-in - I don't like imposing size thresholds on other people; let the library consumers decide. The large-file flag would be off by default, would go as an optional parameter into ELFFile.get_dwarf_info() and into the DWARFInfo() constructor, and would be kept there for all the parsing logic to see.
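
To make it concrete, here's a minimal sketch of what the opt-in could look like; the lowmem parameter name is just a placeholder, it doesn't exist today:

```python
from elftools.elf.elffile import ELFFile

with open('huge_binary.elf', 'rb') as f:
    elf = ELFFile(f)
    # Off by default; consumers who know their binaries are huge opt in.
    dwarf = elf.get_dwarf_info(lowmem=True)  # `lowmem` is hypothetical
```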

Eliminating the mandatory caching of CUs and DIEs would be rather trivial (but it might break some APIs, or make them crazily slow - I'll have to try and see). But we still have the problem of slurping whole sections into memory. We discussed this in #557 and settled on memory mapping as the way to go; however, how would you prefer to deal with transforms? If there is compression and/or relocation in the DWARF sections, would you rather:

  • throw an exception that the binary is not compatible with the low memory mode
  • do the transforms anyway, even if it will eat a bunch of memory
  • let the user choose by either:
    • introducing another optional, false-by-default boolean flag
    • making the large-file flag a three-way one (True|False|None or an enum; see the sketch after this list)
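
A sketch of the three-way variant from the last bullet, assuming an enum rather than True|False|None (all names invented):

```python
from enum import Enum

class DWARFStreamMode(Enum):
    SLURP = 'slurp'  # current behavior: read the whole section into memory
    MMAP = 'mmap'    # memory-map the section; raise if transforms are required
    AUTO = 'auto'    # memory-map when possible, fall back to slurping otherwise
```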

Also, how would you like to test the large-file mode?

OBTW, any reason why _read_dwarf_section() writes the section data (a bytes object) into an empty BytesIO rather than constructing the BytesIO around the data?
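
To illustrate the difference - a standalone sketch, not the actual _read_dwarf_section() code:

```python
import io

section_data = b'\x00' * 16  # stand-in for a DWARF section's contents

# The write-into-an-empty-buffer pattern: copies the data into the stream.
stream = io.BytesIO()
stream.write(section_data)
stream.seek(0)

# Constructing the BytesIO around the data skips the explicit write/seek;
# CPython can even share the buffer with the bytes object until the stream
# is first written to.
stream = io.BytesIO(section_data)
```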

@eliben (Owner) commented Dec 16, 2024

It's tricky. I worry this will significantly complicate the library just for the sake of some edge use cases. I wonder if this should be in scope at all...

> OBTW, any reason why _read_dwarf_section() writes the section data (a bytes object) into an empty BytesIO rather than constructing the BytesIO around the data?

Not that I can think of right now. You can try changing it; if all tests pass, we should be fine.

@sevaa (Contributor, Author) commented Dec 17, 2024

Looks like you're not too hot on this. I sympathize; breaking well-tested, working code on purpose is a mental block for me, too. Okay, let's try to scale back the impact.

How about a no-slurp (memory-map instead) mode, but with the caches intact? This way, the only bits that have to change are in _read_dwarf_section(): the debug section stream becomes a custom, BytesIO-like object instead of BytesIO proper (similar to what we have in #481), and a couple of spots that rely on read() returning bytes have to be patched. Another side effect is that the binary has to remain open while the DWARFInfo is alive (because of the memory mapping); to do this right, we'd have to slap context management methods on DWARFInfo.
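
A rough sketch of what such a stream could look like - not the #481 code, just an illustration that maps the whole file and slices on demand:

```python
import mmap

class MmapSectionStream(object):
    """BytesIO-like, read-only stream over a memory-mapped file region."""

    def __init__(self, f, offset, size):
        # Map the whole file and slice into it, which sidesteps mmap's
        # requirement that the offset be a multiple of the allocation
        # granularity. The file must stay open while the mapping lives.
        self._mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        self._base, self._size, self._pos = offset, size, 0

    def read(self, n=-1):
        if n < 0 or self._pos + n > self._size:
            n = self._size - self._pos
        start = self._base + self._pos
        self._pos += n
        return self._mm[start:start + n]  # slicing a mmap returns bytes

    def seek(self, pos, whence=0):
        if whence == 0:    # os.SEEK_SET
            self._pos = pos
        elif whence == 1:  # os.SEEK_CUR
            self._pos += pos
        else:              # os.SEEK_END
            self._pos = self._size + pos
        return self._pos

    def tell(self):
        return self._pos

    def close(self):
        self._mm.close()
```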

@eliben (Owner) commented Dec 20, 2024

That could be worth trying, to see what savings we get with large DWARF data and what performance effect it has in normal mode.

@sevaa (Contributor, Author) commented Dec 20, 2024

The questions from the original post still stand.

@eliben (Owner) commented Dec 20, 2024

We can start with the simplest option - "throw an exception that the binary is not compatible with the low memory mode" - and work from there.
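
A sketch of that simplest policy - the guard function itself is hypothetical, though DWARFError and SH_FLAGS are existing pyelftools names:

```python
from elftools.common.exceptions import DWARFError
from elftools.elf.constants import SH_FLAGS

def check_lowmem_compatible(section):
    # Hypothetical guard: a compressed section can't be memory-mapped
    # verbatim, so low-memory mode refuses it up front. A similar check
    # would be needed for sections that require relocation.
    if section['sh_flags'] & SH_FLAGS.SHF_COMPRESSED:
        raise DWARFError(
            '%s is compressed; incompatible with low-memory mode' % section.name)
```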

> Also, how would you like to test the large-file mode?

We can start by just testing this mode on small files, no? We don't necessarily need very large files in the test suite.

If there are additional questions, please point them out explicitly; I did my best to locate them :)

@sevaa (Contributor, Author) commented Dec 20, 2024

I'm wondering more along the lines of how this fits into the autotest. Running the readelf/dwarfdump tests twice - the second time in no-slurp mode - would provide sufficient coverage, but it would also be time-consuming.

@eliben (Owner) commented Dec 20, 2024

> I'm wondering more along the lines of how this fits into the autotest. Running the readelf/dwarfdump tests twice - the second time in no-slurp mode - would provide sufficient coverage, but it would also be time-consuming.

I'm not too worried about this, TBH.

First of all, running everything twice may not be too bad, because it's all perfectly parallelizable.
Second, we don't really have to run all the tests in no-slurp mode; since the actual code delta exercised between the two modes is small, it may be enough to re-run a small subset of tests in this mode.

@sevaa (Contributor, Author) commented Dec 20, 2024

How would you prefer to identify the subset of test files to be run in no-slurp mode? I am not thrilled about introducing another folder with copies of the corpus binaries. Symlinks are not particularly source-control friendly. A hard-coded file list in a Python source file smells, too.

@eliben (Owner) commented Dec 21, 2024

We already have some mentions of specific test file names in the full tests, so a hard-coded list would be a reasonable start.
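
For example, something as plain as a module-level list; the file names below are invented for illustration:

```python
# Hypothetical subset of corpus files to re-run with the no-slurp flag.
NO_SLURP_SUBSET = [
    'exe_simple64.elf',
    'exe_stripped64.elf',
]

def runs_in_no_slurp_mode(filename):
    # The autotest would re-run only these files in no-slurp mode.
    return filename in NO_SLURP_SUBSET
```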

I'm trying to be pragmatic here. Maybe we can design a whole higher-level system for our tests where we can define exceptions like this (special skips or logic, re-running certain test files in certain modes, etc.), but right now I don't think it's worth the complexity. We can reconsider later if we see that the complexity of this logic is growing and starting to become a problem. Since it's just the internal tests and not a public API, I feel confident about being able to tweak and refactor it as we go.

@sevaa (Contributor, Author) commented Dec 31, 2024

I have a working implementation now. API-wise, it's an extra flag to get_dwarf_info and an extra method to check for transforms. On the subset of the corpus that allows straightforward memory mapping of the DWARF sections - 10 binaries - the readelf test passes. From my analysis, all the return values from the API are identical to those in the regular mode.

I didn't do any comparisons of memory consumption on extra-large binaries, since I don't have any. On small binaries, I suspect memory mapping is more or less equivalent to slurping anyway. The whole purpose of this endeavor was fixing OOM errors on large binaries, and we don't have a repro for that. Any idea how to approach this?

@eliben (Owner) commented Jan 2, 2025

Cool!

Reports of problems with large files pop up in issues once in a while; maybe it's worth checking with those folks whether they can upload the giant binaries they were having issues with somewhere?

Another option is to build something like LLVM from source; IIRC the binaries (e.g. clang) are pretty sizable, especially when built in debug mode.

@sevaa (Contributor, Author) commented Jan 5, 2025

I put together a sample binary with about 50 MB of DWARF data and ran it with some profiling. After a full CU/DIE scroll, memory usage is dominated by the caches in the CU/DWARFInfo objects - process memory usage goes to 2.5 GB.
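
The scroll amounted to something like this sketch (big_binary.elf is a placeholder; resource is Unix-only, and ru_maxrss is KiB on Linux but bytes on macOS):

```python
import resource
from elftools.elf.elffile import ELFFile

with open('big_binary.elf', 'rb') as f:  # placeholder for the sample binary
    dwarf = ELFFile(f).get_dwarf_info()
    # A full CU/DIE scroll - touch every DIE so the caches fill up.
    for cu in dwarf.iter_CUs():
        for die in cu.iter_DIEs():
            pass
    print('peak RSS:', resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
```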

Looks like the memory consumption problem won't be solved by memory mapping alone.

That said, merely nuking DWARFInfo._cu_cache followed by a forced garbage collection frees most of it.
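
The purge itself is a two-liner, though it pokes at an internal attribute that isn't public API:

```python
import gc

# `dwarfinfo` is an existing DWARFInfo instance after a full CU/DIE scroll.
dwarfinfo._cu_cache = []  # drop the cached CUs (internal attribute)
gc.collect()              # collect the now-unreachable CU/DIE trees
```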

@eliben (Owner) commented Jan 6, 2025

Thanks for checking. Interesting results. 2.5 GB of memory usage for caching a 50 MB file seems excessive; maybe it can be optimized?

It may be that we should just keep pyelftools for the smaller binaries and not worry about large mode too much; it's a lot of complexity and may need very deep changes to support. But maybe some work can be done to make the caches more efficient, thus increasing the size "horizon" pyelftools supports out of the box.
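
One possible direction - just a thought, not something pyelftools does today: classes instantiated by the millions, like DIEs, save a lot of per-instance overhead when they declare __slots__ instead of carrying a per-instance __dict__. A toy illustration:

```python
import sys

class PlainNode(object):
    def __init__(self, tag, offset):
        self.tag, self.offset = tag, offset

class SlottedNode(object):
    __slots__ = ('tag', 'offset')
    def __init__(self, tag, offset):
        self.tag, self.offset = tag, offset

# Most of the per-object memory in small attribute-bag classes is the
# per-instance dict; a slotted class doesn't have one at all.
print(sys.getsizeof(PlainNode('DW_TAG_member', 0).__dict__))
print(sys.getsizeof(SlottedNode('DW_TAG_member', 0)))
```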
