-
Notifications
You must be signed in to change notification settings - Fork 571
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Eliminate drbbdup bookkeeping for single-case-forever uses #5411
Comments
This seems okay with me. Note, it would require setting up dup options again for translations and traces, which might be expensive with respect to the user's use-case. |
I'm not sure I understand: if you could spell this out? |
It is now clear that this overhead is the cause for out-of-memory failures we're seeing in a proprietary app: it has 2.5M blocks, so that's ~583MB extra just from drbbdup tracking even when duplication is turned off. We had vmcode size set to 1G, and client heap is reachable for backward compatibility: meaning that drbbdup's bookkeeping took up over half the limited reachable space. We can bump to 2G, but that's the absolute max and it precludes -vm_base_near_app and so incurs overheads. Maybe we should change the client heap default, or change drbbdup to explicitly request unreachable heap. |
If I understood you correctly, you would like to remove drbbdup_manager_t structures, which are associated with basic blocks for book-keeping purposes, to improve memory performance. This is possible, provided dynamic handling is not enabled (although I believe I can optimise the structure's memory footprint when enabled as well). However, this requires setting up a temporary manager, which in turn invokes a user callback, upon every basic block event, including for translations and traces. The manager is thrown away once instrumentation phases are completed. Currently, drbbdup limits this with the use of a hash-table, and only invokes the callback if the manager lookup fails. |
BTW, to be clear, the point I mentioned is not a blocker; just an observation. |
The proposal is to remove the drbbdup_manager_t structures only when there is zero duplication: i.e., when drbbdup is doing nothing. For drmemtrace, the normal mode is to not duplicate at all, and we'd like to avoid paying for having drbbdup linked in and available for runtime options to turn on when it is not actually turned on. |
Switch drbbdup's per-block bookkeeping to use unreachable heap instead of reachable. This data can really add up, taking up half a GB or more on large applications -- which really eats into or exceeds the limited reachable space. Quick small local test: Before: Basic block fragments generated : 5169 Trace fragments generated : 527 Peak vmm blocks for unreachable heap : 641 Peak vmm blocks for reachable heap : 336 After: Basic block fragments generated : 5169 Trace fragments generated : 527 Peak vmm blocks for unreachable heap : 757 Peak vmm blocks for reachable heap : 230 Issue: #5411
I think the approach I mentioned above caters for zero dups as well. I see that your approach is to use unreachable heap; this is a good first step. |
Switches drbbdup's per-block bookkeeping to use unreachable heap instead of reachable. This data can really add up, taking up half a GB or more on large applications -- which really eats into or exceeds the limited reachable space. Quick small local test: Before: Basic block fragments generated : 5169 Trace fragments generated : 527 Peak vmm blocks for unreachable heap : 641 Peak vmm blocks for reachable heap : 336 After: Basic block fragments generated : 5169 Trace fragments generated : 527 Peak vmm blocks for unreachable heap : 757 Peak vmm blocks for reachable heap : 230 Issue: #5411
Even with PR #5417 we're seeing a lot of reachable heap still in use. Looking at drbbdup's per-thread heap, it is much larger than expect: |
Lazy allocation is okay. I don't think making it unreachable is possible as the code in the cache directly accesses it to keep track of hits. For now, lazy allocation should mitigate the issue for your use-case as it should be enabled only for dynamic case handling. |
Hmm, I'm looking at drbbdup_insert_dynamic_handling() where it seems to load from the pointer stored in a TLS slot into a register. So that usage can handle unreachable heap. Is there another usage somewhere that is using PC-relative addressing (or an absolute address)? |
Okay, that works then! |
Moves all drbbdup per-thread memory to unreachable heap. Moves the drbbdup dynamic case hit table to be dynamically allocated based on a new drbbdup option "never_enable_dynamic_handling" not being set. This eliminates 128K per thread for drmemtrace, which makes a big difference on large applications. Sanity test: Before: $ bin64/drrun -rstats_to_stderr -t drcachesim -offline -max_global_trace_refs 10K -- suite/tests/bin/client.annotation-concurrency libclient.annotation-concurrency.appdll.so A 4 64 3 Peak threads under DynamoRIO control : 5 Peak vmm blocks for unreachable heap : 767 Peak vmm blocks for reachable heap : 726 Peak vmm virtual memory in use (bytes) : 10293248 After: Peak threads under DynamoRIO control : 5 Peak vmm blocks for unreachable heap : 736 Peak vmm blocks for reachable heap : 250 Peak vmm virtual memory in use (bytes) : 8183808 Issue: #5411
Moves all drbbdup per-thread memory to unreachable heap. Moves the drbbdup dynamic case hit table to be dynamically allocated based on a new drbbdup option "never_enable_dynamic_handling" not being set. This eliminates 128K per thread for drmemtrace, which makes a big difference on large applications. Sanity test: Before: $ bin64/drrun -rstats_to_stderr -t drcachesim -offline -max_global_trace_refs 10K -- suite/tests/bin/client.annotation-concurrency libclient.annotation-concurrency.appdll.so A 4 64 3 Peak threads under DynamoRIO control : 5 Peak vmm blocks for unreachable heap : 767 Peak vmm blocks for reachable heap : 726 Peak vmm virtual memory in use (bytes) : 10293248 After: Peak threads under DynamoRIO control : 5 Peak vmm blocks for unreachable heap : 736 Peak vmm blocks for reachable heap : 250 Peak vmm virtual memory in use (bytes) : 8183808 Issue: #5411
Augments drbbdup to allow a value of 0 for drbbdup_options_t.non_default_case_limit. When it is set to 0, no duplication will occur, and no per-block data structures will be allocated, saving memory. Adds a test case for this. Updates drmemtrace to use this when multi-case options are not selected. Sanity test: Before: $ bin64/drrun -rstats_to_stderr -t drcachesim -offline -max_global_trace_refs 10K -- ~/spec2006/bzip2-test/bzip2_base.gcc-64bit ~/spec2006/bzip2-test/dryer.jpg 2 Peak vmm blocks for unreachable heap : 747 Peak vmm virtual memory in use (bytes) : 6963200 After: Peak vmm blocks for unreachable heap : 631 Peak vmm virtual memory in use (bytes) : 6389760 Along with the prior shift to use unreachable heap for per-block data and then to eliminate and improve per-thread data, this concludes the memory reductions for #5411. Fixes #5411
…5428) Augments drbbdup to allow a value of 0 for drbbdup_options_t.non_default_case_limit. When it is set to 0, no duplication will occur, and no per-block data structures will be allocated, saving memory. Long-term, integration with drmgr #5400 might provide a cleaner solution. Adds a test case for this. Updates drmemtrace to use this when multi-case options are not selected. Sanity test: Before: $ bin64/drrun -rstats_to_stderr -t drcachesim -offline -max_global_trace_refs 10K -- ~/spec2006/bzip2-test/bzip2_base.gcc-64bit ~/spec2006/bzip2-test/dryer.jpg 2 Peak vmm blocks for unreachable heap : 747 Peak vmm virtual memory in use (bytes) : 6963200 After: Peak vmm blocks for unreachable heap : 631 Peak vmm virtual memory in use (bytes) : 6389760 Along with the prior shift to use unreachable heap for per-block data and then to eliminate and improve per-thread data, this concludes the memory reductions for #5411. Fixes #5411
With drbbdup inside drmemtrace (#3995) we're seeing drbbdup use
an extra ~35MB ("Peak vmm virtual memory in use" rstat) for ~150K blocks;
so ~140MB for ~600K blocks. This is not horrible, but is not insignificant,
and is unnecessary: if the user promises to never use more
than just one case (which memtrace can certainly do based on its own
runtime options at init time) drbbdup shouldn't need any bookkeeping.
The text was updated successfully, but these errors were encountered: