Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eliminate drbbdup bookkeeping for single-case-forever uses #5411

Closed
derekbruening opened this issue Mar 10, 2022 · 11 comments · Fixed by #5428
Closed

Eliminate drbbdup bookkeeping for single-case-forever uses #5411

derekbruening opened this issue Mar 10, 2022 · 11 comments · Fixed by #5428
Assignees

Comments

@derekbruening
Copy link
Contributor

With drbbdup inside drmemtrace (#3995) we're seeing drbbdup use
an extra ~35MB ("Peak vmm virtual memory in use" rstat) for ~150K blocks;
so ~140MB for ~600K blocks. This is not horrible, but is not insignificant,
and is unnecessary: if the user promises to never use more
than just one case (which memtrace can certainly do based on its own
runtime options at init time) drbbdup shouldn't need any bookkeeping.

@derekbruening derekbruening self-assigned this Mar 10, 2022
@johnfxgalea
Copy link
Contributor

This seems okay with me. Note, it would require setting up dup options again for translations and traces, which might be expensive with respect to the user's use-case.

@derekbruening
Copy link
Contributor Author

Note, it would require setting up dup options again for translations and traces, which might be expensive with respect to the user's use-case.

I'm not sure I understand: if you could spell this out?

@derekbruening
Copy link
Contributor Author

It is now clear that this overhead is the cause for out-of-memory failures we're seeing in a proprietary app: it has 2.5M blocks, so that's ~583MB extra just from drbbdup tracking even when duplication is turned off. We had vmcode size set to 1G, and client heap is reachable for backward compatibility: meaning that drbbdup's bookkeeping took up over half the limited reachable space. We can bump to 2G, but that's the absolute max and it precludes -vm_base_near_app and so incurs overheads.

Maybe we should change the client heap default, or change drbbdup to explicitly request unreachable heap.

@johnfxgalea
Copy link
Contributor

johnfxgalea commented Mar 12, 2022

Note, it would require setting up dup options again for translations and traces, which might be expensive with respect to the user's use-case.

I'm not sure I understand: if you could spell this out?

If I understood you correctly, you would like to remove drbbdup_manager_t structures, which are associated with basic blocks for book-keeping purposes, to improve memory performance. This is possible, provided dynamic handling is not enabled (although I believe I can optimise the structure's memory footprint when enabled as well).

However, this requires setting up a temporary manager, which in turn invokes a user callback, upon every basic block event, including for translations and traces. The manager is thrown away once instrumentation phases are completed. Currently, drbbdup limits this with the use of a hash-table, and only invokes the callback if the manager lookup fails.

@johnfxgalea
Copy link
Contributor

BTW, to be clear, the point I mentioned is not a blocker; just an observation.

@derekbruening
Copy link
Contributor Author

The proposal is to remove the drbbdup_manager_t structures only when there is zero duplication: i.e., when drbbdup is doing nothing. For drmemtrace, the normal mode is to not duplicate at all, and we'd like to avoid paying for having drbbdup linked in and available for runtime options to turn on when it is not actually turned on.

derekbruening added a commit that referenced this issue Mar 12, 2022
Switch drbbdup's per-block bookkeeping to use unreachable heap instead
of reachable.  This data can really add up, taking up half a GB or
more on large applications -- which really eats into or exceeds the
limited reachable space.

Quick small local test:
Before:
                   Basic block fragments generated :              5169
                         Trace fragments generated :               527
              Peak vmm blocks for unreachable heap :               641
                Peak vmm blocks for reachable heap :               336
After:
                   Basic block fragments generated :              5169
                         Trace fragments generated :               527
              Peak vmm blocks for unreachable heap :               757
                Peak vmm blocks for reachable heap :               230

Issue: #5411
@johnfxgalea
Copy link
Contributor

I think the approach I mentioned above caters for zero dups as well. I see that your approach is to use unreachable heap; this is a good first step.

derekbruening added a commit that referenced this issue Mar 12, 2022
Switches drbbdup's per-block bookkeeping to use unreachable heap instead
of reachable.  This data can really add up, taking up half a GB or
more on large applications -- which really eats into or exceeds the
limited reachable space.

Quick small local test:
Before:
                   Basic block fragments generated :              5169
                         Trace fragments generated :               527
              Peak vmm blocks for unreachable heap :               641
                Peak vmm blocks for reachable heap :               336
After:
                   Basic block fragments generated :              5169
                         Trace fragments generated :               527
              Peak vmm blocks for unreachable heap :               757
                Peak vmm blocks for reachable heap :               230

Issue: #5411
@derekbruening
Copy link
Contributor Author

Even with PR #5417 we're seeing a lot of reachable heap still in use. Looking at drbbdup's per-thread heap, it is much larger than expect: drbbdup_per_thread.hit_counts is 64K*2=128K bytes. We have apps with 10K threads, so that's 1GB right there! As reachable heap that's half the max 2GB you can have. This needs to be made unreachable too, and probably that giant array should be allocated dynamically (since our use case doesn't need it).

@johnfxgalea
Copy link
Contributor

johnfxgalea commented Mar 14, 2022

Lazy allocation is okay. I don't think making it unreachable is possible as the code in the cache directly accesses it to keep track of hits. For now, lazy allocation should mitigate the issue for your use-case as it should be enabled only for dynamic case handling.

@derekbruening
Copy link
Contributor Author

Lazy allocation is okay. I don't think making it unreachable is possible as the code in the cache directly accesses it to keep track of hits. For now, lazy allocation should mitigate the issue for your use-case as it should be enabled only for dynamic case handling.

Hmm, I'm looking at drbbdup_insert_dynamic_handling() where it seems to load from the pointer stored in a TLS slot into a register. So that usage can handle unreachable heap. Is there another usage somewhere that is using PC-relative addressing (or an absolute address)?

@johnfxgalea
Copy link
Contributor

So that usage can handle unreachable heap.

Okay, that works then!

derekbruening added a commit that referenced this issue Mar 14, 2022
Moves all drbbdup per-thread memory to unreachable heap.

Moves the drbbdup dynamic case hit table to be dynamically allocated
based on a new drbbdup option "never_enable_dynamic_handling" not
being set.  This eliminates 128K per thread for drmemtrace, which
makes a big difference on large applications.

Sanity test:
Before:
  $ bin64/drrun -rstats_to_stderr -t drcachesim -offline -max_global_trace_refs 10K -- suite/tests/bin/client.annotation-concurrency libclient.annotation-concurrency.appdll.so A 4 64 3
              Peak threads under DynamoRIO control :                 5
              Peak vmm blocks for unreachable heap :               767
                Peak vmm blocks for reachable heap :               726
            Peak vmm virtual memory in use (bytes) :          10293248
After:
              Peak threads under DynamoRIO control :                 5
              Peak vmm blocks for unreachable heap :               736
                Peak vmm blocks for reachable heap :               250
            Peak vmm virtual memory in use (bytes) :           8183808

Issue: #5411
derekbruening added a commit that referenced this issue Mar 15, 2022
Moves all drbbdup per-thread memory to unreachable heap.

Moves the drbbdup dynamic case hit table to be dynamically allocated
based on a new drbbdup option "never_enable_dynamic_handling" not
being set.  This eliminates 128K per thread for drmemtrace, which
makes a big difference on large applications.

Sanity test:
Before:
  $ bin64/drrun -rstats_to_stderr -t drcachesim -offline -max_global_trace_refs 10K -- suite/tests/bin/client.annotation-concurrency libclient.annotation-concurrency.appdll.so A 4 64 3
              Peak threads under DynamoRIO control :                 5
              Peak vmm blocks for unreachable heap :               767
                Peak vmm blocks for reachable heap :               726
            Peak vmm virtual memory in use (bytes) :          10293248
After:
              Peak threads under DynamoRIO control :                 5
              Peak vmm blocks for unreachable heap :               736
                Peak vmm blocks for reachable heap :               250
            Peak vmm virtual memory in use (bytes) :           8183808

Issue: #5411
derekbruening added a commit that referenced this issue Mar 16, 2022
Augments drbbdup to allow a value of 0 for
drbbdup_options_t.non_default_case_limit.  When it is set to 0, no
duplication will occur, and no per-block data structures will be
allocated, saving memory.

Adds a test case for this.

Updates drmemtrace to use this when multi-case options are not
selected.

Sanity test:
  Before:
  $ bin64/drrun -rstats_to_stderr -t drcachesim -offline -max_global_trace_refs 10K -- ~/spec2006/bzip2-test/bzip2_base.gcc-64bit ~/spec2006/bzip2-test/dryer.jpg 2
                Peak vmm blocks for unreachable heap :               747
              Peak vmm virtual memory in use (bytes) :           6963200
  After:
                Peak vmm blocks for unreachable heap :               631
              Peak vmm virtual memory in use (bytes) :           6389760

Along with the prior shift to use unreachable heap for per-block data
and then to eliminate and improve per-thread data, this concludes the
memory reductions for #5411.

Fixes #5411
derekbruening added a commit that referenced this issue Mar 17, 2022
…5428)

Augments drbbdup to allow a value of 0 for
drbbdup_options_t.non_default_case_limit.  When it is set to 0, no
duplication will occur, and no per-block data structures will be
allocated, saving memory.

Long-term, integration with drmgr #5400 might provide a cleaner solution.

Adds a test case for this.

Updates drmemtrace to use this when multi-case options are not
selected.

Sanity test:
  Before:
  $ bin64/drrun -rstats_to_stderr -t drcachesim -offline -max_global_trace_refs 10K -- ~/spec2006/bzip2-test/bzip2_base.gcc-64bit ~/spec2006/bzip2-test/dryer.jpg 2
                Peak vmm blocks for unreachable heap :               747
              Peak vmm virtual memory in use (bytes) :           6963200
  After:
                Peak vmm blocks for unreachable heap :               631
              Peak vmm virtual memory in use (bytes) :           6389760

Along with the prior shift to use unreachable heap for per-block data
and then to eliminate and improve per-thread data, this concludes the
memory reductions for #5411.

Fixes #5411
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants