drcachesim: optimize cache simulator #1738

Open

zhaoqin opened this issue Jul 15, 2015 · 5 comments

@zhaoqin
Contributor

zhaoqin commented Jul 15, 2015

Currently, the cache simulator runs at ~500x native execution time. The overall slowdown includes profiling overhead and communication overhead, but the cache simulator's own overhead dominates.

One simple optimization is to parallelize the cache simulator by splitting memory into sub-regions and running a separate cache simulator for each sub-region.

@zhaoqin
Contributor Author

zhaoqin commented Jul 15, 2015

Xref original issue #1703

@peterpengwei

Does multithreading sound like a good solution to alleviate the issue? My initial thought is to assign each cache its own independent pthread. The LLC thread would hold a pthread mutex that all the I & D cache threads use to arbitrate access to it. If this sounds good, I will start implementing it to see if it helps.
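To make the idea concrete, here is a minimal sketch of the per-cache-thread design (hypothetical code with invented names, not drcachesim code):

```cpp
#include <cstdint>
#include <mutex>

// Hypothetical sketch of the proposed design; not drcachesim code.
struct llc_t {
    std::mutex lock; // single mutex arbitrating all I & D cache threads
    bool access(std::uint64_t /*addr*/) { return false; /* LLC lookup elided */ }
};

// One reference handled on an I or D cache's dedicated pthread: every
// L1 miss must take the shared LLC mutex, so misses from all cache
// threads serialize on this single lock.
void handle_reference(llc_t &llc, std::uint64_t addr)
{
    bool l1_hit = false; // per-thread L1 lookup elided
    if (!l1_hit) {
        std::lock_guard<std::mutex> guard(llc.lock);
        llc.access(addr);
    }
}
```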

@zhaoqin
Contributor Author

zhaoqin commented Jul 29, 2015

No, you should not parallelize the cache simulator by assigning each cache to an independent pthread.
The communication overhead among the threads would be significant and would dominate the slowdown.

@zhaoqin
Contributor Author

zhaoqin commented Jul 29, 2015

The right way is to split the address space into sub-regions, with each sub-region simulated by one thread.
For example, you can use 4 threads to simulate memory references in the address ranges [4N, 4N+cacheline), [4N+cacheline, 4N+2×cacheline), [4N+2×cacheline, 4N+3×cacheline), and [4N+3×cacheline, 4N+4×cacheline). With that split there is no communication among the four threads, so it should achieve the maximum parallelization.
The potential downside is that the memory references might concentrate on one or two of the sub-regions.
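For concreteness, the mapping from a reference address to its owning thread could look like this minimal sketch (hypothetical code with invented names, assuming a 64-byte line):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch, not drcachesim code: route each reference to
// one of NUM_SHARDS independent simulator threads based on the
// cache-line index of its address.  Consecutive lines round-robin
// across the shards, so no two threads ever share a line's state.
constexpr std::size_t LINE_SIZE = 64; // assumed cache-line size
constexpr std::size_t NUM_SHARDS = 4; // one simulator thread per shard

std::size_t shard_for(std::uint64_t addr)
{
    return (addr / LINE_SIZE) % NUM_SHARDS;
}
```

Each shard then runs a complete, private cache model over only the lines it owns, which is why no cross-thread synchronization is needed; the cost is the load imbalance mentioned above when hot lines cluster in one shard.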

@derekbruening
Contributor

derekbruening commented Mar 29, 2021

With larger cache hierarchies and higher associativity (such as simulating a full 2-socket Skylake system) I'm seeing significant time spent walking the ways looking for tags, particularly in invalidate() (this is with coherence turned on as well). I found that inserting a hashtable (if it's initialized to a large enough starting size) results in a 15% speedup for my setup. I'll post the PR.

derekbruening added a commit that referenced this issue Apr 7, 2021
Replaces drcachesim's loops over all ways with a hashtable lookup.
For larger cache hierarchies and caches with higher associativity this
increases performance by 15% in cpu-bound tests on offline traces,
when we use a large initial table size to avoid resizes which seem to
outweigh the gains.

The hashtable unfortunately results in a 15% slowdown on simple cache
hierarchies, due to the extra time in erase() and other maintenance
operations outweighing the smaller gains in lookup.  Thus, we make the
default to *not* use a hashtable and use the original linear walk,
providing a method to optionally enable the hashtable.  The cache
simulator enables the hashtables for any 3+-level cache hierarchy with
either coherence or many cores.

Adds coherence to some existing 3-level-hierarchy tests to ensure we
have tests that cover the hashtable path.

The TLB simulator will need to tweak these hashtables, but it looks
like it is already doing the wrong thing in invalidate() and other
simulator_t methods, filed as #4816.

Issue: #1738, #4816
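To illustrate the tradeoff described above (a hedged sketch with invented names, not the actual drcachesim classes): the hashtable turns the O(associativity) walk over the ways into an O(1) tag lookup, but must be maintained on every replacement:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of one cache set; names are invented.
struct cache_set_t {
    std::vector<std::uint64_t> way_tags;               // one tag per way
    std::unordered_map<std::uint64_t, int> tag_to_way; // optional index
    bool use_hashtable = false;

    // Returns the way holding 'tag', or -1 on a miss.
    int find_way(std::uint64_t tag) const
    {
        if (use_hashtable) {
            auto it = tag_to_way.find(tag);
            return it == tag_to_way.end() ? -1 : it->second;
        }
        // Original linear walk over all ways: cheap for small
        // associativity, costly for large, highly-associative caches.
        for (int w = 0; w < static_cast<int>(way_tags.size()); ++w) {
            if (way_tags[w] == tag)
                return w;
        }
        return -1;
    }

    // On replacement the map must be kept in sync (erase + insert);
    // this maintenance is what makes the hashtable a net loss on
    // simple hierarchies.
    void replace(int way, std::uint64_t new_tag)
    {
        if (use_hashtable) {
            tag_to_way.erase(way_tags[way]);
            tag_to_way[new_tag] = way;
        }
        way_tags[way] = new_tag;
    }
};
```

Reserving a large bucket count up front (std::unordered_map::reserve) corresponds to the "large initial table size" mentioned above; without it, rehashing can outweigh the lookup gains.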
derekbruening added a commit that referenced this issue Apr 8, 2021