drcachesim: optimize cache simulator #1738
Comments
Xref original issue #1703
Does multithreading sound like a good way to alleviate the issue? My initial thought is to give each cache its own pthread, with the LLC thread holding a pthread mutex to arbitrate among all the I and D caches. If that sounds good, I will start implementing it to see whether it helps.
No, you should not parallelize the cache simulator by assigning each cache its own pthread.
The right way is to split the cache into subregions and have one thread simulate each subregion.
With larger cache hierarchies and higher associativity (such as simulating a full 2-socket Skylake system), I'm seeing significant time spent walking the ways looking for tags, particularly in invalidate() (this is with coherence turned on as well). I found that inserting a hashtable (if it's initialized to a large enough starting size) results in a 15% speedup for my setup. I'll post the PR.
Replaces drcachesim's loops over all ways with a hashtable lookup. For larger cache hierarchies and caches with higher associativity this increases performance by 15% in cpu-bound tests on offline traces, provided we use a large initial table size to avoid resizes, whose cost seems to outweigh the gains.

The hashtable unfortunately results in a 15% slowdown on simple cache hierarchies, because the extra time in erase() and other maintenance operations outweighs the smaller gains in lookup. Thus, the default is to *not* use a hashtable and to keep the original linear walk, with a method provided to optionally enable the hashtable. The cache simulator enables the hashtables for any 3+-level cache hierarchy with either coherence or many cores.

Adds coherence to some existing 3-level-hierarchy tests to ensure we have tests that cover the hashtable path.

The TLB simulator will need to tweak these hashtables, but it looks like it is already doing the wrong thing in invalidate() and other simulator_t methods, filed as #4816.

Issue: #1738, #4816
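For readers following along, here is a minimal sketch of the trade-off the commit message describes: a tag store that either walks the ways of a set or consults a tag-to-way hashtable. All names (`TagStore`, `find`, `install`, `invalidate`, `initial_buckets`) are hypothetical and are not drcachesim's actual classes or methods; the sketch also assumes the "tag" is the full cache-line address and that the set index is `tag % num_sets`.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical illustration only; not drcachesim's real data structures.
struct Block {
    std::uint64_t tag = 0;
    bool valid = false;
};

class TagStore {
public:
    TagStore(std::size_t num_sets, std::size_t assoc, bool use_hashtable,
             std::size_t initial_buckets = 1 << 16)
        : num_sets_(num_sets), assoc_(assoc), use_hashtable_(use_hashtable),
          blocks_(num_sets * assoc)
    {
        // A large initial size avoids rehashing during simulation.
        if (use_hashtable_)
            tag2way_.reserve(initial_buckets);
    }

    // Returns the way holding `tag`, or -1 on a miss.
    int find(std::uint64_t tag) const
    {
        if (use_hashtable_) {
            auto it = tag2way_.find(tag);
            return it == tag2way_.end() ? -1 : it->second;
        }
        // Original approach: linear walk over every way in the set.
        const std::size_t base = set_base(tag);
        for (std::size_t way = 0; way < assoc_; ++way) {
            const Block &b = blocks_[base + way];
            if (b.valid && b.tag == tag)
                return static_cast<int>(way);
        }
        return -1;
    }

    // Places `tag` into `way` of its set, evicting whatever was there.
    void install(std::uint64_t tag, std::size_t way)
    {
        Block &b = blocks_[set_base(tag) + way];
        if (use_hashtable_) {
            // Maintenance cost: the table must be kept in sync on every fill.
            if (b.valid)
                tag2way_.erase(b.tag);
            tag2way_[tag] = static_cast<int>(way);
        }
        b.tag = tag;
        b.valid = true;
    }

    // Returns true if `tag` was present and has been invalidated.
    bool invalidate(std::uint64_t tag)
    {
        const int way = find(tag);
        if (way < 0)
            return false;
        blocks_[set_base(tag) + static_cast<std::size_t>(way)].valid = false;
        if (use_hashtable_)
            tag2way_.erase(tag);
        return true;
    }

private:
    std::size_t set_base(std::uint64_t tag) const
    {
        return static_cast<std::size_t>(tag % num_sets_) * assoc_;
    }

    std::size_t num_sets_;
    std::size_t assoc_;
    bool use_hashtable_;
    std::vector<Block> blocks_;
    std::unordered_map<std::uint64_t, int> tag2way_;
};
```

With low associativity the linear walk touches only a few adjacent blocks, while the hashtable path pays for `erase()` and insertion on every fill, which is consistent with the slowdown reported above for simple hierarchies.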
Currently, the cache simulator runs at ~500x native execution time. That figure includes profiling overhead and communication overhead, but the cache simulator's own overhead dominates the overall slowdown.
One simple optimization is to parallelize the cache simulator by splitting memory into sub-regions and running a separate cache simulator for each sub-region.
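A rough sketch of the sub-region idea, under stated assumptions: it models a single set-associative LRU cache level, interprets a "sub-region" as a group of cache sets (so that no two threads ever touch the same set), and takes a pre-recorded trace of cache-line addresses. The names `ShardResult`, `simulate_shard`, and `simulate_parallel` are hypothetical and do not come from drcachesim. Multi-level hierarchies and coherence would need cross-region communication that this sketch does not attempt.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

struct ShardResult {
    std::uint64_t hits = 0;
    std::uint64_t misses = 0;
};

// Simulates a set-associative LRU cache over a sequence of line addresses.
// Each set is kept as a small vector in most-recently-used-first order.
ShardResult simulate_shard(const std::vector<std::uint64_t> &lines,
                           std::size_t num_sets, std::size_t assoc)
{
    std::vector<std::vector<std::uint64_t>> sets(num_sets);
    ShardResult res;
    for (std::uint64_t line : lines) {
        auto &set = sets[line % num_sets];
        auto it = std::find(set.begin(), set.end(), line);
        if (it != set.end()) {
            ++res.hits;
            set.erase(it); // Re-inserted at the MRU position below.
        } else {
            ++res.misses;
            if (set.size() == assoc)
                set.pop_back(); // Evict the LRU line.
        }
        set.insert(set.begin(), line);
    }
    return res;
}

// Splits the trace by set index so each thread owns a disjoint group of sets,
// then runs one simulator per group and sums the counters.
ShardResult simulate_parallel(const std::vector<std::uint64_t> &lines,
                              std::size_t num_sets, std::size_t assoc,
                              std::size_t num_shards)
{
    // Partitioning preserves the relative order of accesses within a shard.
    std::vector<std::vector<std::uint64_t>> per_shard(num_shards);
    for (std::uint64_t line : lines)
        per_shard[(line % num_sets) % num_shards].push_back(line);

    std::vector<ShardResult> results(num_shards);
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < num_shards; ++i) {
        // Each worker writes only its own results[i], so no locking is needed.
        // For simplicity every shard allocates all sets but uses only its own.
        workers.emplace_back([&, i] {
            results[i] = simulate_shard(per_shard[i], num_sets, assoc);
        });
    }
    for (auto &w : workers)
        w.join();

    ShardResult total;
    for (const auto &r : results) {
        total.hits += r.hits;
        total.misses += r.misses;
    }
    return total;
}
```

Because LRU replacement is decided per set, partitioning by set index gives the same hit and miss counts as a serial run of `simulate_shard` over the whole trace. Partitioning by raw address range instead would map several regions onto the same sets and would only approximate the serial result.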