I've been running various DFT calculations with PSI4 inside Intel's VTune, and gg_fast_transpose popped up as a top hotspot. As a test I swapped it for gg_naive_transpose and saw a significant speed-up (50% for a C60 test) in that function.
It's only 4-5% of the total CPU time for single-points, so no real bottleneck to worry about, but I'm wondering why the blocked transpose might be so much slower.
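For context, here is a minimal sketch of the two strategies being compared; the function names and the block size are illustrative, not the actual gau2grid implementations:

```c
#include <stddef.h>

/* Naive transpose: a single row-major sweep over the input.
 * Reads stream nicely; writes stride across the output rows. */
void naive_transpose(size_t n, size_t m, const double *input, double *output) {
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < m; j++) {
            output[j * n + i] = input[i * m + j];
        }
    }
}

/* Blocked transpose: work on BLOCK x BLOCK tiles so that both the
 * read tile and the write tile fit in cache at the same time. */
#define BLOCK 16
void blocked_transpose(size_t n, size_t m, const double *input, double *output) {
    for (size_t ib = 0; ib < n; ib += BLOCK) {
        size_t imax = (ib + BLOCK < n) ? ib + BLOCK : n;
        for (size_t jb = 0; jb < m; jb += BLOCK) {
            size_t jmax = (jb + BLOCK < m) ? jb + BLOCK : m;
            for (size_t i = ib; i < imax; i++) {
                for (size_t j = jb; j < jmax; j++) {
                    output[j * n + i] = input[i * m + j];
                }
            }
        }
    }
}
```

Whether blocking pays off depends on the matrix shapes and on the tile size relative to the cache; for small or very skinny matrices the extra loop bookkeeping can outweigh the cache benefit.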
This is something that I noticed as well. I never quite understood why the block transpose was sometimes bad, but I never dug into it. The best thing to do would be to transpose the data as it comes out of L1 cache, to prevent two L1-DRAM round trips.
Sorry for the non-answer; it's an open question for me as well.
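To make that suggestion concrete, here is a hypothetical sketch of fusing the transpose into the loop that produces the data, so each strip is scattered into the transposed layout while it is still cache-resident. `compute_strip`, `produce_transposed`, and the buffer sizes are placeholders for this sketch, not real PSI4/gau2grid routines:

```c
#include <stddef.h>

#define TILE 8      /* chosen so a TILE-row strip stays resident in L1 */
#define MAXCOLS 256 /* assumed upper bound on m for the scratch buffer */

/* Stand-in for whatever kernel actually produces a strip of values;
 * not a real PSI4/gau2grid routine. */
static void compute_strip(size_t row0, size_t rows, size_t m, double *strip) {
    for (size_t r = 0; r < rows; r++)
        for (size_t j = 0; j < m; j++)
            strip[r * m + j] = (double)((row0 + r) * m + j);
}

/* Produce the data strip by strip and write each strip out already
 * transposed, so the values make one trip between the core and DRAM
 * instead of being stored row-major and transposed in a second pass. */
void produce_transposed(size_t n, size_t m, double *output) {
    double strip[TILE * MAXCOLS];
    for (size_t i0 = 0; i0 < n; i0 += TILE) {
        size_t rows = (i0 + TILE < n) ? TILE : n - i0;
        compute_strip(i0, rows, m, strip);   /* strip is hot in L1 here */
        for (size_t j = 0; j < m; j++)
            for (size_t r = 0; r < rows; r++)
                output[j * n + (i0 + r)] = strip[r * m + j];
    }
}
```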