I've been running various DFT calculations with PSI4 inside Intel's VTune, and gg_fast_transpose popped up as a top hotspot. As a test I swapped it for gg_naive_transpose and saw a significant speed-up (50% for a C60 test) in that function.
It's only 4-5% of the total CPU time for single-points, so no real bottleneck to worry about, but I'm wondering why the blocked transpose might be so much slower.
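For context, here is a minimal sketch of the two strategies being compared; the function names and the block size are illustrative, not the actual gau2grid implementations:

```c
#include <stddef.h>

/* Naive transpose: a single row-major sweep over the input.
 * Reads stream nicely; writes stride across the output rows. */
void naive_transpose(size_t n, size_t m, const double *input, double *output) {
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < m; j++) {
            output[j * n + i] = input[i * m + j];
        }
    }
}

/* Blocked transpose: work on BLOCK x BLOCK tiles so that both the
 * read tile and the write tile fit in cache at the same time. */
#define BLOCK 16
void blocked_transpose(size_t n, size_t m, const double *input, double *output) {
    for (size_t ib = 0; ib < n; ib += BLOCK) {
        size_t imax = (ib + BLOCK < n) ? ib + BLOCK : n;
        for (size_t jb = 0; jb < m; jb += BLOCK) {
            size_t jmax = (jb + BLOCK < m) ? jb + BLOCK : m;
            for (size_t i = ib; i < imax; i++) {
                for (size_t j = jb; j < jmax; j++) {
                    output[j * n + i] = input[i * m + j];
                }
            }
        }
    }
}
```

Whether blocking pays off depends on the matrix shapes and on the tile size relative to the cache; for small or very skinny matrices the extra loop bookkeeping can outweigh the cache benefit.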
This is something that I noticed as well. I never quite understood why the block transpose was sometimes bad, but I never dug into it. The best thing to do would be to transpose the data as it comes out of L1 cache, to prevent two L1-DRAM round trips.
Sorry for the non-answer; it's an open question for me as well.
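To make that suggestion concrete, here is a hypothetical sketch of fusing the transpose into the loop that produces the data, so each strip is scattered into the transposed layout while it is still cache-resident. `compute_strip`, `produce_transposed`, and the buffer sizes are placeholders for this sketch, not real PSI4/gau2grid routines:

```c
#include <stddef.h>

#define TILE 8      /* chosen so a TILE-row strip stays resident in L1 */
#define MAXCOLS 256 /* assumed upper bound on m for the scratch buffer */

/* Stand-in for whatever kernel actually produces a strip of values;
 * not a real PSI4/gau2grid routine. */
static void compute_strip(size_t row0, size_t rows, size_t m, double *strip) {
    for (size_t r = 0; r < rows; r++)
        for (size_t j = 0; j < m; j++)
            strip[r * m + j] = (double)((row0 + r) * m + j);
}

/* Produce the data strip by strip and write each strip out already
 * transposed, so the values make one trip between the core and DRAM
 * instead of being stored row-major and transposed in a second pass. */
void produce_transposed(size_t n, size_t m, double *output) {
    double strip[TILE * MAXCOLS];
    for (size_t i0 = 0; i0 < n; i0 += TILE) {
        size_t rows = (i0 + TILE < n) ? TILE : n - i0;
        compute_strip(i0, rows, m, strip);   /* strip is hot in L1 here */
        for (size_t j = 0; j < m; j++)
            for (size_t r = 0; r < rows; r++)
                output[j * n + (i0 + r)] = strip[r * m + j];
    }
}
```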