
fast_transpose slower than naive_transpose #62

Open
hokru opened this issue May 19, 2020 · 2 comments

Comments

@hokru
Contributor

hokru commented May 19, 2020

I've been running various DFT calculations with PSI4 inside Intel's VTune, and gg_fast_transpose popped up as a top hotspot. As a test I swapped it for gg_naive_transpose and saw a significant speedup (50% for a C60 test) for that function.

It's only 4-5% of the total CPU time for single-point calculations, so it's no real bottleneck to worry about, but I'm wondering why the blocked transpose might be so much slower.
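
For reference, the comparison described above can be reproduced with something like the sketch below: a row-major naive transpose plus a minimal timing harness. The function name, matrix size, and harness are placeholders of my own, not the actual gg_* routines from this repository.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Row-major naive transpose of an n x m matrix: out(j,i) = in(i,j).
void naive_transpose(const double* in, double* out, std::size_t n, std::size_t m) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < m; ++j)
            out[j * n + i] = in[i * m + j];
}

int main() {
    const std::size_t n = 2048, m = 2048;
    std::vector<double> a(n * m, 1.0), b(m * n, 0.0);

    auto t0 = std::chrono::steady_clock::now();
    naive_transpose(a.data(), b.data(), n, m);
    auto t1 = std::chrono::steady_clock::now();

    std::printf("naive transpose: %.3f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());
    return 0;
}
```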

@dgasmith
Owner

This is something that I noticed as well. I never quite understood why the blocked transpose is sometimes slower, but I never dug into it. The best thing to do would be to transpose the data as it comes out of L1 cache, to prevent two L1-DRAM round trips.

Sorry for the non-answer; it's an open question for me as well.
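
For reference, a textbook cache-blocked transpose looks roughly like the sketch below; the 32x32 tile size and the function name are illustrative choices of my own, not the actual blocking used by gg_fast_transpose.

```cpp
#include <cstddef>

// Tile edge: a 32x32 tile of doubles is 8 KiB, so a source tile and a
// destination tile together stay well inside a typical 32 KiB L1 data cache.
constexpr std::size_t BLOCK = 32;

// Row-major blocked transpose: each tile is read and written while it is
// still cache-resident, instead of streaming whole rows and columns.
void blocked_transpose(const double* in, double* out, std::size_t n, std::size_t m) {
    for (std::size_t ib = 0; ib < n; ib += BLOCK) {
        for (std::size_t jb = 0; jb < m; jb += BLOCK) {
            const std::size_t imax = (ib + BLOCK < n) ? ib + BLOCK : n;
            const std::size_t jmax = (jb + BLOCK < m) ? jb + BLOCK : m;
            for (std::size_t i = ib; i < imax; ++i)
                for (std::size_t j = jb; j < jmax; ++j)
                    out[j * n + i] = in[i * m + j];
        }
    }
}
```

Whether a loop like this actually beats the naive version depends on the matrix size and on how much of the working set already fits in cache.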

@hokru
Contributor Author

hokru commented May 19, 2020

Happy enough with that answer ;-). At least I am not imagining things.

I did put the functions into a simple C++ program, probably in terrible style:
https://gist.github.com/hokru/3f16adf5505f49df95ceee024f75b200

Maybe the matrices need to be much larger to get the benefits from blocking.
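
One way to check that is to sweep the matrix size and look for a crossover point. A sketch of such a sweep is below; it assumes the naive and blocked placeholder functions from the two snippets above are compiled into the same file (with this main() replacing the first snippet's), and the sizes listed are arbitrary.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    for (std::size_t n : {256, 512, 1024, 2048, 4096}) {
        std::vector<double> a(n * n, 1.0), b(n * n, 0.0);

        auto t0 = std::chrono::steady_clock::now();
        naive_transpose(a.data(), b.data(), n, n);
        auto t1 = std::chrono::steady_clock::now();
        blocked_transpose(a.data(), b.data(), n, n);
        auto t2 = std::chrono::steady_clock::now();

        std::printf("n=%5zu  naive %8.2f ms  blocked %8.2f ms\n", n,
                    std::chrono::duration<double, std::milli>(t1 - t0).count(),
                    std::chrono::duration<double, std::milli>(t2 - t1).count());
    }
    return 0;
}
```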
