[Perf] Very slow CPU memcpy performance on M1. #6771
Comments
My suspicion is that NDArrays have heavy indexing-related overhead, which drags down perf.
The real timing is ~70ms vs ~35ms. Indexing overhead is a problem, but it should not be at the 30ms level.
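To relate those timings to the bandwidth figures in this issue, here is a rough conversion sketch. The thread does not state which buffer size the ~70ms and ~35ms measurements refer to, so the 1 GB array size and the choice to count both read and written bytes are assumptions, and `bandwidth_gb_per_s` is a hypothetical helper name.

```python
def bandwidth_gb_per_s(bytes_moved: float, seconds: float) -> float:
    """Return effective bandwidth in GB/s, using 1 GB = 1e9 bytes."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return bytes_moved / seconds / 1e9


# Assumption (not stated in the thread): a 1 GB source array, with a copy
# counted as 1 GB read plus 1 GB written.
ARRAY_BYTES = 1e9
TRAFFIC_BYTES = 2 * ARRAY_BYTES

fast = bandwidth_gb_per_s(TRAFFIC_BYTES, 0.035)  # the ~35ms timing
slow = bandwidth_gb_per_s(TRAFFIC_BYTES, 0.070)  # the ~70ms timing
print(f"35 ms: {fast:.1f} GB/s, 70 ms: {slow:.1f} GB/s")
```

Under those assumptions, ~35ms works out to roughly 57 GB/s and ~70ms to roughly 29 GB/s. If only one direction of traffic is counted, both figures halve.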
Interesting finding: This code snippet could achieve the expected bandwidth (~55GB/s)
Further clang experiments:
==================================================
While working on a benchmark of Taichi vs clang 15, I found a surprising performance gap on Apple M1.
The C baseline func for clang++:
Testing against a 1GB float array, it reached ~50GB/s on my MacBook. The compilation option is -Og. No multithreading.
Taichi counterpart:
The performance is 7.6GB/s for single-thread and 22GB/s for multi-thread. Maybe there is low-hanging fruit for improving CPU performance.
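For anyone who wants to sanity-check the raw copy bandwidth of their own machine, below is a minimal timing sketch. It is neither the clang baseline nor the Taichi kernel from this issue (those snippets are not reproduced here). It uses only the Python standard library, and `measure_copy_bandwidth` is a hypothetical helper name. It reports read plus write traffic, which may differ from how the figures above were counted.

```python
import time


def measure_copy_bandwidth(num_bytes: int, repeats: int = 5) -> float:
    """Return the best observed copy bandwidth in GB/s (1 GB = 1e9 bytes).

    Traffic is counted as bytes read plus bytes written.
    """
    if num_bytes <= 0 or repeats <= 0:
        raise ValueError("num_bytes and repeats must be positive")

    src = bytearray(num_bytes)
    dst = bytearray(num_bytes)
    src_view = memoryview(src)
    dst_view = memoryview(dst)

    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # Bulk byte copy between two buffers; no per-element Python loop.
        dst_view[:] = src_view
        elapsed = time.perf_counter() - start
        # Keep the fastest run; the first pass also pays for page faults.
        best = min(best, elapsed)

    best = max(best, 1e-9)  # guard against a zero reading from the timer
    return 2 * num_bytes / best / 1e9


if __name__ == "__main__":
    # 256 MiB keeps memory use modest; raise it toward 1 GB to mirror the issue.
    print(f"{measure_copy_bandwidth(256 * 1024 * 1024):.1f} GB/s")
```

Taking the best of several runs filters out first-touch page faults and scheduler noise, which matters when comparing against a single-threaded native baseline.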