CML matrix multiplication is terribly inefficient #20

egorodet · 2020-01-06T10:47:39Z

CML matrix multiplication is based on naive algorithm which does not take into account any cache memory optimizations. As the result matrix multiplications becomes bottleneck in any real-world applications using CML. It would be nice to implement at least some memory optimizations (https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0). Also it would be great to use SIMD intrinsics (https://codereview.stackexchange.com/questions/101144/simd-matrix-multiplication) if not too hard to implement in template code.

demianmnave · 2020-07-08T02:03:08Z

Sincere apologies for not responding to this. For some reason, I do not receive new issue notifications.

Have you observed slowdown in a real application? If so, by how much and for what size matrices?

In general, CML is targeted at small, often-changing matrices. Please correct me if I am wrong, but as I recall, cache-conscious matrix product won't make much difference unless the matrices are larger than the L1 and/or L2 cache. Consequently, I'd imagine that one of the cache-optimized BLAS implementations would be more appropriate outside of typical CML use cases.

Similarly for SIMD, if you've observed a slowdown in a real application, I'd be very interested. I haven't verified recently, but for small, fixed-size vectors and matrices, loops should be compiler-unrolled automatically. Moreover, on Intel x64, SSE is required for floating-point arithmetic, and later Visual Studio releases should auto-parallelize the loops (this may require a special command-line argument, I don't recall at the moment). I can verify this and make sure I'm not doing anything to get in the compiler's way.

Finally, I'm very much interested in these sorts of optimizations, so if you have performance data and/or code that supports your request, that would be fantastic and most welcome. (Of course, pull requests with supporting performance tests would also be welcome.)

Thank you! Demian

egorodet · 2020-07-09T18:16:32Z

Hello Demian,

I've found this performance issue during optimization of the graphics sample demonstrating parallel rendering, which is a part of Methane Kit project. Methane Kit is a modern graphics API abstraction library and it is using CML as a base vector math library.

Asteroids sample demonstrates parallel rendering up to 50000 asteroids in realtime with animation updates on each frame. It was found that sample is CPU-bound because of animation updates are too slow even with multi-threaded execution. AsteroidsArray::Update function updates parameters of 50000 asteroids on each frame, including calculation of model and model-view-projection matrices, which are produced as the result of 4 multiplications of matrix44f_r for each asteroid. This results in 200000 matrix multiplications per AsteroidsArray::Update call on each frame executed in parallel 16 threads on my Intel Core i7-7820X in 16.2 ms, which is obviously to much for 60 FPS frame budget, if you want to do some render comands encoding besides calculating animations (sample was running at 40 FPS). Program was compiled on Windows 10 1909 with VS2019 MSVC 14.26 using standard CMake Release configuration flags.

To fix this I've implemented SSE-optimized template specialization of the matrix product for matrix44f_r - see modified version of cml/matrix/matrix_product.tpp. With optimized implementation in place the same AsteroidsArray::Update now takes 4.3 ms, having almost 4x performance boost on this function call. As the result sample is now running at 85 FPS with 50000 asteroids. I believe that it is possible to accelerate vector/matrix calculations on CPU even further in case of using fully SEE-optimized math library like hlslpp, but it lacks some functionality available in CML, so it is still on hold.

I think that CML is great C++ vector math library with extensive functionality, but it lacks crtically needed SSE optimizations to be used in high-performance and realtime applications. Hope that you could add these optimizations in future!

With Best Regards,
Evgeny.

demianmnave · 2020-07-10T00:12:01Z

Evgeny, thank you very much for the details. I will definitely look into it.

Just out of curiosity, was the slowdown in the 32-bit or 64-bit build of the Asteroids sample (or both)?

Thanks again!
Demian

demianmnave self-assigned this Jul 8, 2020

demianmnave added the info needed label Jul 8, 2020

demianmnave added enhancement and removed info needed labels Jul 10, 2020

demianmnave added this to the v2.2.0 milestone Jul 10, 2020

egorodet mentioned this issue Dec 23, 2020

Replace math library with vector instructions implementation MethanePowered/MethaneKit#78

Closed

demianmnave removed this from the v2.2.0 milestone May 14, 2021

demianmnave linked a pull request Dec 1, 2024 that will close this issue

Implement SIMD matrix multiplications #47

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CML matrix multiplication is terribly inefficient #20

CML matrix multiplication is terribly inefficient #20

egorodet commented Jan 6, 2020 •

edited

Loading

demianmnave commented Jul 8, 2020

egorodet commented Jul 9, 2020

demianmnave commented Jul 10, 2020

CML matrix multiplication is terribly inefficient #20

CML matrix multiplication is terribly inefficient #20

Comments

egorodet commented Jan 6, 2020 • edited Loading

demianmnave commented Jul 8, 2020

egorodet commented Jul 9, 2020

demianmnave commented Jul 10, 2020

egorodet commented Jan 6, 2020 •

edited

Loading