Optimized the performance of float object #218

daliziql · 2024-06-21T09:33:35Z

Hello,
Thank you for taking the time to review my pull request. Below is a brief overview of the changes and enhancements I've made. Please let me know if there are any questions or further clarifications needed.

Two PRs will be submitted in total; this is the first one.

PR1

This PR mainly focuses on optimizations for float.

In the initial tests, a peculiar result was observed: when testing Matrix on Android, the execution time for float was longer than that for double, which is counterintuitive. Therefore, an analysis was conducted on this part. The first step was to compare the instruction counts of the two test programs, revealing that the instruction count for the float test program was 712,287,424, while that for the double test program was 664,675,474. RTM uses pure C methods to implement double, yet surprisingly, the instruction count for float NEON was even higher than that for double, leading to further analysis.
By decompiling the instruction code of the double test code, it was found that the compiler, after optimization, inserted a large number of NEON intrinsics, significantly accelerating performance. The reasons for this optimization include:

Using 16-byte alignment
Extensive use of RTM_FORCE_INLINE for forced inline expansion

As a result, double performed much better than expected, but this only indicates that the compiler's optimization for double code is more aggressive, not that double is inherently faster than float. There must be areas in the float implementation that are more performance-costly, hence the following two optimizations were made:

1. Changing matrix parameter passing from value to reference
From the decompiled code of double, it can be seen that the compiler eventually inlines the function, expanding most of the code into a single function call. This disrupts the expected function call stack distribution, rendering RTM's designed argument transmission strategy ineffective. For most function parameters, the value copy method (argx) is used for passing, which inadvertently increases many copy operations. Under the Android ARM64 architecture, the definition of types is as follows:

using matrix3x3f_arg0 = const matrix3x3f;
using matrix3x3f_arg1 = const matrix3x3f;
using matrix3x3f_argn = const matrix3x3f&;

using matrix3x3d_arg0 = const matrix3x3d;
using matrix3x3d_arg1 = const matrix3x3d&;
using matrix3x3d_argn = const matrix3x3d&;

using matrix3x4f_arg0 = const matrix3x4f;
using matrix3x4f_arg1 = const matrix3x4f;
using matrix3x4f_argn = const matrix3x4f&;

using matrix3x4d_arg0 = const matrix3x4d;
using matrix3x4d_arg1 = const matrix3x4d&;
using matrix3x4d_argn = const matrix3x4d&;

using matrix4x4f_arg0 = const matrix4x4f;
using matrix4x4f_arg1 = const matrix4x4f;
using matrix4x4f_argn = const matrix4x4f&;

using matrix4x4d_arg0 = const matrix4x4d;
using matrix4x4d_arg1 = const matrix4x4d&;
using matrix4x4d_argn = const matrix4x4d&;

The settings for passing values by reference differ between float and double., which is one of the reasons for the slower speed of float.
By changing the matrix type parameter to reference passing, the test speed under float showed a significant improvement.

2. Modification of the vector_mix function
Compared to the conventional shuffle() implementation, RTM's vector_mix() is relatively special, allowing selection at any element position between two vectors, while conventional shuffle() implementations usually have the first two elements from the first vector and the last two from the second vector. This makes RTM's vector_mix() difficult to implement with simple instructions. However, we eventually made some optimizations based on compile-time behavior. The float version of vector_mix() can use __builtin_shufflevector() when compiled with the clang compiler, achieving maximum performance. For other platforms, we try to rely on compile-time behavior for acceleration.

template<mix4 comp0, mix4 comp1, mix4 comp2, mix4 comp3>
vector4f RTM_SIMD_CALL vector_mix(vector4f_arg0 input0, vector4f_arg1 input1) RTM_NO_EXCEPT
{
        constexpr int index0 = (int)comp0;
        constexpr int index1 = (int)comp1;
        constexpr int index2 = (int)comp2;
        constexpr int index3 = (int)comp3;
#if defined(__clang__)
        return __builtin_shufflevector(input0, input1, index0, index1, index2, index3);
#else
        if constexpr (index0 < 4 && index1 < 4 && index2 >= 4 && index3 >= 4) {
                return vector_shuffle(input0, input1, index0, index1, index2 - 4, index3 - 4);
        }
        else if constexpr(index0 < 4 && index1 < 4 && index2 < 4 && index3 < 4) {
                //no input1 use here
                return vector_swizzle(input0, index0, index1, index2, index3);
        }
        else if constexpr(index0 >=4 && index1 >=4 && index2 >=4 && index3 >=4) {
                //no input0 use here
                return vector_swizzle(input1, index0 - 4, index1 - 4, index2 - 4, index3 -4);
        }else {

                float combine_arr[8];
                vector_store(input0, combine_arr);
                vector_store(input1, combine_arr + 4);
                return vector_set(combine_arr[index0], combine_arr[index1], combine_arr[index2], combine_arr[index3]);
        }
#endif
}

CLAassistant · 2024-06-21T09:33:40Z

All committers have signed the CLA.

nfrechette · 2024-06-23T15:33:29Z

Hello and thank you for the contribution!
I apologize for the late reply, I am just coming back from a trip abroad.

These changes to argument passing are quite subtle and sensitive. I'll have to double check things on my end and compare the generated assembly, etc. As a result, it will take me some time to review things. I anticipate that I'll have time to look into this in early July. I'll get back to you then.

In the meantime, I just wanted to give some general context. Your analysis makes a lot of sense, but there's a few things at play that are worth considering.

Float32 arithmetic on ARM uses NEON SIMD registers. This allows us to pass vector/quat/mask values by value in register and return them by register as well. For aggregate types (e.g. qvv, matrix), things are a bit more complicated. For clang, a few aggregate types (depending on size/internals) can be passed by value in register BUT aggregate values are not returned by register (unlike with __vectorcall with MSVC). When functions inline, this distinction doesn't really matter but when they don't it comes into play as it forces round-trips to stack memory (also called a load-hit-store). Typically, moderns processors handle this case quite well through store-forwarding but a few extra cycles on the load are still incurred. As a result of this, code that uses float32 ends up being quite dense with many instructions dependent on one another which can introduce bubbles in the execution and extra latency from store-forwarding further increases latency.

In contrast, float64 uses scalar math on ARM (for the time being, it is on my roadmap to use SIMD registers for XY and ZW in pairs like we do with SSE). Using scalar math causes the generated assembly to be much larger as many more instructions are required. This has an adverse effect on inlining as large functions don't inline as well. However, despite the large number of instructions, most of them can execute independently as SIMD lanes are often independent. This means that with float64, there are far fewer bubbles in the execution stream and there is far more work to execute. As a result, with modern out-of-order CPUs, they can be kept well fed with few to no stalls in execution. And so, even if each instruction is more expensive, the gap in execution cost between float32 and float64 might not be as large as one might expect in practice. Note that using XY and ZW in pairs will help reduce the assembly size, improving inlining and performance but because both pairs are often independent, the rest of the analysis remains consistent.

In the end, whether a function inlines or not is often the biggest performance impact at play and matrix math often uses many registers and many instructions, hindering inlining. Crucially, whether a function inlines or not is also determined by where it is called and so the measurements depend heavily on the sort of code that you have. Are you at liberty to share what the calling code looks like and which RTM functions are involved in your measurements or did you do broad measurements over a large and complex piece?

Cheers,
Nicholas

daliziql · 2024-06-24T07:38:33Z

Hi Nicholas,
Thank you for your reply.
The main content of the test involves matrix composition, transformation, and inversion operations. Below is the general framework of the test code:

//----------------------------------------------------------------------------------------
// matrix compose and transform
//----------------------------------------------------------------------------------------
template<typename FloatType, typename CalcPolicy>
static void DoMatrixComposeImpl(benchmark::State& state) {
    using Vector4Array = std::vector<TSimdVector<FloatType>>;
    using QuaternionArray = std::vector<TQuaternion<FloatType>>;
    
    Vector4Array   translationArray;
    Vector4Array   scaleArray;
    QuaternionArray    quatArray;
    Vector4Array   orignalArray;
    Vector4Array    resultArray;

    ...

    for (int i = 0; i < kMathCalcCount; i++) {
       translationArray[i] = TSimdVector<FloatType>(MathTool::rangeRandom(0.0, 1000.0), MathTool::rangeRandom(0.0, 1000.0), MathTool::rangeRandom(0.0, 1000.0), 1.0f);
       scaleArray[i] = TSimdVector<FloatType>(1.0f, 1.0f, 1.0f, 1.0f);
       quatArray[i] = TQuaternion<FloatType>::fromAxisAngle(TVector3<FloatType>::YAxisVector, ScalarTool::degreesToRadians(MathTool::rangeRandom(0.0, 90.0)));
       orignalArray[i] = TSimdVector<FloatType>(0.0, 0.0, 0.0, 1.0);
    }

    for (auto&& _ : state) {
       for (int i = 0; i < kMathCalcCount; i++) {
           TMatrix4<FloatType> tMat = TMatrix4<FloatType>::template _simd_makeTransform<CalcPolicy>(translationArray[i], TSimdVector<FloatType>(1.0, 1.0, 1.0, 1.0), TQuaternion<FloatType>::Identity);
           TMatrix4<FloatType> sMat = TMatrix4<FloatType>::template _simd_makeTransform<CalcPolicy>(TSimdVector<FloatType>(0, 0, 0, 1), scaleArray[i], TQuaternion<FloatType>::Identity);
           TMatrix4<FloatType> rMat = TMatrix4<FloatType>::template _simd_makeTransform<CalcPolicy>(TSimdVector<FloatType>(0, 0, 0, 1), TSimdVector<FloatType>(1.0, 1.0, 1.0, 1.0), quatArray[i]);

           TMatrix4<FloatType> matrix = tMat.template _simd_multiOther<CalcPolicy>(rMat).template _simd_multiOther<CalcPolicy>(sMat);

           resultArray[i] = matrix.template _simd_transformVector4<CalcPolicy>(orignalArray[i]);
       }
    }
}

In the _simd_xxxx functions within the TMatrix4 class, all matrix operations are implemented internally within our project. Some more specific functions are initially implemented as follows:

RTM_DISABLE_SECURITY_COOKIE_CHECK inline void RTM_SIMD_CALL matrix_mul_fill_mode(
    matrix3x3f_arg0 lhs, matrix3x3f_arg1 rhs) RTM_NO_EXCEPT {
    matrix3x3f_arg0 out_m{};
    
    vector4f tmp = vector_mul(vector_dup_x(lhs.x_axis), rhs.x_axis);
    tmp = vector_mul_add(vector_dup_y(lhs.x_axis), rhs.y_axis, tmp);
    tmp = vector_mul_add(vector_dup_z(lhs.x_axis), rhs.z_axis, tmp);
    out_m.x_axis = tmp;

    tmp = vector_mul(vector_dup_x(lhs.y_axis), rhs.x_axis);
    tmp = vector_mul_add(vector_dup_y(lhs.y_axis), rhs.y_axis, tmp);
    tmp = vector_mul_add(vector_dup_z(lhs.y_axis), rhs.z_axis, tmp);
    out_m.y_axis = tmp;

    tmp = vector_mul(vector_dup_x(lhs.z_axis), rhs.x_axis);
    tmp = vector_mul_add(vector_dup_y(lhs.z_axis), rhs.y_axis, tmp);
    tmp = vector_mul_add(vector_dup_z(lhs.z_axis), rhs.z_axis, tmp);
    
    return out_m;
}
auto r = simd::matrix_mul_fill_mode(rhs, lhs);

Here, we encounter the issue of parameter copy passing and the return value problem you mentioned earlier. We have since made changes to such function calls:

RTM_DISABLE_SECURITY_COOKIE_CHECK inline void RTM_SIMD_CALL matrix_mul_fill_mode(
    matrix3x3d_argn lhs, matrix3x3d_argn rhs, matrix3x3d &out_m) RTM_NO_EXCEPT {
    vector4d tmp = vector_mul(vector_dup_x(lhs.x_axis), rhs.x_axis);
    tmp = vector_mul_add(vector_dup_y(lhs.x_axis), rhs.y_axis, tmp);
    tmp = vector_mul_add(vector_dup_z(lhs.x_axis), rhs.z_axis, tmp);
    out_m.x_axis = tmp;

    tmp = vector_mul(vector_dup_x(lhs.y_axis), rhs.x_axis);
    tmp = vector_mul_add(vector_dup_y(lhs.y_axis), rhs.y_axis, tmp);
    tmp = vector_mul_add(vector_dup_z(lhs.y_axis), rhs.z_axis, tmp);
    out_m.y_axis = tmp;

    tmp = vector_mul(vector_dup_x(lhs.z_axis), rhs.x_axis);
    tmp = vector_mul_add(vector_dup_y(lhs.z_axis), rhs.y_axis, tmp);
    tmp = vector_mul_add(vector_dup_z(lhs.z_axis), rhs.z_axis, tmp);
    out_m.z_axis = tmp;
}

TMatrix4<T> r{};
simd::matrix_mul_fill_mode(rhs.simdRef(), lhs.simdRef(), r.simdRef());

The main change was to modify the parameter passing of the matrix to be by reference. The performance after these modifications has already shown significant improvement.
Additionally, I would like to mention that the performance issues discussed here are occurring on ARM64 Android devices. The performance on Windows and Mac aligns with your expectations.

daliziql · 2024-06-24T07:44:46Z

One more thing is that our project's C++ version is quite high. This PR did not handle compatibility with C++11 well, so I need to make some adjustments.

nfrechette · 2024-06-30T15:51:37Z

Thank you for the clarification. I will see if I can add a benchmark test based on your sample and see if I can reproduce locally.

What kind of processors/android device are you seeing this on?

I'll take a look at this in the next 2 weeks.

daliziql · 2024-07-01T02:13:18Z

The processor is snapdragon-xr2-gen2

daliziql · 2024-07-01T03:30:20Z

I encountered an issue with unit tests. The configurations build pull request / vs2022 (vs2022-clang, release, x64, -simd) and build pull request / vs2022 (vs2022-clang, release, x64, -avx) are indicating that some unit tests are failing. However, when I compile locally with the same CMake options, all tests pass. Do you have any additional information you can provide?

nfrechette · 2024-07-04T15:15:26Z

Yes, those failures are probably due to a known compiler/toolchain issue, see this PR for details: #212

I wouldn't worry about it for now. I'm waiting for github to update the image with a newer VS version that has a fixed clang version. Sadly, for reasons unknown, RTM ends up triggering a LOT of compiler bugs in various toolchains. Over the years, I've found dozens of bugs (and reported many) in msvc, gcc, and clang. Thankfully, it has gotten better over the years.

nfrechette

Lots of good stuff in here:

I like the idea of moving the vector mix details into its own header
I like the idea of using template specialization, I've ran into a lot of codegen issues with the existing function due to relying on constexpr branches
I like the idea of using std::enable_if to validate and branch variants

Just needs a bit of cleaning up and minor tweaks to bring back the missing AVX/NEON specializations for vector mix, see notes.

I'll profile the matrix argument passing change in the coming days and get back to you.

nfrechette · 2024-07-04T15:36:52Z