[mono] Basic SIMD support for System.Numerics.Vector2 on arm64 #91659

Merged on Sep 21, 2023 (17 commits)

matouskozak
Member

@matouskozak matouskozak commented Sep 6, 2023

This re-created PR adds basic SIMD support for System.Numerics.Vector2 on arm64, matching the current support for System.Numerics.Vector4. It also renames the vector2_methods table to vector_2_3_4_methods to better reflect its usage.

Current SIMD support for Vector2 with mini/llvm:

  • SN_ctor
  • SN_Abs
  • SN_Add
  • SN_Clamp
  • SN_Divide (the Vector2 / float scenario is currently disabled and will be enabled in the next PR)
  • SN_Dot
  • SN_Max
  • SN_Min
  • SN_Multiply (same restriction as SN_Divide)
  • SN_Negate
  • SN_SquareRoot
  • SN_Subtract
  • SN_get_Item
  • SN_get_One
  • SN_get_UnitX
  • SN_get_UnitY
  • SN_get_Zero
  • SN_op_Addition
  • SN_op_Division
  • SN_op_Equality
  • SN_op_Inequality
  • SN_op_Multiply
  • SN_op_Subtraction
  • SN_op_UnaryNegation
  • SN_set_Item

Future work on the missing intrinsics is tracked in #91394.
Contributes to: #73462


P.S. These getters currently use the 128-bit code path for emitting constant values (emit_xconst_v128), even though Vector2 is a 64-bit vector:

  • SN_get_UnitX
  • SN_get_UnitY
  • SN_get_One

Comment from @jandupej on the original PR:
You can use a fmov to flood the lower two floats with 1.0f. This gives you the fastest SN_get_One possible (there is a 64-bit variant of this, with q=0). To make SN_get_UnitX/Y you can shift the vector left or right as doubles by 32. Zeros are shifted in, so this will give you a (0.0f, 1.0f) or reverse. This will destroy the upper 64 bits of the register, but it shouldn't be a problem as only the lower 64 bits are of importance.
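
The shift trick in the quoted comment can be modeled in portable C. This is only a sketch of the bit manipulation, not the Mono code generator: the 64-bit SIMD register is modeled as a uint64_t with lane 0 (X) in the low 32 bits, and every name here (vec2_model, pack2, unpack2, model_one, model_unit_x, model_unit_y) is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a Vector2 value: two 32-bit float lanes. */
typedef struct { float x, y; } vec2_model;

/* Reinterpret a float as its raw 32-bit pattern. */
static uint32_t float_to_bits (float f)
{
	uint32_t u;
	assert (sizeof u == sizeof f);
	memcpy (&u, &f, sizeof u);
	return u;
}

/* Reinterpret a raw 32-bit pattern as a float. */
static float bits_to_float (uint32_t u)
{
	float f;
	memcpy (&f, &u, sizeof f);
	return f;
}

/* Pack two floats into the modeled register: X in bits 0..31, Y in bits 32..63. */
static uint64_t pack2 (float x, float y)
{
	return (uint64_t)float_to_bits (x) | ((uint64_t)float_to_bits (y) << 32);
}

/* Split the modeled register back into its two lanes. */
static vec2_model unpack2 (uint64_t reg)
{
	vec2_model v;
	v.x = bits_to_float ((uint32_t)(reg & 0xffffffffu));
	v.y = bits_to_float ((uint32_t)(reg >> 32));
	return v;
}

/* Models flooding both lanes with 1.0f. */
static uint64_t model_one (void) { return pack2 (1.0f, 1.0f); }

/* Shifting right by 32 moves zeros into the upper lane: (1.0f, 0.0f). */
static uint64_t model_unit_x (void) { return model_one () >> 32; }

/* Shifting left by 32 moves zeros into the lower lane: (0.0f, 1.0f). */
static uint64_t model_unit_y (void) { return model_one () << 32; }
```

Because the all-zero bit pattern is 0.0f, the shifted-in zeros become a zero lane without any extra instruction, which is the whole point of the suggestion.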

@matouskozak
Member Author

/azp run runtime-extra-platforms

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

src/mono/mono/mini/mini-arm64.c (two outdated review threads, resolved)
@matouskozak
Member Author

Perf_Vector2 microbenchmarks on osx arm64 JIT-mini:

| Benchmark | before | after | speed-up |
|---|---|---|---|
| CreateFromScalar | 1.95 | 1.25 | 36% |
| OneBenchmark | 1.96 | 0.88 | 55% |
| UnitXBenchmark | 1.92 | 0.86 | 55% |
| UnitYBenchmark | 1.91 | 0.89 | 54% |
| ZeroBenchmark | 1.19 | 3.60 | -202% |
| AddOperatorBenchmark | 4.03 | 0.80 | 80% |
| DivideByVector2OperatorBenchmark | 4.26 | 0.85 | 80% |
| DivideByScalarOperatorBenchmark | 6.124 | 2.3218 | 62% |
| EqualityOperatorBenchmark | 1.411 | 0.0359 | 97% |
| InequalityOperatorBenchmark | 2.202 | 0.0221 | 99% |
| MultiplyOperatorBenchmark | 4.081 | 0.7258 | 82% |
| MultiplyByScalarOperatorBenchmark | 5.829 | 1.8867 | 68% |
| SubtractOperatorBenchmark | 4.058 | 0.7137 | 82% |
| NegateOperatorBenchmark | 4.143 | 0.7393 | 82% |
| AbsBenchmark | 19.989 | 0.6328 | 97% |
| AddFunctionBenchmark | 4.055 | 0.6135 | 85% |
| ClampBenchmark | 15.859 | 0.8157 | 95% |
| DivideByVector2Benchmark | 4.273 | 0.7721 | 82% |
| DivideByScalarBenchmark | 6.453 | 2.4231 | 62% |
| DotBenchmark | 2.085 | 0.1414 | 93% |
| MaxBenchmark | 6.354 | 0.7172 | 89% |
| MinBenchmark | 6.214 | 0.6811 | 89% |
| MultiplyFunctionBenchmark | 4.261 | 0.8311 | 80% |
| NegateBenchmark | 4.287 | 0.8247 | 81% |
| SquareRootBenchmark | 10.355 | 0.7057 | 93% |
| SubtractFunctionBenchmark | 4.239 | 0.7174 | 83% |
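
The speed-up column appears consistent with the relative change (before - after) / before, expressed in percent, although the thread does not state the formula and the displayed timings are rounded, so recomputed values can differ by about a point. A minimal C sketch under that assumption (speedup_percent is a hypothetical helper, not part of the benchmark suite):

```c
#include <assert.h>

/*
 * Relative speed-up in percent, assuming the formula
 * (before - after) / before * 100. A positive result means the
 * "after" run is faster; a negative result means a regression.
 */
static double speedup_percent (double before, double after)
{
	assert (before > 0.0);
	return (before - after) / before * 100.0;
}
```

With the AddOperatorBenchmark figures (4.03 and 0.80) this gives roughly 80, and with the ZeroBenchmark figures (1.19 and 3.60) roughly -202.5, which lines up with the reported regression.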

Vector2.Zero reports a 202% regression even though the emitted code looks correct:

0000000000000000        stp     x29, x30, [sp, #-0x50]!
0000000000000004        mov     x29, sp
0000000000000008        eor.8b  v0, v0, v0
000000000000000c        str     d0, [x29, #0x10]
0000000000000010        ldr     s0, [x29, #0x10]
0000000000000014        ldr     s1, [x29, #0x14]
0000000000000018        mov     sp, x29
000000000000001c        ldp     x29, x30, [sp], #0x50
0000000000000020        ret

@jandupej
Member

jandupej commented Sep 7, 2023

> Perf_Vector2 microbenchmarks on osx arm64 JIT-mini:

These are some impressive speedups, nice!

As for Vector2.Zero, what you did is correct. However, our register allocator likes to spill and reload every value you create, especially in FP/SIMD (see the instructions at 0x0c and 0x10). If you load a constant instead, maybe it will forgo the spill and only load from memory (?). Still, I'd keep what you did. If there are future improvements to constant folding or the register allocator, this will likely go away.
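
To make the spill and reload concrete, the Vector2.Zero listing can be mirrored step by step in C. This is a hedged model of the disassembly, not runtime code: vec2_ret and model_vector2_zero are hypothetical names, the byte array stands in for the stack slot at [x29, #0x10], and reading the two final loads as filling the return registers s0 and s1 is an interpretation that the thread does not spell out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a Vector2 returned as two floats. */
typedef struct { float x, y; } vec2_ret;

static vec2_ret model_vector2_zero (void)
{
	uint64_t d0 = 0;                 /* eor.8b v0, v0, v0: zero the 64-bit register */
	unsigned char slot [8];          /* models the stack slot at [x29, #0x10] */
	vec2_ret ret;

	assert (sizeof d0 == sizeof slot);
	memcpy (slot, &d0, sizeof d0);   /* str d0, [x29, #0x10]: spill to the stack */
	memcpy (&ret.x, slot, 4);        /* ldr s0, [x29, #0x10]: reload first float */
	memcpy (&ret.y, slot + 4, 4);    /* ldr s1, [x29, #0x14]: reload second float */
	return ret;
}
```

The result is correct, but the value takes a round trip through memory that a scalar path producing the two zeros directly would avoid, which may explain the slower ZeroBenchmark.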

@matouskozak
Member Author

/azp run runtime-extra-platforms

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@SamMonoRT
Member

@LoopedBard3 - if the aot-llvm arm64 local testing script is ready, please add a link to the documentation. @matouskozak, you should also try to get numbers for aot-llvm arm64 via that script, if possible.

@matouskozak
Member Author

The test failures are already tracked and are unrelated to this PR.

@@ -3966,7 +3969,10 @@ mono_arch_output_basic_block (MonoCompile *cfg, MonoBasicBlock *bb)
 		case OP_EXPAND_R4:
 		case OP_EXPAND_R8: {
 			const int t = get_type_size_macro (ins->inst_c1);
-			arm_neon_fdup_e (code, VREG_FULL, t, dreg, sreg1, 0);
+			if (ins->opcode == OP_EXPAND_R8)
+				arm_neon_fdup_e (code, VREG_FULL, t, dreg, sreg1, 0);

OP_EXPAND_R8 can be simplified to a mov dreg, sreg1 or nothing if dreg == sreg1.
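
One way to read this suggestion, assuming it refers to a 64-bit vector (which has room for exactly one double lane), is that broadcasting a double into a single lane is just a copy. The sketch below models that in plain C; model_expand_r8 is a hypothetical helper and not the Mono opcode implementation.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Broadcast a scalar double into every lane of a modeled vector.
 * With lanes == 1 (the assumed 64-bit vector case) the loop reduces
 * to a single assignment, i.e. a plain register move.
 */
static void model_expand_r8 (double *dst, size_t lanes, double src)
{
	assert (lanes >= 1);
	for (size_t i = 0; i < lanes; i++)
		dst [i] = src;
}
```

In the one-lane case, if source and destination were the same register, no instruction would be needed at all, which matches the "or nothing if dreg == sreg1" remark.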

src/mono/mono/mini/mini-arm64.c (review thread resolved)
Member

@fanyang-mono fanyang-mono left a comment

LGTM!

@matouskozak matouskozak merged commit 09e796a into dotnet:main Sep 21, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Oct 21, 2023
@matouskozak matouskozak deleted the arm64-vector2-intrinsics branch October 3, 2024 13:15
4 participants