-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT ARM64-SVE: Add Sve.Abs() and Sve.Add() #100134
Conversation
Note regarding the
|
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
From Sve_Add_int.cs: public void RunBasicScenario_UnsafeRead()
{
TestLibrary.TestFramework.BeginScenario(nameof(RunBasicScenario_UnsafeRead));
var result = Sve.Add(
Unsafe.Read<Vector<Int32>>(_dataTable.inArray1Ptr),
Unsafe.Read<Vector<Int32>>(_dataTable.inArray2Ptr)
);
Unsafe.Write(_dataTable.outArrayPtr, result);
ValidateResult(_dataTable.inArray1Ptr, _dataTable.inArray2Ptr, _dataTable.outArrayPtr);
} Disassembly with note the use of
|
Without
|
Change-Id: Ie8cfe828595da9a87adbc0857c0c44c0ce12f5b2
@kunalspathak @tannergooding - FYI. This is almost ready (bar the incorrect result). |
Fixed the issues - the scaling was wrong in emitIns_R_S/emitIns_S_R. However, this does make me concerned we're not consistent between general purpose instructions and SVE. This is ready for review now. cc @dotnet/arm64-contrib |
The fixed version of
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good. I need to understand your comment about emitarm64.cpp
a bit more. Will check offline.
if (targetReg != op2Reg) | ||
{ | ||
assert(targetReg != op3Reg); | ||
GetEmitter()->emitIns_Mov(INS_mov, emitTypeSize(node), targetReg, op2Reg, /* canSkip */ true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if it is needed here, but we should start document the rules of using movprfx
instruction. Do you have a good documentation that explains it and where it is used? I remember seeing it long back but not sure what it was.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the same general mov
needed for any RMW
instruction since you need the destination register and one of the input registers to be the same. The only really different thing is that movprfx
is specifically the one you want for SVE instructions that have an embedded mask. -- The potentially confusing part here is that it's checking IsEmbeddedMaskedOperation
when that check as currently implemented just means "I have a mask, but its all true" (I gave feedback that should be renamed above)
So I'm not sure there's really any rules to document here, we certainly don't document it at all for x86 or x64 or any of the existing Arm64 RMW cases.
I've redone the those changes. Just needed to make sure we scale by the right size (the size of the vector or predicate). |
So if I see https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-100134-merge-c2bf8feefece4940bf/HardwareIntrinsics_Arm_r/1/console.7afb876d.log?helixlogtype=result , these tests seems to run on non-sve hardware, which might be something that we don't want currently, specially because 1) we do not have the SVE hardware in lab, 2) We are testing the same thing for all the tests which is that it throws In long term, we should find a way to include/exclude arm64 tests based on if they are running on sve vs. non-sve hardware, something that I have asked @TIHan to investigate. |
Spoke to @tannergooding and for now, we should continue running the sve tests on non-sve hardware to test |
src/coreclr/jit/emitarm64.cpp
Outdated
@@ -9836,7 +9836,7 @@ void emitter::emitIns_R_S(instruction ins, emitAttr attr, regNumber reg1, int va | |||
fmt = IF_SVE_IE_2A; | |||
|
|||
// TODO-SVE: Don't assume 128bit vectors | |||
scale = NaturalScale_helper(EA_16BYTE); | |||
scale = 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wondering why is this change? It will return 4
here, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wondering why is this change? It will return 4 here, right?
Agreed. The changes for the mask load/stores were the required ones.
Reverted.
ping @tannergooding |
@@ -7875,7 +7875,7 @@ void emitter::emitIns_R_S(instruction ins, emitAttr attr, regNumber reg1, int va | |||
|
|||
// TODO-SVE: Don't assume 128bit vectors | |||
// Predicate size is vector length / 8 | |||
scale = NaturalScale_helper(EA_2BYTE); | |||
scale = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the todo above, are we potentially missing an assert which would ensure this is updated for future instructions that need it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the todo above, are we potentially missing an assert which would ensure this is updated for future instructions that need it?
This is within code just for predicates. So will always be valid as long as the vector length is 128bits.
There is a question of what testing to do for >128bit vectors for .Net9. I suspect a lot of assumptions are made elsewhere that vector length is 128bits, and will require some major debugging. At some point I can do some testing on larger vector length machines. Due to time constraints, maybe the solution for .Net9 is have a check on startup: if vector length is >128bits then ask kernel to reduce to 128bits? Or just disable on >128 bits machines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That question probably needs a much broader discussion.
There's notably not a lot of benefit of 128-bit SVE over AdvSimd. There's a few new instructions and the ability to emit denser assembly in a few cases, but most of that isn't for the typical hot loop of a method and in some case the code can be less dense (emitting SVE Abs
requires predication and a ptrue
to be generated; while AdvSimd Abs
does not, so given 128-bit vectors and no predication; AdvSimd is better to use for that instruction).
At the same time, there is hardware (AWS Graviton) that has 256-bit SVE support that will most likely run on .NET 9. So it would probably be best if we could ensure it is appropriately handled and we're best able to take advantage of such hardware, not artificially limit it.
break; | ||
|
||
case 3: | ||
op3 = getArgForHWIntrinsic(sigReader.GetOp3Type(), sigReader.op3ClsHnd); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there no case where a range check is needed for 3 arguments?
Should there be an assert to validate that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there no case where a range check is needed for 3 arguments?
That seems to be the case.
The case for 3 args is checked further down and depending on the intrinsics, does the addRangeCheckIfNeeded
on appropriate arg. Not sure if we should still add assert.
default: | ||
break; | ||
} | ||
op1 = gtNewSimdEmbeddedMaskNode(simdBaseJitType, simdSize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this need to be done here?
This seems like its just inserting the implicit AllTrue
mask that some instructions require, which is effectively allocating and forcing an extra node to be carried through all of HIR when the high level operation doesn't actually care about it.
Seemingly this could just be inserted as part of lowering instead so that it only impacts LSRA and codegen?
break; | ||
|
||
case 3: | ||
GetEmitter()->emitIns_R_R_R(ins, emitSize, targetReg, op2Reg, op3Reg, opt); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need an assert that none of the input registers are mask registers?
I'd guess the only RMW instructions with masks are ones that have at least 4 operands, so we shouldn't ever see that here.
@@ -17,6 +17,10 @@ | |||
// SVE Intrinsics | |||
|
|||
// Sve | |||
HARDWARE_INTRINSIC(Sve, Abs, -1, -1, false, {INS_sve_abs, INS_invalid, INS_sve_abs, INS_invalid, INS_sve_abs, INS_invalid, INS_sve_abs, INS_invalid, INS_sve_fabs, INS_sve_fabs}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_EmbeddedMaskedOperation) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the Scalable
flag? Can that not just be detected by the InstructionSet
being Sve
?
@@ -17,6 +17,10 @@ | |||
// SVE Intrinsics | |||
|
|||
// Sve | |||
HARDWARE_INTRINSIC(Sve, Abs, -1, -1, false, {INS_sve_abs, INS_invalid, INS_sve_abs, INS_invalid, INS_sve_abs, INS_invalid, INS_sve_abs, INS_invalid, INS_sve_fabs, INS_sve_fabs}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_EmbeddedMaskedOperation) | |||
|
|||
HARDWARE_INTRINSIC(Sve, Add, -1, -1, false, {INS_sve_add, INS_sve_add, INS_sve_add, INS_sve_add, INS_sve_add, INS_sve_add, INS_sve_add, INS_sve_add, INS_sve_fadd, INS_sve_fadd}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_EmbeddedMaskedOperation|HW_Flag_HasRMWSemantics|HW_Flag_LowMaskedOperation) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate a bit on what LowMaskedOperation
means?
I'm looking at the architecture manual and don't see any limitations on the Add (predicated) instruction. The attached comment to the enum entry is
// The intrinsic uses a mask in arg1 to select elements present in the result, and must use a low register.`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate a bit on what
LowMaskedOperation
means?
The instruction only has 3bits for a predicate register so is limited to using predicate registers 0 to 7. This is quite a common pattern across Sve, hence using an common enum for it (I wouldn't do similar for the handful have only 2 bits)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. I think a flag makes sense then, but I'm not a huge fan of the name given how low
is used in various other contexts.
Maybe something more explicit like RestrictedPredicateRegisterSet
(or an alternative name) would work and make it clearer what the mask means/implies?
@@ -17,6 +17,10 @@ | |||
// SVE Intrinsics | |||
|
|||
// Sve | |||
HARDWARE_INTRINSIC(Sve, Abs, -1, -1, false, {INS_sve_abs, INS_invalid, INS_sve_abs, INS_invalid, INS_sve_abs, INS_invalid, INS_sve_abs, INS_invalid, INS_sve_fabs, INS_sve_fabs}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_EmbeddedMaskedOperation) | |||
|
|||
HARDWARE_INTRINSIC(Sve, Add, -1, -1, false, {INS_sve_add, INS_sve_add, INS_sve_add, INS_sve_add, INS_sve_add, INS_sve_add, INS_sve_add, INS_sve_add, INS_sve_fadd, INS_sve_fadd}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_EmbeddedMaskedOperation|HW_Flag_HasRMWSemantics|HW_Flag_LowMaskedOperation) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this one in particular need EmbeddedMaskedOperation
, the manual has entries for both Add (vectors, predicated)
and Add (vectors, unpredicated)
. Both entries look to be for SVE1
/// Abs : Absolute value | ||
|
||
/// <summary> | ||
/// svint8_t svabs[_s8]_m(svint8_t inactive, svbool_t pg, svint8_t op) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: We've historically also included the primary underlying instruction that gets emitted for a given API.
For example, with AdvSimd
:
/// <summary>
/// float64x2_t vabsq_f64 (float64x2_t a)
/// A64: FABS Vd.2D, Vn.2D
/// </summary>
In this case, I'd expect something like A64: ABS Zd.B, Pg/M, Zn.B
(assuming I got that mostly right)
public static System.Numerics.Vector<long> Abs(System.Numerics.Vector<long> value) { throw null; } | ||
public static System.Numerics.Vector<float> Abs(System.Numerics.Vector<float> value) { throw null; } | ||
public static System.Numerics.Vector<double> Abs(System.Numerics.Vector<double> value) { throw null; } | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are no newlines when this is generated using the tooling (the tooling doesn't always work, unfortunately), so we shouldn't add any here ourselves either.
public static System.Numerics.Vector<sbyte> Abs(System.Numerics.Vector<sbyte> value) { throw null; } | ||
public static System.Numerics.Vector<short> Abs(System.Numerics.Vector<short> value) { throw null; } | ||
public static System.Numerics.Vector<int> Abs(System.Numerics.Vector<int> value) { throw null; } | ||
public static System.Numerics.Vector<long> Abs(System.Numerics.Vector<long> value) { throw null; } | ||
public static System.Numerics.Vector<float> Abs(System.Numerics.Vector<float> value) { throw null; } | ||
public static System.Numerics.Vector<double> Abs(System.Numerics.Vector<double> value) { throw null; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tooling also expects these to be in alphabetical order, based on the fully qualified type name.
So byte, double, short, int, long, nint, sbyte, float, ushort, uint, ulong, nuint
(corresponding to Byte, Double, Int16, Int32, Int64, IntPtr, SByte, Single, UInt16, UInt32, UInt64, UIntPtr
)
@@ -0,0 +1,328 @@ | |||
// Licensed to the .NET Foundation under one or more agreements. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's "different" about this compared to the _BinaryOpTestTemplate.template
?
I would have expected them to be essentially identical and any minor differences could likely have been adjusted, rather than requiring an entirely new template.
@@ -2887,6 +2889,24 @@ | |||
|
|||
(string templateFileName, Dictionary<string, string> templateData)[] SveInputs = new [] | |||
{ | |||
("SveSimpleVecOpTest.template", new Dictionary<string, string> { ["TestName"] = "Sve_Abs_float", ["Isa"] = "Sve", ["LoadIsa"] = "Sve", ["Method"] = "Abs", ["RetVectorType"] = "Vector", ["RetBaseType"] = "Single", ["Op1VectorType"] = "Vector", ["Op1BaseType"] = "Single", ["LargestVectorSize"] = "8", ["NextValueOp1"] = "-TestLibrary.Generator.GetSingle()", ["ValidateIterResult"] = "Helpers.Abs(firstOp[i]) != result[i]"}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is LargestVectorSize == 8
, presumably it should be more like 256
which is the largest supported SVE vector length?
Or at worst it should be 16
(the minimum SVE vector length), 32
(the largest size in server grade hardware), or 64
(the largest size in any SVE implementation to date, notably only in a supercomputer).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I think the same as I commented in #100366 (comment).
This functionality got merged as part of #100743 |
Adds to API methods which require embedded masks (There is no mask parameter in the API call, but within the IR a truemask needs adding before calling codegen). Also adds all the embedded mask code.