Enable EVEX feature: embedded broadcast for Vector128/256/512.Add() in limited cases #84821

Merged
merged 44 commits into from
Jun 2, 2023
6b2df17
Enable EVEX feature: embedded broadcast
Ruihan-Yin Mar 20, 2023
a8c7d82
remove some irrelevant change from previous main.
Ruihan-Yin Apr 14, 2023
73fb02f
Enable containment at Broadcast intrinsic
Ruihan-Yin Apr 17, 2023
52cd44d
Convert the check logics on broadcast into a flag
Ruihan-Yin Apr 18, 2023
10d75c0
bug fixes:
Ruihan-Yin Apr 19, 2023
cdb6144
apply format patch.
Ruihan-Yin Apr 19, 2023
55a6bb7
Add "insOpts" data structure to xarch:
Ruihan-Yin Apr 25, 2023
5569217
Add "OperIsBroadcastScalar" check:
Ruihan-Yin Apr 25, 2023
4f92123
rebase the branch and resolve conflicts
Ruihan-Yin Apr 25, 2023
328549f
changes based on the reviews:
Ruihan-Yin Apr 26, 2023
f86a993
apply format patch
Ruihan-Yin Apr 27, 2023
d486ed4
bug fixes
Ruihan-Yin Apr 27, 2023
2c60838
bug fixes
Ruihan-Yin Apr 27, 2023
172861e
apply format patch
Ruihan-Yin Apr 28, 2023
02c61c7
Enable embedded broadcast for Vector128<float>.Add
Ruihan-Yin May 1, 2023
2a6f8a7
Enable embedded broadcast for Vector512<float>.Add
Ruihan-Yin May 2, 2023
b036bcd
make double as embedded broadcast supported
Ruihan-Yin May 2, 2023
4358ee0
Add EB support to AVX_BroadcastScalarToVector*
Ruihan-Yin May 2, 2023
3a9093a
apply format patch
Ruihan-Yin May 2, 2023
d018d99
Enable embedded broadcast for double const vector
Ruihan-Yin May 3, 2023
7557db7
Enable embedded broadcast for integer Add.
Ruihan-Yin May 4, 2023
867eaf0
Changes based on the review:
Ruihan-Yin May 4, 2023
3f4d95b
removed the gentree flag: GTF_VECCON_FROMSCALAR
Ruihan-Yin May 5, 2023
32fd87a
Bug fixes on embedded broadcast with AVX_Broadcast
Ruihan-Yin May 5, 2023
4f97298
enable embedded broadcast in R_R_A path
Ruihan-Yin May 8, 2023
a5c4414
apply format patch
Ruihan-Yin May 8, 2023
12363a9
bug fixes:
Ruihan-Yin May 8, 2023
b561885
Changes based on reviews:
Ruihan-Yin May 11, 2023
90e27c4
unfold VecCon node when lowering if this node is
Ruihan-Yin May 4, 2023
9bfa325
apply format patch
Ruihan-Yin May 11, 2023
8072d29
bug fixes:
Ruihan-Yin May 11, 2023
7db1c5e
resolve the mishandling for the previous conflict.
Ruihan-Yin May 18, 2023
c916008
move the unfolding logic to ContainChecks
Ruihan-Yin May 19, 2023
4ee1f97
Code changes based on the review
Ruihan-Yin May 18, 2023
97cb23a
apply format patch
Ruihan-Yin May 19, 2023
37b57be
support embedded broadcast for GT_IND
Ruihan-Yin May 19, 2023
14a370a
bug fixes:
Ruihan-Yin May 19, 2023
64fec11
apply format patch
Ruihan-Yin May 19, 2023
45b7807
Introduce MakeHWIntrinsicSrcContained():
Ruihan-Yin May 19, 2023
cb8feb4
Code changes based on reviews:
Ruihan-Yin May 23, 2023
3fe0a2f
Code changes based on review
Ruihan-Yin May 23, 2023
6fb6e48
apply format patch
Ruihan-Yin May 23, 2023
3b3d0d1
Code changes based on review:
Ruihan-Yin May 23, 2023
36af7b7
Code changes based on review:
Ruihan-Yin May 31, 2023
8 changes: 7 additions & 1 deletion src/coreclr/jit/codegeninterface.h
@@ -127,7 +127,9 @@ class CodeGenInterface
#define INST_FP 0x01 // is it a FP instruction?
public:
static bool instIsFP(instruction ins);

#if defined(TARGET_XARCH)
static bool instIsEmbeddedBroadcastCompatible(instruction ins);
#endif // TARGET_XARCH
//-------------------------------------------------------------------------
// Liveness-related fields & methods
public:
@@ -764,6 +766,10 @@ class CodeGenInterface

virtual const char* siStackVarName(size_t offs, size_t size, unsigned reg, unsigned stkOffs) = 0;
#endif // LATE_DISASM

#if defined(TARGET_XARCH)
bool IsEmbeddedBroadcastEnabled(instruction ins, GenTree* op);
#endif
};

#endif // _CODEGEN_INTERFACE_H_
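This diff only declares `IsEmbeddedBroadcastEnabled(instruction ins, GenTree* op)` and `instIsEmbeddedBroadcastCompatible(instruction ins)`; their bodies are not shown here. As a rough, hypothetical mental model (not the JIT's actual logic), folding a broadcast into an instruction's memory operand requires EVEX encoding, an instruction that accepts a broadcast operand, an operand that is a scalar broadcast from memory, and an element size of 4 or 8 bytes (the only sizes `emitGetMemOpSize` handles in this change). `BroadcastQuery` and `CanUseEmbeddedBroadcast` are illustrative names, not JIT APIs.

```cpp
#include <cassert>

// Hypothetical, simplified inputs to the eligibility decision; the real JIT
// derives these from the instruction table and the GenTree operand.
struct BroadcastQuery
{
    bool     evexAvailable;          // target supports EVEX encoding (AVX-512)
    bool     insSupportsBroadcast;   // stands in for instIsEmbeddedBroadcastCompatible(ins)
    bool     operandIsScalarFromMem; // operand is a scalar broadcast from memory
    unsigned elementSizeInBytes;     // size of one vector element
};

// Returns true when this sketch-level check would allow folding the broadcast
// into the consuming instruction's memory operand ({1toN} form).
constexpr bool CanUseEmbeddedBroadcast(const BroadcastQuery& q)
{
    return q.evexAvailable && q.insSupportsBroadcast && q.operandIsScalarFromMem &&
           (q.elementSizeInBytes == 4 || q.elementSizeInBytes == 8);
}
```

In the "limited cases" this PR targets, the check fails closed: any missing condition keeps the ordinary separate broadcast plus full-width operand.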
44 changes: 39 additions & 5 deletions src/coreclr/jit/emit.h
@@ -781,6 +781,9 @@ class emitter
unsigned _idCallRegPtr : 1; // IL indirect calls: addr in reg
unsigned _idCallAddr : 1; // IL indirect calls: can make a direct call to iiaAddr
unsigned _idNoGC : 1; // Some helpers don't get recorded in GC tables
#if defined(TARGET_XARCH)
unsigned _idEvexbContext : 1; // does EVEX.b need to be set.
Member

This reserves a bit for EVEX.b for all xarch instructions, even though very few will actually need it. Is there some other way to represent this data? E.g., new instrDesc types for those that need it, or maybe just for all EVEX encoded instructions, with extra fields for EVEX needs? For broadcast, could we create new insFormat values to use for memory reads that use broadcast (instead of using the actual memory type)?
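One of the alternatives raised here is to keep the common descriptor small and move EVEX-only state into a separate, larger descriptor type. The sketch below illustrates that idea with hypothetical structs (`SmallDesc`, `EvexDesc`) and a hypothetical `NeedsEvexDesc` helper; the field widths are invented for illustration and do not reflect the real `instrDesc` layout.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical "small" descriptor: exactly 64 bits, no EVEX-only state.
struct SmallDesc
{
    uint64_t ins      : 10; // opcode
    uint64_t reg1     : 7;  // first register
    uint64_t reg2     : 7;  // second register
    uint64_t format   : 8;  // instruction format
    uint64_t smallCns : 32; // whatever is left over holds a small constant
};

// Hypothetical extended descriptor: only instructions that need EVEX extras
// pay for the larger allocation.
struct EvexDesc : SmallDesc
{
    uint8_t evexB    : 1; // embedded broadcast (or rounding/SAE on reg-reg forms)
    uint8_t evexZ    : 1; // zeroing vs merging masking
    uint8_t maskReg  : 3; // opmask register k0..k7
    uint8_t vectorLL : 2; // EVEX.L'L vector length
};

// Hypothetical factory decision: request the bigger descriptor only when needed.
constexpr bool NeedsEvexDesc(bool broadcast, bool masking, bool rounding)
{
    return broadcast || masking || rounding;
}
```

The cost moves from every instruction to the rare EVEX-featured ones, at the price of another descriptor kind the emitter must handle.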

@tannergooding (Member), May 10, 2023

Just some notes. We may need up to 7 new bits: #84821 (comment)

A few of these bits are needed to represent one extra register for the kmask scenario. vfixupimm and vpternlog take 2 registers + 1 register or addressing mode + 1 constant (this is everything we support today). On top of that, we also need +1 mask register, a bit for EVEX.b, a bit for EVEX.Z, and two bits for EVEX.L'L (all of which need to be carried until the code is constructed).

The broadcast case already requires an addressing mode. However, the same EVEX.b bit is also used to flag embedded rounding control or SAE (suppress all exceptions), which doesn't involve addressing. The EVEX.Z bit is only used together with an opmask register, but the opmask can be used independently of addressing.

It would be nice to squeeze these bits into the spare padding we have available, if possible, but it's not "required". Doing so may help throughput, since there would be fewer instrDesc* kinds to handle.
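The 7-bit estimate lines up with the EVEX P2 byte minus V': 3 bits for the opmask register (aaa), 1 for EVEX.z, 1 for EVEX.b, and 2 for EVEX.L'L. A minimal sketch, assuming the standard P2 layout (z | L'L | b | V' | aaa, bit 7 down to bit 0) and using hypothetical names, of packing those carried fields at encoding time:

```cpp
#include <cassert>
#include <cstdint>

// EVEX-only state that would travel with the instruction descriptor until
// encoding time. Names are hypothetical; widths follow the EVEX P2 byte.
struct EvexExtras
{
    unsigned maskReg; // aaa: opmask register k0..k7 (3 bits)
    unsigned z;       // EVEX.z: zeroing (1) vs merging (0) masking (1 bit)
    unsigned b;       // EVEX.b: broadcast, or rounding/SAE on reg-reg forms (1 bit)
    unsigned vecLen;  // EVEX.L'L: 0 = 128, 1 = 256, 2 = 512 bits (2 bits)
};

constexpr unsigned kEvexExtraBits = 3 + 1 + 1 + 2; // == 7

// Packs the fields into P2: z | L'L | b | V' | aaa (bit 7 to bit 0).
// V' is derived from the register number and stored inverted, so the caller
// passes it already inverted; it is not part of the 7 carried bits.
constexpr uint8_t PackEvexP2(const EvexExtras& e, unsigned vPrimeInverted)
{
    return static_cast<uint8_t>(((e.z & 1u) << 7) | ((e.vecLen & 3u) << 5) | ((e.b & 1u) << 4) |
                                ((vPrimeInverted & 1u) << 3) | (e.maskReg & 7u));
}
```

For example, setting only EVEX.b yields `0x10` in P2, the same bit the diff's `EVEX_B_BIT` targets once the prefix byte sits at bits 32..39 of the code word.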

Member

We should think more about this. Every bit that is put here takes away from "small" constants. We should also consider that code requiring EVEX information is likely to be extremely rare compared to non-EVEX instructions. For embedded broadcast, which requires an addressing mode, maybe emitAddrMode iiaAddrMode can carry the data, for example.

Member

Ah, I didn't realize these spare bits were used to hold the small constants 👍

For clarification, what is currently defined as a "small" constant? On the x86 side we have:

  • imm8 - most commonly used, used extensively by hwintrinsics and general instructions alike
  • imm16 - extremely rare as emitting 16-bit operations is itself very rare and typically more expensive than a 32-bit operation
  • imm32 - second most common and used by calls, jumps, etc
  • imm64 - 64-bit only and only for 1 instruction

My naive guess would then be that we want to reserve 8-bits for "small constants" and never shrink past that. 16/32-bit constants would then be considered non-small and part of the regular instrDesc. 64-bit constants would be "unique" since it impacts 1 instruction and is rare.
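To make the size classes above concrete, here is a small sketch that picks the narrowest signed immediate width able to hold a value. It is a simplification: it ignores which opcodes actually have an imm8 or imm16 form, and hwintrinsic control bytes are usually treated as unsigned values in 0..255. `SmallestSignedImmWidth` is a hypothetical helper, not an emitter API.

```cpp
#include <cassert>
#include <cstdint>

// Narrowest x86 immediate width (in bits) that can hold a signed value.
// Simplified: real encodings depend on the opcode, and only `mov r64, imm64`
// accepts a full 64-bit immediate.
constexpr int SmallestSignedImmWidth(int64_t value)
{
    if (value >= INT8_MIN && value <= INT8_MAX)
        return 8;
    if (value >= INT16_MIN && value <= INT16_MAX)
        return 16;
    if (value >= INT32_MIN && value <= INT32_MAX)
        return 32;
    return 64;
}
```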

Member

Well, a "small constant" in the emitter is "whatever space is left over in the first 64 bits of the instrDesc". Basically, a "small" instrDesc is 64-bits. _idSmallCns takes up whatever space is left over after the instruction opcode, 2 registers, GC type, etc. It's 7-12 bits, currently, depending on architecture. If the value doesn't fit in that, we have to allocate an instrDescCns or some other instrDesc subclass to hold it. Adding more bits in the "small instrDesc" section pushes some constants to require a bigger instrDesc format.
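The trade-off described above is simple arithmetic: the inline constant field gets whatever the fixed fields leave of the 64-bit small descriptor, so every extra flag bit halves the largest constant that fits inline. This sketch assumes the small constant is stored as an unsigned field and that the fixed fields use fewer than 64 bits; the bit counts in the tests are illustrative, not the emitter's actual numbers, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model: a "small" descriptor is 64 bits in total.
constexpr unsigned kSmallDescBits = 64;

// Bits left for the inline constant after the fixed fields (opcode, registers,
// GC type, flags, ...). Assumes fixedFieldBits < 64.
constexpr unsigned SmallCnsBits(unsigned fixedFieldBits)
{
    return kSmallDescBits - fixedFieldBits;
}

// Largest value the inline field can hold, assuming it is stored unsigned.
constexpr uint64_t MaxSmallCns(unsigned fixedFieldBits)
{
    return (uint64_t{1} << SmallCnsBits(fixedFieldBits)) - 1;
}

// True if the constant fits inline; otherwise a larger descriptor with a
// dedicated constant field would be required.
constexpr bool FitsInSmallDesc(uint64_t cns, unsigned fixedFieldBits)
{
    return cns <= MaxSmallCns(fixedFieldBits);
}
```

Under these assumptions, growing the fixed fields from 52 to 53 bits (one new flag) shrinks the inline range from 4095 to 2047, so a constant such as 3000 would move to a bigger descriptor.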

Member

I'll take responsibility for splitting this into its own instrDesc as part of adding the support for embedded rounding control and masking support.

#endif // TARGET_XARCH

#ifdef TARGET_ARM64
opSize _idOpSize : 3; // operand size: 0=1 , 1=2 , 2=4 , 3=8, 4=16
@@ -814,8 +817,8 @@ class emitter

////////////////////////////////////////////////////////////////////////
// Space taken up to here:
// x86: 46 bits
// amd64: 46 bits
// x86: 47 bits
// amd64: 47 bits
// arm: 48 bits
// arm64: 50 bits
// loongarch64: 46 bits
@@ -830,8 +833,10 @@ class emitter
#define ID_EXTRA_BITFIELD_BITS (16)
#elif defined(TARGET_ARM64)
#define ID_EXTRA_BITFIELD_BITS (18)
#elif defined(TARGET_XARCH) || defined(TARGET_LOONGARCH64) || defined(TARGET_RISCV64)
#elif defined(TARGET_LOONGARCH64) || defined(TARGET_RISCV64)
#define ID_EXTRA_BITFIELD_BITS (14)
#elif defined(TARGET_XARCH)
#define ID_EXTRA_BITFIELD_BITS (15)
#else
#error Unsupported or unset target architecture
#endif
@@ -866,8 +871,8 @@ class emitter

////////////////////////////////////////////////////////////////////////
// Space taken up to here (with/without prev offset, assuming host==target):
// x86: 52/48 bits
// amd64: 53/48 bits
// x86: 53/49 bits
// amd64: 54/49 bits
// arm: 54/50 bits
// arm64: 57/52 bits
// loongarch64: 53/48 bits
@@ -1529,6 +1534,19 @@ class emitter
_idNoGC = val;
}

#ifdef TARGET_XARCH
bool idIsEvexbContext() const
{
return _idEvexbContext != 0;
}
void idSetEvexbContext()
{
assert(_idEvexbContext == 0);
_idEvexbContext = 1;
assert(_idEvexbContext == 1);
}
#endif

#ifdef TARGET_ARMARCH
bool idIsLclVar() const
{
@@ -3655,9 +3673,25 @@ inline unsigned emitter::emitGetInsCIargs(instrDesc* id)
//
emitAttr emitter::emitGetMemOpSize(instrDesc* id) const
{

emitAttr defaultSize = id->idOpSize();
instruction ins = id->idIns();
if (id->idIsEvexbContext())
{
// Assumes EVEX.b here denotes the embedded broadcast context (not rounding control/SAE).
// Reference: Section 2.7.5 of the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 2.
ssize_t inputSize = GetInputSizeInBytes(id);
switch (inputSize)
{
case 4:
return EA_4BYTE;
case 8:
return EA_8BYTE;

default:
unreached();
}
}
switch (ins)
{
case INS_pextrb:
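With EVEX.b set on a memory operand, the instruction reads a single element and replicates it across the vector (for example `vaddps zmm0, zmm1, dword ptr [rax] {1to16}`), which is why `emitGetMemOpSize` reports the element size rather than the vector size. A minimal sketch of that relationship, with hypothetical names, returning `{0, 0}` where the real code calls `unreached()`:

```cpp
#include <cassert>

// Shape of an embedded-broadcast memory operand (hypothetical type).
struct BroadcastShape
{
    unsigned memOpBytes; // bytes actually read from memory (one element)
    unsigned factor;     // N in the {1toN} disassembly suffix
};

// Mirrors the 4- and 8-byte cases handled for EVEX.b in emitGetMemOpSize.
// Other element sizes are reported as {0, 0} instead of hitting unreached().
constexpr BroadcastShape GetBroadcastShape(unsigned vectorBytes, unsigned elemBytes)
{
    return (elemBytes == 4 || elemBytes == 8) ? BroadcastShape{elemBytes, vectorBytes / elemBytes}
                                              : BroadcastShape{0, 0};
}
```

So a Vector512<float> Add reads 4 bytes with factor 16, while a Vector128<double> Add reads 8 bytes with factor 2.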
73 changes: 60 additions & 13 deletions src/coreclr/jit/emitxarch.cpp
@@ -1231,9 +1231,10 @@ bool emitter::TakesEvexPrefix(const instrDesc* id) const
#define DEFAULT_BYTE_EVEX_PREFIX_MASK 0xFFFFFFFF00000000ULL
#define LBIT_IN_BYTE_EVEX_PREFIX 0x0000002000000000ULL
#define LPRIMEBIT_IN_BYTE_EVEX_PREFIX 0x0000004000000000ULL
#define EVEX_B_BIT 0x0000001000000000ULL

//------------------------------------------------------------------------
// AddEvexPrefix: Add default EVEX perfix with only LL' bits set.
// AddEvexPrefix: Add default EVEX prefix with only LL' bits set.
//
// Arguments:
// ins -- processor instruction to check.
@@ -1268,6 +1269,22 @@ emitter::code_t emitter::AddEvexPrefix(instruction ins, code_t code, emitAttr at
return code;
}

//------------------------------------------------------------------------
// AddEvexbBit: set the EVEX.b bit in the opcode bits. Callers are responsible for
// only doing so when EvexbContext is set in the instruction descriptor.
//
// Arguments:
// code -- opcode bits.
//
// Return Value:
// encoded code with EVEX.b set.
//
emitter::code_t emitter::AddEvexbBit(code_t code)
{
assert(hasEvexPrefix(code));
code |= EVEX_B_BIT;
return code;
}

// Returns true if this instruction requires a VEX prefix
// All AVX instructions require a VEX prefix
bool emitter::TakesVexPrefix(instruction ins) const
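The masks defined in this diff place EVEX.b, L, and L' at bits 36, 37 and 38 of the 64-bit code word, i.e. bits 4, 5 and 6 of the same prefix byte, matching the b and L'L positions in the EVEX P2 byte. A standalone sketch that copies those constants and mirrors what `AddEvexbBit` does; `SetEvexB` and `BitIndex` are hypothetical helpers, and like the original, `SetEvexB` sets the bit unconditionally, leaving the decision to the caller.

```cpp
#include <cassert>
#include <cstdint>

using code_t = uint64_t;

// Mask values copied from the diff above.
constexpr code_t LBIT_IN_BYTE_EVEX_PREFIX      = 0x0000002000000000ULL;
constexpr code_t LPRIMEBIT_IN_BYTE_EVEX_PREFIX = 0x0000004000000000ULL;
constexpr code_t EVEX_B_BIT                    = 0x0000001000000000ULL;

// Standalone equivalent of AddEvexbBit: OR EVEX.b into the code word.
constexpr code_t SetEvexB(code_t code)
{
    return code | EVEX_B_BIT;
}

// Bit position of a single-bit mask (mask must be a nonzero power of two).
constexpr int BitIndex(code_t mask)
{
    int i = 0;
    while ((mask >> i) != 1)
    {
        ++i;
    }
    return i;
}
```

Because the three masks are distinct bits, setting EVEX.b never disturbs the vector-length bits already written by `AddEvexPrefix`.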
@@ -6667,7 +6684,8 @@ void emitter::emitIns_R_S_I(instruction ins, emitAttr attr, regNumber reg1, int
emitCurIGsize += sz;
}

void emitter::emitIns_R_R_A(instruction ins, emitAttr attr, regNumber reg1, regNumber reg2, GenTreeIndir* indir)
void emitter::emitIns_R_R_A(
instruction ins, emitAttr attr, regNumber reg1, regNumber reg2, GenTreeIndir* indir, insOpts instOptions)
{
assert(IsAvx512OrPriorInstruction(ins));
assert(IsThreeOperandAVXInstruction(ins));
@@ -6678,6 +6696,11 @@ void emitter::emitIns_R_R_A(instruction ins, emitAttr attr, regNumber reg1, regN
id->idIns(ins);
id->idReg1(reg1);
id->idReg2(reg2);
if (instOptions == INS_OPTS_EVEX_b)
{
assert(UseEvexEncoding());
id->idSetEvexbContext();
}

emitHandleMemOp(indir, id, (ins == INS_mulx) ? IF_RWR_RWR_ARD : emitInsModeFormat(ins, IF_RRD_RRD_ARD), ins);

@@ -6778,8 +6801,13 @@ void emitter::emitIns_R_AR_R(instruction ins,
emitCurIGsize += sz;
}

void emitter::emitIns_R_R_C(
instruction ins, emitAttr attr, regNumber reg1, regNumber reg2, CORINFO_FIELD_HANDLE fldHnd, int offs)
void emitter::emitIns_R_R_C(instruction ins,
emitAttr attr,
regNumber reg1,
regNumber reg2,
CORINFO_FIELD_HANDLE fldHnd,
int offs,
insOpts instOptions)
{
assert(IsAvx512OrPriorInstruction(ins));
assert(IsThreeOperandAVXInstruction(ins));
@@ -6797,6 +6825,11 @@ void emitter::emitIns_R_R_C(
id->idReg1(reg1);
id->idReg2(reg2);
id->idAddr()->iiaFieldHnd = fldHnd;
if (instOptions == INS_OPTS_EVEX_b)
{
assert(UseEvexEncoding());
id->idSetEvexbContext();
}

UNATIVE_OFFSET sz = emitInsSizeCV(id, insCodeRM(ins));
id->idCodeSize(sz);
@@ -6829,7 +6862,8 @@ void emitter::emitIns_R_R_R(instruction ins, emitAttr attr, regNumber targetReg,
emitCurIGsize += sz;
}

void emitter::emitIns_R_R_S(instruction ins, emitAttr attr, regNumber reg1, regNumber reg2, int varx, int offs)
void emitter::emitIns_R_R_S(
instruction ins, emitAttr attr, regNumber reg1, regNumber reg2, int varx, int offs, insOpts instOptions)
{
assert(IsAvx512OrPriorInstruction(ins));
assert(IsThreeOperandAVXInstruction(ins));
@@ -6842,6 +6876,11 @@ void emitter::emitIns_R_R_S(instruction ins, emitAttr attr, regNumber reg1, regN
id->idReg2(reg2);
id->idAddr()->iiaLclVar.initLclVarAddr(varx, offs);

if (instOptions == INS_OPTS_EVEX_b)
{
assert(UseEvexEncoding());
id->idSetEvexbContext();
}
#ifdef DEBUG
id->idDebugOnlyInfo()->idVarRefOffs = emitVarRefOffs;
#endif
@@ -8126,14 +8165,15 @@ void emitter::emitIns_SIMD_R_R_I(instruction ins, emitAttr attr, regNumber targe
// indir -- The GenTreeIndir used for the memory address
//
void emitter::emitIns_SIMD_R_R_A(
instruction ins, emitAttr attr, regNumber targetReg, regNumber op1Reg, GenTreeIndir* indir)
instruction ins, emitAttr attr, regNumber targetReg, regNumber op1Reg, GenTreeIndir* indir, insOpts instOptions)
{
if (UseSimdEncoding())
{
emitIns_R_R_A(ins, attr, targetReg, op1Reg, indir);
emitIns_R_R_A(ins, attr, targetReg, op1Reg, indir, instOptions);
}
else
{
assert(instOptions == INS_OPTS_NONE);
emitIns_Mov(INS_movaps, attr, targetReg, op1Reg, /* canSkip */ true);
emitIns_R_A(ins, attr, targetReg, indir);
}
@@ -8151,15 +8191,21 @@ void emitter::emitIns_SIMD_R_R_A(
// fldHnd -- The CORINFO_FIELD_HANDLE used for the memory address
// offs -- The offset added to the memory address from fldHnd
//
void emitter::emitIns_SIMD_R_R_C(
instruction ins, emitAttr attr, regNumber targetReg, regNumber op1Reg, CORINFO_FIELD_HANDLE fldHnd, int offs)
void emitter::emitIns_SIMD_R_R_C(instruction ins,
emitAttr attr,
regNumber targetReg,
regNumber op1Reg,
CORINFO_FIELD_HANDLE fldHnd,
int offs,
insOpts instOptions)
{
if (UseSimdEncoding())
{
emitIns_R_R_C(ins, attr, targetReg, op1Reg, fldHnd, offs);
emitIns_R_R_C(ins, attr, targetReg, op1Reg, fldHnd, offs, instOptions);
}
else
{
assert(instOptions == INS_OPTS_NONE);
emitIns_Mov(INS_movaps, attr, targetReg, op1Reg, /* canSkip */ true);
emitIns_R_C(ins, attr, targetReg, fldHnd, offs);
}
@@ -8214,14 +8260,15 @@ void emitter::emitIns_SIMD_R_R_R(
// offs -- The offset added to the memory address from varx
//
void emitter::emitIns_SIMD_R_R_S(
instruction ins, emitAttr attr, regNumber targetReg, regNumber op1Reg, int varx, int offs)
instruction ins, emitAttr attr, regNumber targetReg, regNumber op1Reg, int varx, int offs, insOpts instOptions)
{
if (UseSimdEncoding())
{
emitIns_R_R_S(ins, attr, targetReg, op1Reg, varx, offs);
emitIns_R_R_S(ins, attr, targetReg, op1Reg, varx, offs, instOptions);
}
else
{
assert(instOptions == INS_OPTS_NONE);
emitIns_Mov(INS_movaps, attr, targetReg, op1Reg, /* canSkip */ true);
emitIns_R_S(ins, attr, targetReg, varx, offs);
}
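The `emitIns_SIMD_R_R_*` helpers above share one pattern: when VEX/EVEX encoding is available they forward `instOptions` to the three-operand form; otherwise they copy `op1` into the target and use the two-operand legacy form, where EVEX.b cannot be expressed, hence `assert(instOptions == INS_OPTS_NONE)`. The sketch below models that control flow with a hypothetical text-emitting `SketchEmitter` and a two-value `InsOpts` enum; `{1toN}` is a placeholder for the real broadcast suffix.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for insOpts, limited to the two values used in this diff.
enum class InsOpts
{
    None,
    EvexB
};

// Hypothetical emitter that records textual instructions instead of bytes.
struct SketchEmitter
{
    bool                     useSimdEncoding; // VEX/EVEX three-operand forms available
    std::vector<std::string> out;

    // Mirrors the shape of emitIns_SIMD_R_R_A.
    void SimdRRA(const std::string& ins, const std::string& target, const std::string& op1,
                 const std::string& mem, InsOpts opts)
    {
        if (useSimdEncoding)
        {
            // Three-operand VEX/EVEX form; options are forwarded as-is.
            std::string suffix = (opts == InsOpts::EvexB) ? " {1toN}" : "";
            out.push_back("v" + ins + " " + target + ", " + op1 + ", " + mem + suffix);
        }
        else
        {
            // Legacy SSE encoding has no EVEX.b, so no options may be requested.
            assert(opts == InsOpts::None);
            if (target != op1) // canSkip: omit a self-move
            {
                out.push_back("movaps " + target + ", " + op1);
            }
            out.push_back(ins + " " + target + ", " + mem);
        }
    }
};
```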
@@ -15709,7 +15756,7 @@ BYTE* emitter::emitOutputLJ(insGroup* ig, BYTE* dst, instrDesc* i)
// Return Value:
// size in bytes.
//
ssize_t emitter::GetInputSizeInBytes(instrDesc* id)
ssize_t emitter::GetInputSizeInBytes(instrDesc* id) const
{
insFlags inputSize = static_cast<insFlags>((CodeGenInterface::instInfo[id->idIns()] & Input_Mask));
