[JIT] Optimize constant V512 vector with broadcast #92017
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Ran the test suite twice; the remaining failures appear to be known or random ones. Marking the PR as ready for review.
Is this going to perform better or just save on the size of the rodata section? What about on hardware without AVX-512 ( What about For AVX-512, the
The expected improvement is saving some memory space for constant values.
I intended to use
I presume the scope is compressing a larger existing constant vector into a smaller one for a pure memory operation, which embedded broadcast might not be able to handle. If we also want to consider compressing to a scalar, there may be a further opportunity: V128/256/512 -> Byte/Word/DWord/QWord.
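To make the scalar idea concrete, here is a minimal sketch of how the narrowest repeating unit of a constant could be found. It works on raw bytes and is not the JIT's code: the name `SmallestRepeatingUnit` and the `std::array` representation are assumptions made for illustration only.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of the JIT): returns the size in bytes of the
// smallest power-of-two unit whose repetition reproduces the whole constant.
// A result of 1/2/4/8 corresponds to Byte/Word/DWord/QWord, 16 or 32 to a
// V128 or V256 lane, and N means the constant cannot be compressed.
template <std::size_t N>
std::size_t SmallestRepeatingUnit(const std::array<std::uint8_t, N>& bytes)
{
    static_assert(N != 0 && (N & (N - 1)) == 0, "vector size must be a power of two");

    for (std::size_t unit = 1; unit < N; unit *= 2)
    {
        bool repeats = true;
        // Compare every later unit-sized chunk against the first one.
        for (std::size_t offset = unit; offset < N; offset += unit)
        {
            if (std::memcmp(bytes.data(), bytes.data() + offset, unit) != 0)
            {
                repeats = false;
                break;
            }
        }
        if (repeats)
        {
            return unit;
        }
    }
    return N;
}
```

Under these assumptions, a 64-byte constant built from one repeated 4-byte pattern reports 4 (a DWord broadcast candidate), while a constant whose bytes are all distinct reports 64.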
From my understanding of #90328, the issue concerns the pure store case, where the codegen is mostly:
The optimization essentially replaces the first load with a broadcast instruction that takes a smaller constant operand. I may have understood the issue incorrectly or incompletely, so please correct me if so.
👍, if it's primarily for the case where we'd otherwise have a I was initially concerned it would also change:
into
I don't think it would cover that case (at least, it is not intended to), as the entry point of this optimization is
The failure should be unrelated. Hi @tannergooding @EgorBo, I added the optimization for V512->V256 and V256->V128. I think it now reaches the expected coverage and is ready for review.
@tannergooding, this community PR is ready for review. PTAL.
src/coreclr/jit/lowerxarch.cpp
return;
}

if (!node->Data()->AsVecCon()->TypeIs(TYP_SIMD32) && !node->Data()->AsVecCon()->TypeIs(TYP_SIMD64))
I believe you can just do:
if (!node->Data()->AsVecCon()->TypeIs(TYP_SIMD32, TYP_SIMD64))
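For readers unfamiliar with the multi-argument form, the suggestion relies on a `TypeIs` overload that accepts several types and returns true if any of them matches. The following is only a sketch of how such a helper can be written; `SketchType` and `SketchNode` are stand-in names invented here, not the JIT's actual declarations.

```cpp
// Stand-in for the JIT's type enum; the names are illustrative only.
enum class SketchType
{
    Int,
    Simd16,
    Simd32,
    Simd64
};

// Stand-in for a tree node that carries a type.
struct SketchNode
{
    SketchType type;

    // Single-type check.
    bool TypeIs(SketchType t) const
    {
        return type == t;
    }

    // Multi-type check: true if the node's type matches any argument.
    template <typename... Rest>
    bool TypeIs(SketchType t, Rest... rest) const
    {
        return TypeIs(t) || TypeIs(rest...);
    }
};
```

With a helper of this shape, `!x.TypeIs(A) && !x.TypeIs(B)` and `!x.TypeIs(A, B)` are equivalent, which is why the shorter condition does not change behavior.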
LGTM. This should get a secondary review from someone on the JIT team
CC. @dotnet/jit-contrib
CC. @jakobbotsch, @EgorBo in particular
56041f3 to a165107
a165107 to a01fd58
Hi @jakobbotsch @EgorBo, this PR is ready for review. Would you please take a look? Thanks!
Since this is AVX-512 backend work I think @BruceForstall should take a look... On my quick glance it seemed a bit odd to do it during
Thanks, everyone, for reviewing this!
This PR is trying to solve #90328.
The optimization replaces a constant V512 vector with a V128 constant and a `broadcasti128` node when lowering a `GT_STOREIND` that has an eligible constant V512 vector as its operand.
Currently, the implementation only covers `V512 -> broadcasti128(V128)`. We are open to adjusting the implementation or bringing more cases into this PR, ideally `V512/256 -> broadcasti128(V128)` when AVX512 is available (possibly plus `V512 -> broadcast64x4(V256)`).
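As a rough illustration of what "eligible" could mean here, the sketch below checks whether a constant is just its first lane repeated, which is the property a broadcast needs. This is an assumption about the eligibility test, not the code in this PR; `IsBroadcastOfFirstLane` and the byte-array representation are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not the JIT's API): returns true when every
// laneBytes-sized chunk of the constant equals the first chunk, so the
// constant could be rebuilt by broadcasting that first chunk.
template <std::size_t N>
bool IsBroadcastOfFirstLane(const std::array<std::uint8_t, N>& bytes, std::size_t laneBytes)
{
    // The lane size must evenly divide the vector size.
    assert(laneBytes > 0 && N % laneBytes == 0);

    for (std::size_t offset = laneBytes; offset < N; offset += laneBytes)
    {
        if (std::memcmp(bytes.data(), bytes.data() + offset, laneBytes) != 0)
        {
            return false;
        }
    }
    return true;
}
```

In this sketch, a 64-byte constant with `laneBytes == 16` models the `V512 -> broadcasti128(V128)` case, and `laneBytes == 32` models `V512 -> broadcast64x4(V256)`.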