[JIT] Optimize constant V512 vector with broadcast #92017

Merged
8 commits merged into dotnet:main on Nov 21, 2023

Conversation

Ruihan-Yin
Contributor

This PR is trying to solve #90328.

The optimization is implemented by replacing an eligible constant V512 vector with a V128 constant plus a broadcasti128 node when lowering a GT_STOREIND whose data operand is that constant.

Currently, the implementation only covers V512 -> broadcasti128(V128). We are open to adjusting the implementation or bringing more cases into this PR, ideally V512/V256 -> broadcasti128(V128) when AVX-512 is available (and possibly V512 -> broadcast64x4(V256) as well).
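
For illustration only, the eligibility condition can be thought of as a lane-repetition check on the 64-byte constant: if its low 16 bytes repeat across all four 128-bit lanes, only those 16 bytes need to live in the data section and a 128-bit broadcast can rebuild the full value at run time. A minimal standalone C++ sketch of that check (not the JIT's actual data structures or helper names):

#include <cstdint>
#include <cstring>

// Returns true when a 64-byte constant is just its low 16 bytes repeated four
// times, i.e. when it can be emitted as a 16-byte constant plus a 128-bit
// broadcast instead of a full 64-byte data-section entry.
bool IsRepeated16ByteLanes(const uint8_t (&value)[64])
{
    for (int lane = 1; lane < 4; lane++)
    {
        if (std::memcmp(value, value + lane * 16, 16) != 0)
        {
            return false; // lanes differ; the full 64-byte constant is required
        }
    }
    return true; // all four 128-bit lanes match the first one
}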

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 13, 2023
@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Sep 13, 2023
@ghost

ghost commented Sep 13, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@Ruihan-Yin Ruihan-Yin changed the title [JIT] Optimize constant V512 vector [JIT] Optimize constant V512 vector with broadcast Sep 13, 2023
@Ruihan-Yin Ruihan-Yin closed this Sep 13, 2023
@Ruihan-Yin Ruihan-Yin reopened this Sep 13, 2023
@Ruihan-Yin
Contributor Author

Ruihan-Yin commented Sep 14, 2023

Ran the test suite twice; the remaining failures appear to be known or flaky, so I am marking the PR as ready for review.

@Ruihan-Yin Ruihan-Yin marked this pull request as ready for review September 14, 2023 23:14
@Ruihan-Yin
Contributor Author

Hi @EgorBo, this PR is ready for review. Please take a look and see whether it covers #90328, thanks!

@tannergooding
Member

tannergooding commented Sep 18, 2023

Is this going to perform better or just save on the size of the rodata section?

What about on hardware without AVX-512 (Haswell, Skylake, etc)?

What about scalar->V128, scalar->V256, scalar->V512, V128->V256, V256->V512, etc?

For AVX-512, the scalar->Vector scenario can at least be covered by an embedded broadcast. But for other scenarios, this seems like it's trading more instructions for a smaller data section.

@Ruihan-Yin
Contributor Author

Is this going to perform better or just save on the size of the rodata section?

The expected improvement is saving space in the constant data section.

What about on hardware without AVX-512 (Haswell, Skylake, etc)?

I initially intended to use VBROADCASTI32X4, which is an AVX-512-only instruction, but it seems VBROADCASTI128 can also do this job for the V256 -> V128 case.

What about scalar->V128, scalar->V256, scalar->V512, V128->V256, V256->V512, etc?

I presume the scope here is to compress a larger existing constant vector into a smaller one for a pure memory operation, which embedded broadcast might not be able to handle.

If we also want to consider compressing down to a scalar, we might have further opportunities: V128/256/512 -> Byte/Word/DWord/QWord.
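
As a rough sketch of that idea (hypothetical, not implemented in this PR), the DWord case would amount to checking that every 4-byte element of the constant holds the same value, so only 4 bytes would need to be stored and then broadcast:

#include <cstddef>
#include <cstdint>
#include <cstring>

// Returns true when every 4-byte element of a constant vector equals the first
// one, i.e. when the whole vector could be rebuilt from a single DWord via a
// broadcast.
bool IsRepeatedDword(const uint8_t* value, size_t sizeInBytes)
{
    for (size_t offset = 4; offset < sizeInBytes; offset += 4)
    {
        if (std::memcmp(value, value + offset, 4) != 0)
        {
            return false; // elements differ; a DWord broadcast cannot rebuild it
        }
    }
    return true;
}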

For AVX-512, the scalar->Vector scenario can at least be covered by an embedded broadcast. But for other scenarios, this seems like it's trading more instructions for a smaller data section.

From my understanding of #90328, the issue is about the pure store case, where the generated code is mostly:

vmovups zmm, zmmword ptr[constant section]
vmovups zmmword ptr[target], zmm

The optimization essentially replaces the first load with a broadcast instruction that takes a smaller constant operand (e.g. a 16-byte load via VBROADCASTI32X4).

I might have understood the issue incorrectly or incompletely, so please correct me if I have any misunderstanding.

@tannergooding
Member

The optimization essentially replaces the first load with a broadcast instruction that takes a smaller constant operand.

👍, if it's primarily for the case where we'd otherwise have a vmovups reg1, [addr], then it sounds great to replace that with vbroadcast reg1, [addr] where possible.

I was initially concerned it would also change:

vadd reg1, reg2, [addr]

into

vbroadcast reg3, [addr]
vadd reg1, reg2, reg3

@Ruihan-Yin
Contributor Author

The optimization essentially replaces the first load with a broadcast instruction that takes a smaller constant operand.

👍, if it's primarily for the case where we'd otherwise have a vmovups reg1, [addr], then it sounds great to replace that with vbroadcast reg1, [addr] where possible.

I was initially concerned it would also change:

vadd reg1, reg2, [addr]

into

vbroadcast reg3, [addr]
vadd reg1, reg2, reg3

I think it wouldn't cover that case (at least it is not intended to), as the entry point of this optimization is LowerStoreIndir().

@Ruihan-Yin
Contributor Author

The CI failure should be unrelated.

Hi @tannergooding @EgorBo, I added the optimizations for V512 -> V256 and V256 -> V128; I think this reaches the expected coverage and is ready for review.

@EgorBo EgorBo self-requested a review October 16, 2023 10:33
@JulieLeeMSFT
Member

@tannergooding, this community PR is ready for review. PTAL.

return;
}

if (!node->Data()->AsVecCon()->TypeIs(TYP_SIMD32) && !node->Data()->AsVecCon()->TypeIs(TYP_SIMD64))
Member

I believe you can just do:

Suggested change
if (!node->Data()->AsVecCon()->TypeIs(TYP_SIMD32) && !node->Data()->AsVecCon()->TypeIs(TYP_SIMD64))
if (!node->Data()->AsVecCon()->TypeIs(TYP_SIMD32, TYP_SIMD64))

Member

@tannergooding tannergooding left a comment


LGTM. This should get a secondary review from someone on the JIT team.

CC. @dotnet/jit-contrib

@tannergooding
Member

CC. @jakobbotsch, @EgorBo in particular

@Ruihan-Yin
Contributor Author

Hi @jakobbotsch @EgorBo, this PR is ready for review; would you please take a look? Thanks!

@jakobbotsch
Member

Since this is AVX-512 backend work I think @BruceForstall should take a look... At a quick glance it seemed a bit odd to do it during STORE_INDIR lowering when presumably constants can benefit in many other cases (as long as they're not already contained), but I am not very familiar with these instructions. I am also going to close and reopen this to rerun CI.

@BruceForstall
Member

Diffs

@BruceForstall BruceForstall merged commit a672cf1 into dotnet:main Nov 21, 2023
136 of 139 checks passed
@Ruihan-Yin
Contributor Author

Thanks everyone for reviewing this!

@github-actions github-actions bot locked and limited conversation to collaborators Dec 22, 2023