[JIT] Optimize constant V512 vector with broadcast #92017

Merged
8 commits merged into dotnet:main on Nov 21, 2023

Conversation

Ruihan-Yin
Contributor

This PR is trying to solve #90328.

The optimization is implemented by replacing an eligible constant V512 vector with a V128 constant plus a broadcasti128 node when lowering a GT_STOREIND whose data operand is that constant.

Currently, the implementation only covers V512 -> broadcasti128(V128). We are open to adjusting the implementation or bringing more cases into this PR, ideally V512/V256 -> broadcasti128(V128) when AVX-512 is available (and possibly V512 -> broadcast64x4(V256) as well).
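
For illustration only, the eligibility condition can be thought of as a lane-repetition check on the 64-byte constant: if its low 16 bytes repeat across all four 128-bit lanes, only those 16 bytes need to live in the data section and a 128-bit broadcast can rebuild the full value at run time. A minimal standalone C++ sketch of that check (not the JIT's actual data structures or helper names):

#include <cstdint>
#include <cstring>

// Returns true when a 64-byte constant is just its low 16 bytes repeated four
// times, i.e. when it can be emitted as a 16-byte constant plus a 128-bit
// broadcast instead of a full 64-byte data-section entry.
bool IsRepeated16ByteLanes(const uint8_t (&value)[64])
{
    for (int lane = 1; lane < 4; lane++)
    {
        if (std::memcmp(value, value + lane * 16, 16) != 0)
        {
            return false; // lanes differ; the full 64-byte constant is required
        }
    }
    return true; // all four 128-bit lanes match the first one
}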

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 13, 2023
@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Sep 13, 2023
@ghost

ghost commented Sep 13, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@Ruihan-Yin Ruihan-Yin changed the title [JIT] Optimize constant V512 vector [JIT] Optimize constant V512 vector with broadcast Sep 13, 2023
@Ruihan-Yin Ruihan-Yin closed this Sep 13, 2023
@Ruihan-Yin Ruihan-Yin reopened this Sep 13, 2023
@Ruihan-Yin
Contributor Author

Ruihan-Yin commented Sep 14, 2023

Ran the test suite twice; the remaining failures appear to be known or flaky, so I am marking the PR as ready for review.

@Ruihan-Yin Ruihan-Yin marked this pull request as ready for review September 14, 2023 23:14
@Ruihan-Yin
Contributor Author

Hi @EgorBo, this PR is ready for review. Please take a look and see whether it covers #90328, thanks!

@tannergooding
Member

tannergooding commented Sep 18, 2023

Is this going to perform better or just save on the size of the rodata section?

What about on hardware without AVX-512 (Haswell, Skylake, etc)?

What about scalar->V128, scalar->V256, scalar->V512, V128->V256, V256->V512, etc?

For AVX-512, the scalar->Vector scenario can at least be covered by an embedded broadcast. But for other scenarios, this seems like it's trading more instructions for a smaller data section.

@Ruihan-Yin
Contributor Author

Is this going to perform better or just save on the size of the rodata section?

The expected improvement is saving space in the constant data section.

What about on hardware without AVX-512 (Haswell, Skylake, etc)?

I initially intended to use VBROADCASTI32X4, which is an AVX-512-only instruction, but it seems VBROADCASTI128 can also do this job for the V256 -> V128 case.

What about scalar->V128, scalar->V256, scalar->V512, V128->V256, V256->V512, etc?

I presume the scope here is to compress a larger existing constant vector into a smaller one for a pure memory operation, which embedded broadcast might not be able to handle.

If we also want to consider compressing down to a scalar, we might have further opportunities: V128/256/512 -> Byte/Word/DWord/QWord.
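
As a rough sketch of that idea (hypothetical, not implemented in this PR), the DWord case would amount to checking that every 4-byte element of the constant holds the same value, so only 4 bytes would need to be stored and then broadcast:

#include <cstddef>
#include <cstdint>
#include <cstring>

// Returns true when every 4-byte element of a constant vector equals the first
// one, i.e. when the whole vector could be rebuilt from a single DWord via a
// broadcast.
bool IsRepeatedDword(const uint8_t* value, size_t sizeInBytes)
{
    for (size_t offset = 4; offset < sizeInBytes; offset += 4)
    {
        if (std::memcmp(value, value + offset, 4) != 0)
        {
            return false; // elements differ; a DWord broadcast cannot rebuild it
        }
    }
    return true;
}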

For AVX-512, the scalar->Vector scenario can at least be covered by an embedded broadcast. But for other scenarios, this seems like it's trading more instructions for a smaller data section.

From my understanding of #90328, the issue is about the pure store case, where the generated code is mostly:

vmovups zmm, zmmword ptr[constant section]
vmovups zmmword ptr[target], zmm

The optimization essentially replaces the first load with a broadcast instruction that takes a smaller constant operand (e.g. a 16-byte load via VBROADCASTI32X4).

I might have understood the issue incorrectly or incompletely, so please correct me if I have any misunderstanding.

@tannergooding
Member

The optimization essentially replaces the first load with a broadcast instruction that takes a smaller constant operand.

👍, if it's primarily for the case where we'd otherwise have a vmovups reg1, [addr], then it sounds great to replace that with vbroadcast reg1, [addr] where possible.

I was initially concerned it would also change:

vadd reg1, reg2, [addr]

into

vbroadcast reg3, [addr]
vadd reg1, reg2, reg3

@Ruihan-Yin
Contributor Author

The optimization essentially replaces the first load with a broadcast instruction that takes a smaller constant operand.

👍, if it's primarily for the case where we'd otherwise have a vmovups reg1, [addr], then it sounds great to replace that with vbroadcast reg1, [addr] where possible.

I was initially concerned it would also change:

vadd reg1, reg2, [addr]

into

vbroadcast reg3, [addr]
vadd reg1, reg2, reg3

I think it wouldn't cover that case (at least it is not intended to), as the entry point of this optimization is LowerStoreIndir().

@Ruihan-Yin
Contributor Author

The CI failure should be unrelated.

Hi @tannergooding @EgorBo, I added the optimizations for V512 -> V256 and V256 -> V128; I think this reaches the expected coverage and is ready for review.

@EgorBo EgorBo self-requested a review October 16, 2023 10:33
@JulieLeeMSFT
Member

@tannergooding, this community PR is ready for review. PTAL.

return;
}

if (!node->Data()->AsVecCon()->TypeIs(TYP_SIMD32) && !node->Data()->AsVecCon()->TypeIs(TYP_SIMD64))
Member

I believe you can just do:

Suggested change
if (!node->Data()->AsVecCon()->TypeIs(TYP_SIMD32) && !node->Data()->AsVecCon()->TypeIs(TYP_SIMD64))
if (!node->Data()->AsVecCon()->TypeIs(TYP_SIMD32, TYP_SIMD64))

Member

@tannergooding tannergooding left a comment


LGTM. This should get a secondary review from someone on the JIT team.

CC. @dotnet/jit-contrib

@tannergooding
Member

CC. @jakobbotsch, @EgorBo in particular

@Ruihan-Yin
Contributor Author

Hi @jakobbotsch @EgorBo, this PR is ready for review; would you please take a look? Thanks!

@jakobbotsch
Member

Since this is AVX-512 backend work I think @BruceForstall should take a look... At a quick glance it seemed a bit odd to do it during STORE_INDIR lowering when presumably constants can benefit in many other cases (as long as they're not already contained), but I am not very familiar with these instructions. I am also going to close and reopen this to rerun CI.

@BruceForstall
Member

Diffs

@BruceForstall BruceForstall merged commit a672cf1 into dotnet:main Nov 21, 2023
136 of 139 checks passed
@Ruihan-Yin
Contributor Author

Thanks everyone for reviewing this!

@github-actions github-actions bot locked and limited conversation to collaborators Dec 22, 2023