Investigate perf difference between multiple `SetAllVector128` and a `Load + Permute` sequence #27166

tannergooding · 2018-08-17T16:07:33Z

As per the comment here: dotnet/corefx#31779 (comment)

I would expect that an explicit Load + four Permute operations (for four sequential memory addresses) would be faster than (or at least as fast as) four SetAllVector128 (which should be equivalent to four loads and four permutes).

Investigate the codegen between the two to see if there is some bug blocking this optimization.
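
For reference, a minimal sketch of the two shapes being compared, not the actual benchmark code from the linked gists. It assumes .NET Core 3.0 or later, where `SetAllVector128` is, as far as I know, exposed as `Vector128.Create`, and it needs `AllowUnsafeBlocks` enabled. The class and method names are hypothetical, and the final adds exist only so both variants return something comparable.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class BroadcastSketch
{
    // Variant A: four independent broadcasts, one per sequential element.
    // Each Vector128.Create(float) call stands in for SetAllVector128.
    static unsafe Vector128<float> ViaBroadcasts(float* p)
    {
        Vector128<float> x = Vector128.Create(p[0]);
        Vector128<float> y = Vector128.Create(p[1]);
        Vector128<float> z = Vector128.Create(p[2]);
        Vector128<float> w = Vector128.Create(p[3]);
        return Sse.Add(Sse.Add(x, y), Sse.Add(z, w));
    }

    // Variant B: one (unaligned) load of all four elements, then four
    // shuffles that each replicate a single lane across the vector.
    static unsafe Vector128<float> ViaLoadAndShuffle(float* p)
    {
        Vector128<float> v = Sse.LoadVector128(p);
        Vector128<float> x = Sse.Shuffle(v, v, 0x00); // lane 0 in every slot
        Vector128<float> y = Sse.Shuffle(v, v, 0x55); // lane 1 in every slot
        Vector128<float> z = Sse.Shuffle(v, v, 0xAA); // lane 2 in every slot
        Vector128<float> w = Sse.Shuffle(v, v, 0xFF); // lane 3 in every slot
        return Sse.Add(Sse.Add(x, y), Sse.Add(z, w));
    }

    static unsafe void Main()
    {
        if (!Sse.IsSupported)
        {
            Console.WriteLine("SSE is not supported on this machine.");
            return;
        }

        float* data = stackalloc float[4] { 1f, 2f, 3f, 4f };

        Vector128<float> a = ViaBroadcasts(data);
        Vector128<float> b = ViaLoadAndShuffle(data);

        // Both variants perform the same adds on the same broadcast values,
        // so the results should compare equal lane by lane.
        Console.WriteLine(a.Equals(b));
    }
}
```

Comparing the JIT disassembly of the two methods should show whether the broadcast variant really does four loads plus four permutes, or whether something else is being emitted.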

tannergooding · 2018-08-17T16:08:09Z

CC. @fiigii, @eerhardt, @CarolEidt

Also CC. @EgorBo

tannergooding · 2018-08-17T16:08:24Z

I've self-assigned this and should get to it on Monday.

EgorBo · 2018-08-17T16:37:29Z

@tannergooding It turns out I was benchmarking that test on an old 2.1 runtime. With the latest 3.0, Shuffle is 20% faster.
JIT output:
Multiply-SetAllVector.cs: https://gist.github.com/EgorBo/84ae4be8f4a8024548615e486e603962
Multiply-Shuffle.cs: https://gist.github.com/EgorBo/c44417038ad56e8ec145ea1945257a36
C#: https://gist.github.com/EgorBo/e2717606fdc4c3a14a7fa16617f87ad8

tannergooding · 2018-08-17T16:41:19Z

Thanks for the update @EgorBo!

tannergooding self-assigned this Aug 17, 2018

tannergooding closed this as completed Aug 17, 2018

msftgits transferred this issue from dotnet/corefx Jan 31, 2020

msftgits added this to the 3.0 milestone Jan 31, 2020

tannergooding removed their assignment May 26, 2020

ghost locked as resolved and limited conversation to collaborators Dec 15, 2020
