Investigate perf difference between multiple SetAllVector128 and a Load + Permute sequence #27166
As per the comment here: dotnet/corefx#31779 (comment)
I would expect an explicit Load followed by four Permute operations (for four sequential memory addresses) to be faster than, or at least as fast as, four SetAllVector128 calls (which should be equivalent to four loads and four permutes).
Investigate the codegen for the two sequences to see whether some bug is blocking this optimization.
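For reference, the two sequences being compared can be sketched in C with SSE intrinsics. This is only an illustration under stated assumptions, not the code from the linked comment: it assumes the `float` overload of SetAllVector128 corresponds to `_mm_set1_ps`, it uses `_mm_shuffle_ps` as an SSE-only stand-in for Permute, and the function names `broadcast_four_set1` and `broadcast_load_shuffle` are hypothetical.

```c
#include <xmmintrin.h> /* SSE: _mm_set1_ps, _mm_loadu_ps, _mm_shuffle_ps, _mm_storeu_ps */

/* Variant A: four independent scalar broadcasts, one per source element.
 * Assumed to mirror four SetAllVector128 calls (each a load plus a permute).
 * Reads src[0..3], writes 16 floats to dst. */
void broadcast_four_set1(const float *src, float *dst)
{
    _mm_storeu_ps(dst + 0,  _mm_set1_ps(src[0]));
    _mm_storeu_ps(dst + 4,  _mm_set1_ps(src[1]));
    _mm_storeu_ps(dst + 8,  _mm_set1_ps(src[2]));
    _mm_storeu_ps(dst + 12, _mm_set1_ps(src[3]));
}

/* Variant B: one 128-bit load, then four in-register shuffles that each
 * replicate a single lane across the whole vector.
 * Stand-in for the explicit Load + four Permute sequence. */
void broadcast_load_shuffle(const float *src, float *dst)
{
    __m128 v = _mm_loadu_ps(src); /* single unaligned load of src[0..3] */

    _mm_storeu_ps(dst + 0,  _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0)));
    _mm_storeu_ps(dst + 4,  _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1)));
    _mm_storeu_ps(dst + 8,  _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2)));
    _mm_storeu_ps(dst + 12, _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3)));
}
```

Both functions produce the same 16 floats, so comparing the disassembly of each (and of the JIT output for the equivalent .NET code) should show whether variant A really costs four loads where variant B needs one.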