Optimize multi-dimensional array access #70271

Merged

BruceForstall
Member

Currently, multi-dimensional (MD) array access operations are treated as opaque to most of
the JIT; they pass through the optimization pipeline untouched. Lowering expands the GT_ARR_ELEM
node (representing an a[i,j] operation, for example) to GT_ARR_OFFSET and GT_ARR_INDEX trees,
to expand the register requirements of the operation. These are then directly used to generate code.

This change moves the expansion of GT_ARR_ELEM to a new pass that follows loop optimization but precedes
Value Numbering, CSE, and the rest of the optimizer. This placement leaves room for a future improvement to
loop cloning that supports cloning loops with MD references, while still letting the optimizer work on the new
expansion. One nice feature of this change: there is no machine-dependent code required; all the nodes
get lowered to machine-independent nodes before code generation.

The MDBenchI and MDBenchF micro-benchmarks (very targeted to this work) improve about 10% to 60%.

GT_ARR_ELEM nodes are morphed to appropriate trees. Note that an MD array Get, Set, or Address
operation is imported as a call, and, if all required conditions are satisfied, is treated as an intrinsic
and replaced by IR nodes, especially GT_ARR_ELEM nodes, in impArrayAccessIntrinsic().

For example, a simple 2-dimensional array access like a[i,j] looks like:

\--*  ARR_ELEM[,] byref
   +--*  LCL_VAR   ref    V00 arg0
   +--*  LCL_VAR   int    V01 arg1
   \--*  LCL_VAR   int    V02 arg2

This is replaced by:

&a + offset + elemSize * ((i - a.GetLowerBound(0)) * a.GetLength(1) + (j - a.GetLowerBound(1)))

plus the appropriate i and j bounds checks.
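
As a rough illustration of the arithmetic above (not the JIT's own code), here is a standalone C++ sketch that computes the byte offset of a[i,j] from the array object. `MdArray2D`, its fields, and `elementByteOffset` are hypothetical names made up for this example. The single unsigned compare per dimension is this sketch's way of modelling the range check, and the caller supplies the data offset and element size.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for the per-dimension metadata of a rank-2 MD array.
struct MdArray2D
{
    int32_t lowerBound[2];
    int32_t length[2];
};

// Byte offset from the array object to element [i, j]; throws if out of range.
inline int64_t elementByteOffset(const MdArray2D& a, int32_t i, int32_t j,
                                 int64_t dataOffset, int64_t elemSize)
{
    assert(elemSize > 0 && dataOffset >= 0);

    // Effective index = index - lower bound. Unsigned arithmetic wraps instead
    // of overflowing, so an index below the lower bound becomes a huge value
    // and fails the same compare that catches an index that is too large.
    uint32_t effI = static_cast<uint32_t>(i) - static_cast<uint32_t>(a.lowerBound[0]);
    if (effI >= static_cast<uint32_t>(a.length[0]))
        throw std::out_of_range("index 0 out of range");

    uint32_t effJ = static_cast<uint32_t>(j) - static_cast<uint32_t>(a.lowerBound[1]);
    if (effJ >= static_cast<uint32_t>(a.length[1]))
        throw std::out_of_range("index 1 out of range");

    // Row-major linearization in 32 bits, then widen, scale, and add the offset.
    uint32_t linear = effI * static_cast<uint32_t>(a.length[1]) + effJ;
    return dataOffset + elemSize * static_cast<int64_t>(linear);
}
```

For a zero-based 3x4 array with 4-byte elements and a data offset of 32, `elementByteOffset(a, 1, 2, 32, 4)` evaluates to 32 + 4 * (1 * 4 + 2) = 56.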

In IR, this is:

*  ADD       byref
+--*  ADD       long
|  +--*  MUL       long
|  |  +--*  CAST      long <- uint
|  |  |  \--*  ADD       int
|  |  |     +--*  MUL       int
|  |  |     |  +--*  COMMA     int
|  |  |     |  |  +--*  ASG       int
|  |  |     |  |  |  +--*  LCL_VAR   int    V04 tmp1
|  |  |     |  |  |  \--*  SUB       int
|  |  |     |  |  |     +--*  LCL_VAR   int    V01 arg1
|  |  |     |  |  |     \--*  MDARR_LOWER_BOUND int    (0)
|  |  |     |  |  |        \--*  LCL_VAR   ref    V00 arg0
|  |  |     |  |  \--*  COMMA     int
|  |  |     |  |     +--*  BOUNDS_CHECK_Rng void
|  |  |     |  |     |  +--*  LCL_VAR   int    V04 tmp1
|  |  |     |  |     |  \--*  MDARR_LENGTH int    (0)
|  |  |     |  |     |     \--*  LCL_VAR   ref    V00 arg0
|  |  |     |  |     \--*  LCL_VAR   int    V04 tmp1
|  |  |     |  \--*  MDARR_LENGTH int    (1)
|  |  |     |     \--*  LCL_VAR   ref    V00 arg0
|  |  |     \--*  COMMA     int
|  |  |        +--*  ASG       int
|  |  |        |  +--*  LCL_VAR   int    V05 tmp2
|  |  |        |  \--*  SUB       int
|  |  |        |     +--*  LCL_VAR   int    V02 arg2
|  |  |        |     \--*  MDARR_LOWER_BOUND int    (1)
|  |  |        |        \--*  LCL_VAR   ref    V00 arg0
|  |  |        \--*  COMMA     int
|  |  |           +--*  BOUNDS_CHECK_Rng void
|  |  |           |  +--*  LCL_VAR   int    V05 tmp2
|  |  |           |  \--*  MDARR_LENGTH int    (1)
|  |  |           |     \--*  LCL_VAR   ref    V00 arg0
|  |  |           \--*  LCL_VAR   int    V05 tmp2
|  |  \--*  CNS_INT   long   4
|  \--*  CNS_INT   long   32
\--*  LCL_VAR   ref    V00 arg0

before being morphed by the usual morph transformations.

Some things to consider:

  1. MD arrays have both a lower bound and length for each dimension (even if very few MD arrays actually have a
    non-zero lower bound).
  2. GT_MDARR_LOWER_BOUND(dim) represents the lower-bound value for a particular array dimension. The "effective
    index" for a dimension is the index minus the lower bound.
  3. GT_MDARR_LENGTH(dim) represents the length value (number of elements in a dimension) for a particular
    array dimension.
  4. The effective index is bounds checked against the dimension length.
  5. The lower bound and length values are 32-bit signed integers (TYP_INT).
  6. After constructing a "linearized index", the index is scaled by the array element size, and the offset from
    the array object to the beginning of the array data is added.
  7. Much of the complexity above is simply to assign temps to the various values that are used subsequently.
  8. The index expressions are used exactly once. However, if they have side effects, they need to be copied
    to temps early, to preserve exception ordering.
  9. Only the top-level operation adds the array object to the scaled, linearized index, to create the final
    address byref. As usual, we need to be careful to not create an illegal byref by adding any partial index
    calculation.
  10. To avoid doing unnecessary work, the importer sets the global OMF_HAS_MDARRAYREF flag if there are any
    MD array expressions to expand. Also, the block flag BBF_HAS_MDARRAYREF is set on blocks where these exist,
    so only those blocks are processed.
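
Points 2 through 6 and 9 generalize to any rank. The following is a minimal C++ sketch of that generalization, not the JIT implementation: `MdArrayShape`, `linearizedIndex`, and `elementAddress` are hypothetical names, and the array object is modelled as a plain byte pointer. The index stays an integer throughout, and the base pointer is added only in the final step.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical description of an MD array: one lower bound and one length per dimension.
struct MdArrayShape
{
    std::vector<int32_t> lowerBounds;
    std::vector<int32_t> lengths;
};

// Folds the per-dimension effective indices into one row-major index.
inline uint32_t linearizedIndex(const MdArrayShape& shape, const std::vector<int32_t>& indices)
{
    assert(shape.lowerBounds.size() == shape.lengths.size());
    assert(indices.size() == shape.lengths.size());

    uint32_t linear = 0;
    for (std::size_t dim = 0; dim < indices.size(); dim++)
    {
        // Effective index for this dimension, checked against its length.
        uint32_t effective = static_cast<uint32_t>(indices[dim]) -
                             static_cast<uint32_t>(shape.lowerBounds[dim]);
        if (effective >= static_cast<uint32_t>(shape.lengths[dim]))
            throw std::out_of_range("MD array index out of range");

        // Scale what has been accumulated so far by this dimension's length.
        linear = linear * static_cast<uint32_t>(shape.lengths[dim]) + effective;
    }
    return linear;
}

// Only this last step touches the base pointer; everything before it is integer math.
inline const unsigned char* elementAddress(const unsigned char* arrayObject, std::size_t dataOffset,
                                           std::size_t elemSize, uint32_t linear)
{
    return arrayObject + (dataOffset + elemSize * linear);
}
```

For rank 2 the loop reduces to `eff0 * length1 + eff1`, which matches the formula given earlier in this description.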

Remaining work:

  1. Implement optEarlyProp support for MD arrays.
  2. Implement loop cloning support for MD arrays.
  3. (optionally) Remove old GT_ARR_OFFSET and GT_ARR_INDEX nodes and related code, as well as GT_ARR_ELEM
    code used after the new expansion.

The new early expansion is enabled by default. It can be disabled (even in Release, currently), by setting
COMPlus_JitEarlyExpandMDArrays=0. If disabled, it can be selectively enabled using
COMPlus_JitEarlyExpandMDArraysFilter=<method_set> (e.g., as specified for JitDump).

Fixes #60785.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 6, 2022
@ghost ghost assigned BruceForstall Jun 6, 2022
@ghost

ghost commented Jun 6, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Author: BruceForstall
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

Comment on lines 17976 to 17989
// TODO: morph here? Or morph at the statement level if there are differences?

JITDUMP("fgMorphArrayOpsStmt (before remorph):\n");
DISPTREE(fullExpansion);

GenTree* morphedTree = m_compiler->fgMorphTree(fullExpansion);
DBEXEC(morphedTree != fullExpansion, morphedTree->gtDebugFlags &= ~GTF_DEBUG_NODE_MORPHED);

JITDUMP("fgMorphArrayOpsStmt (after remorph):\n");
DISPTREE(morphedTree);

*use = morphedTree;
JITDUMP("Morphing GT_ARR_ELEM (after)\n");
DISPTREE(*use);
Member Author

I'm re-morphing the tree here, which seems like the most targeted thing to do. But I've introduced GT_ASG nodes, and the GTF_ASG flag needs to propagate to the root. As a result, I'm also (currently) re-morphing changed trees at the statement level, below. Should I just stop re-morphing here and let the statement-level re-morph do its thing? Or should I re-morph here, exactly what was changed, and then do something else to propagate flags up the tree?

Member

I guess I'd just do it once, at the end, otherwise if there are multiple MD array accesses you are walking to the root multiple times.
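
To make the suggestion concrete, here is a toy C++ sketch of the "expand everything, then fix up once per statement" shape. It does not use the JIT's real data structures: `Node`, `kFlagAsg`, `expandAll`, `recomputeFlags`, and `processStatement` are all hypothetical, and the "expansion" only sets a flag to stand in for the introduced assignment.

```cpp
#include <cassert>
#include <vector>

constexpr unsigned kFlagAsg = 0x1; // hypothetical stand-in for an effect flag such as GTF_ASG

struct Node
{
    bool               isMdAccess = false; // placeholder for "this node is an MD array access"
    unsigned           ownFlags   = 0;     // effects produced by this node itself
    unsigned           flags      = 0;     // ownFlags plus the flags of everything below it
    std::vector<Node*> children;
};

// Pretend expansion: afterwards the node contains an assignment to a temp.
inline bool expandInPlace(Node* node)
{
    assert(node != nullptr);
    if (!node->isMdAccess)
        return false;
    node->isMdAccess = false;
    node->ownFlags |= kFlagAsg;
    return true;
}

// Expands every MD access under `node` and returns how many were expanded.
inline int expandAll(Node* node)
{
    int count = expandInPlace(node) ? 1 : 0;
    for (Node* child : node->children)
        count += expandAll(child);
    return count;
}

// One bottom-up pass that recomputes the summary flags for the whole tree.
inline unsigned recomputeFlags(Node* node)
{
    unsigned result = node->ownFlags;
    for (Node* child : node->children)
        result |= recomputeFlags(child);
    node->flags = result;
    return result;
}

// Statement-level driver: the fix-up runs once, no matter how many accesses changed.
inline bool processStatement(Node* root)
{
    if (expandAll(root) == 0)
        return false;
    recomputeFlags(root);
    return true;
}
```

With this shape, a statement containing several MD accesses pays for one fix-up traversal rather than one walk to the root per access.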

@BruceForstall
Member Author

BruceForstall commented Jun 6, 2022

[edit] Added a 2nd run to validate. MDRomber regression evaporated. MDMulMatrix regression validated.

Some perf results from the MDBenchI/MDBenchF suite

Run 1

| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated | Alloc Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MDInProd | Job-FIBCFY | baseline | 2.256 s | 0.1038 s | 0.1195 s | 2.271 s | 2.028 s | 2.486 s | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 11.22 MB | 1.00 |
| MDInProd | Job-ZOKFAR | diff | 1.835 s | 0.0759 s | 0.0874 s | 1.864 s | 1.644 s | 1.934 s | 0.81 | 0.05 | 1000.0000 | 1000.0000 | 1000.0000 | 11.22 MB | 1.00 |

| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated | Alloc Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MDInvMt | Job-FIBCFY | baseline | 6.603 ms | 0.1268 ms | 0.1245 ms | 6.578 ms | 6.476 ms | 6.942 ms | 1.00 | 20.8333 | 20.8333 | 20.8333 | 102.57 KB | 1.00 |
| MDInvMt | Job-ZOKFAR | diff | 3.033 ms | 0.0582 ms | 0.0670 ms | 3.007 ms | 2.944 ms | 3.170 ms | 0.46 | 25.0000 | 25.0000 | 25.0000 | 102.57 KB | 1.00 |

| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MDLLoops | Job-FIBCFY | baseline | 830.6 ms | 16.21 ms | 15.92 ms | 827.8 ms | 812.4 ms | 867.4 ms | 1.00 | 0.00 | 3.39 MB | 1.00 |
| MDLLoops | Job-ZOKFAR | diff | 783.8 ms | 15.43 ms | 16.51 ms | 781.7 ms | 758.4 ms | 816.7 ms | 0.94 | 0.02 | 3.39 MB | 1.00 |
| MDRomber | Job-FIBCFY | baseline | 665.3 ms | 12.61 ms | 11.80 ms | 665.5 ms | 646.5 ms | 690.7 ms | 1.00 | 0.00 | 2.44 KB | 1.00 |
| MDRomber | Job-ZOKFAR | diff | 693.1 ms | 33.63 ms | 38.73 ms | 687.3 ms | 649.6 ms | 783.0 ms | 1.05 | 0.06 | 2.44 KB | 1.00 |
| MDSqMtx | Job-FIBCFY | baseline | 930.3 ms | 27.35 ms | 31.50 ms | 917.3 ms | 899.3 ms | 1,008.3 ms | 1.00 | 0.00 | 26.81 KB | 1.00 |
| MDSqMtx | Job-ZOKFAR | diff | 846.0 ms | 20.63 ms | 23.76 ms | 839.7 ms | 827.2 ms | 931.6 ms | 0.91 | 0.03 | 26.81 KB | 1.00 |
| MDAddArray2 | Job-FIBCFY | baseline | 19.70 ms | 0.418 ms | 0.482 ms | 19.70 ms | 18.87 ms | 20.70 ms | 1.00 | 0.00 | 37 B | 1.00 |
| MDAddArray2 | Job-ZOKFAR | diff | 16.47 ms | 0.328 ms | 0.378 ms | 16.37 ms | 16.04 ms | 17.38 ms | 0.84 | 0.03 | 32 B | 0.86 |
| MDArray2 | Job-FIBCFY | baseline | 1.676 s | 0.0307 s | 0.0287 s | 1.667 s | 1.635 s | 1.738 s | 1.00 | 0.00 | 8.38 KB | 1.00 |
| MDArray2 | Job-ZOKFAR | diff | 1.314 s | 0.0209 s | 0.0195 s | 1.313 s | 1.284 s | 1.345 s | 0.78 | 0.02 | 8.38 KB | 1.00 |
| MDGeneralArray | Job-FIBCFY | baseline | 15.83 ms | 0.404 ms | 0.465 ms | 15.68 ms | 15.38 ms | 17.38 ms | 1.00 | 0.00 | 7.94 KB | 1.00 |
| MDGeneralArray | Job-ZOKFAR | diff | 11.68 ms | 0.225 ms | 0.250 ms | 11.58 ms | 11.42 ms | 12.29 ms | 0.74 | 0.03 | 7.93 KB | 1.00 |
| MDGeneralArray2 | Job-FIBCFY | baseline | 15.75 ms | 0.317 ms | 0.365 ms | 15.65 ms | 15.30 ms | 16.71 ms | 1.00 | 0.00 | 8.02 KB | 1.00 |
| MDGeneralArray2 | Job-ZOKFAR | diff | 11.70 ms | 0.293 ms | 0.337 ms | 11.58 ms | 11.27 ms | 12.66 ms | 0.74 | 0.03 | 8.01 KB | 1.00 |
| MDLogicArray | Job-FIBCFY | baseline | 450.0 ms | 11.01 ms | 12.68 ms | 444.0 ms | 435.9 ms | 474.9 ms | 1.00 | 0.00 | 10.67 KB | 1.00 |
| MDLogicArray | Job-ZOKFAR | diff | 357.4 ms | 14.81 ms | 17.05 ms | 350.8 ms | 340.9 ms | 391.1 ms | 0.79 | 0.05 | 10.67 KB | 1.00 |
| MDMidpoint | Job-FIBCFY | baseline | 717.7 ms | 18.63 ms | 21.45 ms | 709.1 ms | 688.2 ms | 752.7 ms | 1.00 | 0.00 | 39.62 KB | 1.00 |
| MDMidpoint | Job-ZOKFAR | diff | 530.3 ms | 8.37 ms | 7.83 ms | 530.7 ms | 519.1 ms | 543.5 ms | 0.74 | 0.03 | 39.34 KB | 0.99 |
| MDMulMatrix | Job-FIBCFY | baseline | 731.5 ms | 14.19 ms | 16.35 ms | 727.8 ms | 713.1 ms | 775.2 ms | 1.00 | 0.00 | 66.52 KB | 1.00 |
| MDMulMatrix | Job-ZOKFAR | diff | 1,146.3 ms | 18.69 ms | 17.49 ms | 1,147.0 ms | 1,120.9 ms | 1,181.0 ms | 1.56 | 0.03 | 66.52 KB | 1.00 |

| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Allocated | Alloc Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MDNDhrystone | Job-FIBCFY | baseline | 563.5 ms | 12.96 ms | 14.92 ms | 561.6 ms | 547.1 ms | 593.0 ms | 1.00 | 0.00 | 147000.0000 | 587.47 MB | 1.00 |
| MDNDhrystone | Job-ZOKFAR | diff | 570.5 ms | 20.20 ms | 23.26 ms | 559.9 ms | 548.1 ms | 646.2 ms | 1.01 | 0.06 | 147000.0000 | 587.47 MB | 1.00 |

| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MDPuzzle | Job-FIBCFY | baseline | 535.7 ms | 18.27 ms | 21.04 ms | 525.3 ms | 519.2 ms | 592.1 ms | 1.00 | 0.00 | 7.01 KB | 1.00 |
| MDPuzzle | Job-ZOKFAR | diff | 479.7 ms | 9.25 ms | 9.09 ms | 477.4 ms | 469.5 ms | 504.0 ms | 0.90 | 0.02 | 7.01 KB | 1.00 |
| MDXposMatrix | Job-FIBCFY | baseline | 53.18 μs | 1.403 μs | 1.615 μs | 52.33 μs | 51.84 μs | 56.82 μs | 1.00 | 0.00 | - | NA |
| MDXposMatrix | Job-ZOKFAR | diff | 31.67 μs | 0.709 μs | 0.816 μs | 31.40 μs | 30.72 μs | 33.58 μs | 0.60 | 0.02 | - | NA |

Run 2

| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated | Alloc Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MDInProd | Job-WTGEOH | baseline | 2.307 s | 0.1936 s | 0.2230 s | 2.257 s | 2.066 s | 2.972 s | 1.00 | 0.00 | 1000.0000 | 1000.0000 | 1000.0000 | 11.22 MB | 1.00 |
| MDInProd | Job-AKRICW | diff | 1.842 s | 0.0634 s | 0.0730 s | 1.856 s | 1.738 s | 1.981 s | 0.80 | 0.08 | 1000.0000 | 1000.0000 | 1000.0000 | 11.22 MB | 1.00 |
| MDInvMt | Job-WTGEOH | baseline | 6.920 ms | 0.7105 ms | 0.8182 ms | 6.569 ms | 6.473 ms | 9.745 ms | 1.00 | 0.00 | 31.2500 | 31.2500 | 31.2500 | 102.58 KB | 1.00 |
| MDInvMt | Job-AKRICW | diff | 3.128 ms | 0.2399 ms | 0.2762 ms | 3.029 ms | 2.946 ms | 3.915 ms | 0.45 | 0.03 | 31.2500 | 31.2500 | 31.2500 | 102.57 KB | 1.00 |

| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MDLLoops | Job-WTGEOH | baseline | 831.2 ms | 16.45 ms | 16.89 ms | 825.2 ms | 809.5 ms | 871.0 ms | 1.00 | 0.00 | 3.39 MB | 1.00 |
| MDLLoops | Job-AKRICW | diff | 800.9 ms | 16.13 ms | 18.58 ms | 796.9 ms | 773.7 ms | 844.0 ms | 0.96 | 0.02 | 3.39 MB | 1.00 |
| MDRomber | Job-WTGEOH | baseline | 649.2 ms | 12.92 ms | 13.82 ms | 646.6 ms | 631.2 ms | 673.7 ms | 1.00 | 0.00 | 2.11 KB | 1.00 |
| MDRomber | Job-AKRICW | diff | 640.6 ms | 11.07 ms | 10.35 ms | 636.2 ms | 625.6 ms | 662.4 ms | 0.99 | 0.03 | 2.44 KB | 1.16 |
| MDSqMtx | Job-WTGEOH | baseline | 925.1 ms | 19.25 ms | 22.17 ms | 917.0 ms | 900.9 ms | 967.4 ms | 1.00 | 0.00 | 26.53 KB | 1.00 |
| MDSqMtx | Job-AKRICW | diff | 786.3 ms | 15.58 ms | 17.94 ms | 785.1 ms | 762.6 ms | 820.6 ms | 0.85 | 0.03 | 26.81 KB | 1.01 |
| MDAddArray2 | Job-WTGEOH | baseline | 19.58 ms | 0.685 ms | 0.789 ms | 19.31 ms | 18.89 ms | 22.06 ms | 1.00 | 0.00 | 60 B | 1.00 |
| MDAddArray2 | Job-AKRICW | diff | 18.67 ms | 0.762 ms | 0.877 ms | 18.63 ms | 15.61 ms | 20.26 ms | 0.96 | 0.07 | 24 B | 0.40 |
| MDArray2 | Job-WTGEOH | baseline | 1.679 s | 0.0239 s | 0.0223 s | 1.677 s | 1.652 s | 1.735 s | 1.00 | 0.00 | 8.38 KB | 1.00 |
| MDArray2 | Job-AKRICW | diff | 1.321 s | 0.0231 s | 0.0216 s | 1.326 s | 1.289 s | 1.365 s | 0.79 | 0.02 | 8.05 KB | 0.96 |
| MDGeneralArray | Job-WTGEOH | baseline | 15.98 ms | 0.597 ms | 0.688 ms | 15.74 ms | 15.23 ms | 17.64 ms | 1.00 | 0.00 | 7.93 KB | 1.00 |
| MDGeneralArray | Job-AKRICW | diff | 12.15 ms | 0.801 ms | 0.922 ms | 11.81 ms | 11.38 ms | 14.62 ms | 0.76 | 0.05 | 7.93 KB | 1.00 |
| MDGeneralArray2 | Job-WTGEOH | baseline | 15.83 ms | 0.420 ms | 0.483 ms | 15.63 ms | 15.30 ms | 16.89 ms | 1.00 | 0.00 | 8 KB | 1.00 |
| MDGeneralArray2 | Job-AKRICW | diff | 17.14 ms | 0.490 ms | 0.564 ms | 17.00 ms | 16.18 ms | 18.42 ms | 1.08 | 0.05 | 8.01 KB | 1.00 |
| MDLogicArray | Job-WTGEOH | baseline | 452.2 ms | 14.84 ms | 17.09 ms | 445.5 ms | 437.4 ms | 511.8 ms | 1.00 | 0.00 | 10.67 KB | 1.00 |
| MDLogicArray | Job-AKRICW | diff | 351.9 ms | 17.49 ms | 20.14 ms | 345.2 ms | 338.6 ms | 428.7 ms | 0.78 | 0.05 | 10.67 KB | 1.00 |
| MDMidpoint | Job-WTGEOH | baseline | 707.4 ms | 17.40 ms | 20.04 ms | 704.0 ms | 682.4 ms | 774.7 ms | 1.00 | 0.00 | 39.29 KB | 1.00 |
| MDMidpoint | Job-AKRICW | diff | 521.7 ms | 22.16 ms | 25.52 ms | 510.1 ms | 496.2 ms | 593.2 ms | 0.74 | 0.04 | 39.9 KB | 1.02 |
| MDMulMatrix | Job-WTGEOH | baseline | 734.9 ms | 14.08 ms | 13.17 ms | 731.8 ms | 717.5 ms | 768.4 ms | 1.00 | 0.00 | 66.52 KB | 1.00 |
| MDMulMatrix | Job-AKRICW | diff | 1,170.9 ms | 36.15 ms | 41.63 ms | 1,169.4 ms | 1,120.8 ms | 1,316.6 ms | 1.60 | 0.08 | 66.52 KB | 1.00 |

| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Allocated | Alloc Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MDNDhrystone | Job-WTGEOH | baseline | 592.8 ms | 18.75 ms | 21.59 ms | 590.6 ms | 556.5 ms | 633.2 ms | 1.00 | 0.00 | 147000.0000 | 587.47 MB | 1.00 |
| MDNDhrystone | Job-AKRICW | diff | 599.5 ms | 27.49 ms | 31.66 ms | 589.8 ms | 566.8 ms | 677.3 ms | 1.01 | 0.08 | 147000.0000 | 587.47 MB | 1.00 |

| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MDPuzzle | Job-WTGEOH | baseline | 563.6 ms | 27.75 ms | 31.95 ms | 558.1 ms | 528.7 ms | 644.7 ms | 1.00 | 0.00 | 7.01 KB | 1.00 |
| MDPuzzle | Job-AKRICW | diff | 506.1 ms | 32.32 ms | 37.22 ms | 499.8 ms | 467.4 ms | 626.2 ms | 0.90 | 0.07 | 6.68 KB | 0.95 |
| MDXposMatrix | Job-WTGEOH | baseline | 56.09 μs | 2.774 μs | 3.195 μs | 55.35 μs | 52.20 μs | 65.46 μs | 1.00 | 0.00 | 4 B | 1.00 |
| MDXposMatrix | Job-AKRICW | diff | 31.77 μs | 0.650 μs | 0.749 μs | 31.35 μs | 31.08 μs | 33.46 μs | 0.57 | 0.04 | - | 0.00 |

@BruceForstall
Member Author

As seen above, MDRomber has a small regression that could be investigated.

Almost all spmi asmdiffs are improvements. There are a few outlier regressions that should be investigated, all in decimaldiv:TestEntryPoint and related test code. One effect of the new expansion is the use of more temps. In this case, we go from 2952 to 6076 temps, so it's possible we go beyond our tracked optimization limits in a bad way.

@BruceForstall
Member Author

@AndyAyersMS @dotnet/jit-contrib PTAL

@BruceForstall BruceForstall requested a review from AndyAyersMS June 6, 2022 02:09
@BruceForstall
Member Author

/azp run runtime-coreclr outerloop, runtime-coreclr jitstress, runtime-coreclr libraries-jitstress

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@BruceForstall
Member Author

/azp run runtime-coreclr gcstress0x3-gcstress0xc

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Comment on lines +18008 to +17796
// GT_ARR_ELEM nodes are morphed to appropriate trees. Note that MD array `Get`, `Set`, or `Address`
// is imported as a call, and, if all required conditions are satisfied, is treated as an intrinsic
// and replaced by IR nodes, especially GT_ARR_ELEM nodes, in impArrayAccessIntrinsic().
Contributor

Not an immediate concern with this change, but I wonder if you have some ideas on how to approach adding VN support for this early expansion.

The SZ case utilizes a parser utility that tries to reconstruct whatever morph left, the (significantly) more complex MD trees look less amenable to that.

Member Author

That's an excellent question, and needs more thought.

@jakobbotsch
Member

Any idea why the coreclr_tests.pmi.windows.x64.checked.mch TP impact is so high? Do we have some particularly crazy MD array tests there?

@kunalspathak
Member

> Any idea why the coreclr_tests.pmi.windows.x64.checked.mch TP impact is so high

Yeah, I was looking at those too. Seems we have cases like
https://github.com/dotnet/runtime/blob/4881a639e7c3f27b5a8d2d160e234d8055333cda/src/tests/JIT/Methodical/divrem/div/r4div.cs that has high code size increase too.

@AndyAyersMS
Member

> As seen above, MDRomber has a small regression that could be investigated.

MDMulMatrix also has a (big?) regression

@BruceForstall
Member Author

> Any idea why the coreclr_tests.pmi.windows.x64.checked.mch TP impact is so high? Do we have some particularly crazy MD array tests there?

I need to investigate. Maybe related to the large size regressions in the cases I mentioned and Kunal pointed out. Over 7% on win-x64 is pretty extreme given that I doubt many tests even have MD arrays.

@BruceForstall
Member Author

Test failures are, AFAICT, all in baseline, or infra:

runtime

  • Installer Build and Test coreclr windows_x86 Debug
##[error].packages\microsoft.dotnet.arcade.sdk\7.0.0-beta.22266.1\tools\VSTest.targets(55,5): error MSB3491: (NETCORE_ENGINEERING_TELEMETRY=Build) Could not write lines to file "D:\a\_work\1\s\artifacts\log\Debug\Microsoft.NET.HostModel.ComHost.Tests_net7.0_x86.log". The process cannot access the file 'D:\a\_work\1\s\artifacts\log\Debug\Microsoft.NET.HostModel.ComHost.Tests_net7.0_x86.log' because it is being used by another process.

runtime-coreclr gcstress0x3-gcstress0xc

runtime-coreclr jitstress

runtime-coreclr libraries-jitstress

Member

@AndyAyersMS AndyAyersMS left a comment

Looks good overall.

You might consider keeping a temp cache in fgMorphArrayOps and mark all temps as not in use after each statement.

We do this in other places (eg for struct arg passing, and importer box temps) to try and keep the total number of temps reasonable.

SSA will see these recycled temps as having many distinct lifetimes, so it should not inhibit opts.
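
A temp cache of the kind suggested here could look roughly like the C++ sketch below. This is only an illustration under simplifying assumptions: `TempCache` and its members are hypothetical names, temps are plain numbers, and all temps are treated as one type (a real cache would presumably need to keep temps of different types apart).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hands out temp numbers, reusing ones released at the end of the previous statement.
class TempCache
{
public:
    explicit TempCache(unsigned firstNewTempNum) : m_nextNewTempNum(firstNewTempNum) {}

    // Returns a temp that is not in use in the current statement.
    unsigned grab()
    {
        assert(m_inUseCount <= m_temps.size());
        if (m_inUseCount == m_temps.size())
        {
            // Nothing left to recycle: create a brand-new temp.
            m_temps.push_back(m_nextNewTempNum++);
        }
        return m_temps[m_inUseCount++];
    }

    // Call after each statement: every cached temp becomes available again.
    void releaseAll() { m_inUseCount = 0; }

    // Total number of distinct temps ever created through this cache.
    std::size_t totalCreated() const { return m_temps.size(); }

private:
    unsigned              m_nextNewTempNum;
    std::vector<unsigned> m_temps;
    std::size_t           m_inUseCount = 0;
};
```

The number of distinct temps is then bounded by the most temps any single statement needs, rather than growing with every MD array access in the method.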

// This is only enabled when early MD expansion is set because it causes small
// asm diffs (only in some test cases) otherwise. The GT_ARR_ELEM lowering code "accidentally" does
// this cast, but the new code requires it to be explicit.
argVal = impImplicitIorI4Cast(argVal, TYP_INT);
Member

Seems dangerous to only add the cast under DEBUG.

Member Author

@BruceForstall BruceForstall Jun 7, 2022

Oops; I moved the "enabling" code from DEBUG to all-flavor last-minute but didn't update this.

for (unsigned i = 0; i < arrElem->gtArrRank; i++)
{
    GenTree* idx = arrElem->gtArrInds[i];
    if ((idx->gtFlags & GTF_ALL_EFFECT) == 0)
Member

If idx is side-effect free but nontrivial you will want to use a temp too, otherwise you might duplicate a lot of stuff and force CSE to clean up after you.

Member Author

In the case without a temp, I just use the idx tree directly, so there is no copy. Reminds me that I should DEBUG_DESTROY_NODE the GT_ARR_ELEM node.

Member

Ah, they are single use, it's the effective index that is multiple use.

Member

@AndyAyersMS AndyAyersMS left a comment

Changes LGTM.

We should verify this temp cache fixes the TP issues. Not sure if you did this locally, but I can't tell this yet from CI as you have merge conflicts to resolve.

@BruceForstall
Member Author

> We should verify this temp cache fixes the TP issues. Not sure if you did this locally, but I can't tell this yet from CI as you have merge conflicts to resolve.

The temp cache improved CQ of MDMulMatrix (and maybe others), but it's still slower than before. So, I still need to investigate MDMulMatrix CQ, as well as checking the big asm diffs regressions for the TestEntryPoint cases (I'm guessing the temp cache helped), and check TP.

@BruceForstall
Member Author

BruceForstall commented Jun 16, 2022

MDMulMatrix:

This test has 1 doubly-nested and 6 triply-nested loops. If I split out all the loops to individual benchmarks, they are up to 25% better with this change (one has no perf change: the "lkj" loop). If I reduce the number of loop nests in the function, when there are 3 or more, this change is slower, up to 1.5x the baseline. Seems like there must be issues with quantity of IR/temps.

[Edit]

Various MDMulMatrix subset perf runs
| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MDMulMatrix | Job-SGUYGM | base | 782.6 ms | 17.72 ms | 20.40 ms | 775.1 ms | 763.9 ms | 836.6 ms | 1.00 | 0.00 | 66.52 KB | 1.00 |
| MDMulMatrix | Job-KNFIHE | diff | 954.6 ms | 23.39 ms | 26.93 ms | 943.8 ms | 925.1 ms | 1,017.0 ms | 1.22 | 0.04 | 66.52 KB | 1.00 |
| MDMulMatrix1 | Job-SGUYGM | base | 2.926 ms | 0.0551 ms | 0.0590 ms | 2.911 ms | 2.856 ms | 3.064 ms | 1.00 | 0.00 | 66.05 KB | 1.00 |
| MDMulMatrix1 | Job-KNFIHE | diff | 2.243 ms | 0.0476 ms | 0.0548 ms | 2.227 ms | 2.195 ms | 2.427 ms | 0.77 | 0.02 | 66.05 KB | 1.00 |
| MDMulMatrix2 | Job-SGUYGM | base | 147.9 ms | 3.41 ms | 3.93 ms | 146.7 ms | 144.6 ms | 161.4 ms | 1.00 | 0.00 | 66.28 KB | 1.00 |
| MDMulMatrix2 | Job-KNFIHE | diff | 136.4 ms | 4.41 ms | 5.08 ms | 134.8 ms | 131.8 ms | 150.1 ms | 0.92 | 0.04 | 66.28 KB | 1.00 |
| MDMulMatrix3 | Job-SGUYGM | base | 243.6 ms | 6.67 ms | 7.68 ms | 242.6 ms | 233.9 ms | 266.5 ms | 1.00 | 0.00 | 66.52 KB | 1.00 |
| MDMulMatrix3 | Job-KNFIHE | diff | 269.1 ms | 4.31 ms | 4.04 ms | 269.7 ms | 262.6 ms | 276.2 ms | 1.10 | 0.03 | 66.52 KB | 1.00 |
| MDMulMatrix4 | Job-SGUYGM | base | 351.2 ms | 12.92 ms | 14.88 ms | 344.9 ms | 337.5 ms | 397.9 ms | 1.00 | 0.00 | 66.52 KB | 1.00 |
| MDMulMatrix4 | Job-KNFIHE | diff | 537.5 ms | 10.77 ms | 12.40 ms | 532.8 ms | 521.3 ms | 562.8 ms | 1.53 | 0.08 | 66.52 KB | 1.00 |
| MDMulMatrix5 | Job-SGUYGM | base | 493.7 ms | 7.41 ms | 6.93 ms | 494.3 ms | 482.5 ms | 504.8 ms | 1.00 | 0.00 | 66.52 KB | 1.00 |
| MDMulMatrix5 | Job-KNFIHE | diff | 673.4 ms | 13.15 ms | 13.50 ms | 673.4 ms | 659.0 ms | 704.0 ms | 1.36 | 0.03 | 66.52 KB | 1.00 |
| MDMulMatrix6 | Job-SGUYGM | base | 641.1 ms | 12.82 ms | 13.16 ms | 636.8 ms | 623.9 ms | 669.8 ms | 1.00 | 0.00 | 66.52 KB | 1.00 |
| MDMulMatrix6 | Job-KNFIHE | diff | 911.3 ms | 41.99 ms | 48.35 ms | 901.0 ms | 885.4 ms | 1,109.2 ms | 1.42 | 0.09 | 66.52 KB | 1.00 |
| MDMulMatrix_jkl | Job-SGUYGM | base | 156.1 ms | 3.70 ms | 4.27 ms | 154.4 ms | 152.5 ms | 170.0 ms | 1.00 | 0.00 | 66.28 KB | 1.00 |
| MDMulMatrix_jkl | Job-KNFIHE | diff | 124.3 ms | 3.73 ms | 4.29 ms | 122.2 ms | 121.0 ms | 137.5 ms | 0.80 | 0.04 | 66.28 KB | 1.00 |
| MDMulMatrix_jlk | Job-SGUYGM | base | 156.0 ms | 3.02 ms | 3.36 ms | 154.1 ms | 151.4 ms | 163.7 ms | 1.00 | 0.00 | 66.28 KB | 1.00 |
| MDMulMatrix_jlk | Job-KNFIHE | diff | 122.7 ms | 2.41 ms | 2.25 ms | 122.2 ms | 119.7 ms | 126.6 ms | 0.79 | 0.03 | 66.28 KB | 1.00 |
| MDMulMatrix_kjl | Job-SGUYGM | base | 145.9 ms | 4.11 ms | 4.74 ms | 143.4 ms | 140.9 ms | 155.8 ms | 1.00 | 0.00 | 66.28 KB | 1.00 |
| MDMulMatrix_kjl | Job-KNFIHE | diff | 123.8 ms | 2.73 ms | 3.14 ms | 121.9 ms | 120.8 ms | 129.2 ms | 0.85 | 0.04 | 66.28 KB | 1.00 |
| MDMulMatrix_klj | Job-SGUYGM | base | 145.7 ms | 3.81 ms | 4.39 ms | 144.5 ms | 140.5 ms | 156.8 ms | 1.00 | 0.00 | 66.28 KB | 1.00 |
| MDMulMatrix_klj | Job-KNFIHE | diff | 134.5 ms | 2.65 ms | 2.94 ms | 133.3 ms | 130.8 ms | 139.7 ms | 0.93 | 0.04 | 66.28 KB | 1.00 |
| MDMulMatrix_ljk | Job-SGUYGM | base | 145.0 ms | 3.27 ms | 3.76 ms | 143.4 ms | 141.5 ms | 153.8 ms | 1.00 | 0.00 | 66.28 KB | 1.00 |
| MDMulMatrix_ljk | Job-KNFIHE | diff | 133.6 ms | 2.59 ms | 2.98 ms | 132.2 ms | 131.4 ms | 142.8 ms | 0.92 | 0.03 | 66.28 KB | 1.00 |
| MDMulMatrix_lkj | Job-SGUYGM | base | 145.2 ms | 2.78 ms | 2.60 ms | 145.4 ms | 140.4 ms | 151.8 ms | 1.00 | 0.00 | 66.28 KB | 1.00 |
| MDMulMatrix_lkj | Job-KNFIHE | diff | 146.0 ms | 4.48 ms | 5.15 ms | 144.3 ms | 141.0 ms | 159.7 ms | 1.01 | 0.03 | 66.28 KB | 1.00 |
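
As a quick sanity check on the Ratio column, the quotient of the diff and base means can be recomputed by hand. The sketch below does that for the loop-nest-count variants, using mean times copied from the table above. It assumes Ratio is approximately `diff mean / base mean`; the benchmarking tool may compute the column differently, so small rounding differences would not be surprising. The names `MEANS_MS`, `ratio`, and `regressions` are illustrative only.

```python
# Mean times in ms, copied from the table above: (base, diff) per benchmark.
MEANS_MS = {
    "MDMulMatrix": (782.6, 954.6),
    "MDMulMatrix1": (2.926, 2.243),
    "MDMulMatrix2": (147.9, 136.4),
    "MDMulMatrix3": (243.6, 269.1),
    "MDMulMatrix4": (351.2, 537.5),
    "MDMulMatrix5": (493.7, 673.4),
    "MDMulMatrix6": (641.1, 911.3),
}


def ratio(base_ms, diff_ms):
    """Approximate the Ratio column as diff mean divided by base mean."""
    return diff_ms / base_ms


def regressions(means, threshold=1.0):
    """Return the benchmarks whose diff is slower than base, in table order."""
    return [name for name, (b, d) in means.items() if ratio(b, d) > threshold]


if __name__ == "__main__":
    for name, (b, d) in MEANS_MS.items():
        print(f"{name:14s} {ratio(b, d):.2f}")
    print("slower with the change:", regressions(MEANS_MS))
```

Under that assumption, the variants with 3 or more loop nests all come out above 1.0, which matches the observation in the comment.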

@BruceForstall BruceForstall force-pushed the OptimizeMultiDimensionalArrays branch from aca4585 to 2c63c4b Compare June 16, 2022 01:44
@BruceForstall
Member Author

Size regressions:

With the temp cache implemented, all the TestEntryPoint regressions noted above become improvements:

Top method improvements (bytes):
      -30527 (-14.25% of base) : 248173.dasm - decimaldiv:TestEntryPoint():int
      -30527 (-14.25% of base) : 248183.dasm - decimalrem:TestEntryPoint():int
      -28782 (-21.19% of base) : 7577.dasm - i4rem:TestEntryPoint():int
      -27451 (-20.25% of base) : 7597.dasm - i8rem:TestEntryPoint():int
      -27055 (-19.35% of base) : 7660.dasm - u8rem:TestEntryPoint():int
      -26583 (-20.57% of base) : 7587.dasm - i8div:TestEntryPoint():int
      -26188 (-20.50% of base) : 7567.dasm - i4div:TestEntryPoint():int
      -26144 (-19.83% of base) : 7650.dasm - u8div:TestEntryPoint():int
      -25783 (-21.23% of base) : 7628.dasm - r8div:TestEntryPoint():int
      -25599 (-20.84% of base) : 7607.dasm - r4div:TestEntryPoint():int
      -22808 (-18.07% of base) : 7638.dasm - r8rem:TestEntryPoint():int
      -22729 (-17.78% of base) : 7618.dasm - r4rem:TestEntryPoint():int
       -7138 (-26.43% of base) : 13480.dasm - r4NaNdiv:TestEntryPoint():int
       -6733 (-25.26% of base) : 13485.dasm - r8NaNdiv:TestEntryPoint():int
       -6733 (-25.26% of base) : 13206.dasm - r8NaNdiv:TestEntryPoint():int
       -6664 (-24.78% of base) : 13153.dasm - r4NaNadd:TestEntryPoint():int
       -6588 (-24.39% of base) : 13163.dasm - r4NaNdiv:TestEntryPoint():int
       -6549 (-24.92% of base) : 13483.dasm - r4NaNsub:TestEntryPoint():int
       -6407 (-24.56% of base) : 13481.dasm - r4NaNmul:TestEntryPoint():int
       -6378 (-23.41% of base) : 13171.dasm - r4NaNmul:TestEntryPoint():int

Also, the improvements far outweigh the regressions, e.g., for coreclr_tests: Total bytes of delta: -586084 (-0.48 % of base).

MDMulMatrix is an outlier for how large the size regression is: 652 (44.81% of base) : 32267.dasm - Benchstone.MDBenchI.MDMulMatrix:Inner(System.Int32[,],System.Int32[,],System.Int32[,])
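
To put the "% of base" figures in absolute terms, the base method size can be backed out from a byte delta and its percentage. This is a rough sketch: the percentages are rounded to two decimals in the report, so the result is only approximate, and `implied_base_bytes` is a hypothetical helper name, not part of any diff tooling.

```python
def implied_base_bytes(delta_bytes, percent_of_base):
    """Estimate the base size from a delta and the reported '% of base'.

    delta = base * percent / 100, so base = delta * 100 / percent.
    The reported percentage is rounded, so treat the result as approximate.
    """
    return round(delta_bytes * 100.0 / percent_of_base)


if __name__ == "__main__":
    # MDMulMatrix:Inner regression: 652 bytes, 44.81% of base.
    base = implied_base_bytes(652, 44.81)
    print(base, base + 652)
    # decimaldiv:TestEntryPoint improvement: -30527 bytes, -14.25% of base.
    print(implied_base_bytes(-30527, -14.25))
```

By this estimate the MDMulMatrix `Inner` method is small in absolute terms (roughly 1.5 KB before the change), while the TestEntryPoint methods are two orders of magnitude larger.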

@BruceForstall
Member Author

Diffs: https://dev.azure.com/dnceng/public/_build/results?buildId=1829149&view=ms.vss-build-web.run-extensions-tab

The TP diffs still show a significant regression on the coreclr_tests SPMI run, but that's probably just because that's where we actually have MD array accesses and the most asm code diffs. E.g., for win-x64:

  • code size: Total bytes of delta: -586084 (-0.48 % of base)
  • TP: coreclr_tests.pmi.windows.x64.checked.mch +6.22%

@jakobbotsch
Member

One thing you can consider is hacking SPMI to produce a table of functions with the number of instructions executed for each context. That might help narrow down whether the regression is expected.
We already have the number of instructions executed on a per-method basis, so it should not be too hard. For example, a crude but simple approach would be to just print baseMetrics.NumExecutedInstructions here:

LogDebug("Method %d compiled in %fms, result %d", reader->GetMethodContextIndex(), st3.GetMilliseconds(), res);

and diffMetrics.NumExecutedInstructions here:

LogDebug("Method %d compiled by JIT2 in %fms, result %d", reader->GetMethodContextIndex(), st4.GetMilliseconds(), res2);

and then post process this into something.
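
The post-processing step could look something like the sketch below. It assumes a hypothetical log format: the two `LogDebug` messages quoted above, each extended with a made-up `, instructions N` suffix carrying the instruction count. That suffix, and the helper names `parse_counts` and `instruction_deltas`, are assumptions for illustration; SPMI does not emit this today.

```python
import re

# ASSUMPTION: each log line is one of the two LogDebug messages quoted above,
# with a hypothetical ", instructions N" suffix appended.
LINE_RE = re.compile(
    r"Method (?P<ctx>\d+) compiled (?P<jit2>by JIT2 )?in [\d.]+ms, "
    r"result -?\d+, instructions (?P<count>\d+)"
)


def parse_counts(lines):
    """Split log lines into {context index: instruction count} for base and diff."""
    base, diff = {}, {}
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # not one of the two messages we care about
        target = diff if m.group("jit2") else base
        target[int(m.group("ctx"))] = int(m.group("count"))
    return base, diff


def instruction_deltas(base, diff):
    """Return (context, base, diff, percent change) rows, worst regression first."""
    rows = []
    for ctx in base.keys() & diff.keys():
        b, d = base[ctx], diff[ctx]
        if b > 0:
            rows.append((ctx, b, d, 100.0 * (d - b) / b))
    rows.sort(key=lambda r: r[3], reverse=True)
    return rows


# Made-up sample input, for illustration only.
SAMPLE_LOG = [
    "Method 1 compiled in 1.500000ms, result 0, instructions 1000",
    "Method 1 compiled by JIT2 in 1.700000ms, result 0, instructions 1100",
    "Method 2 compiled in 0.300000ms, result 0, instructions 500",
    "Method 2 compiled by JIT2 in 0.300000ms, result 0, instructions 500",
]

if __name__ == "__main__":
    for ctx, b, d, pct in instruction_deltas(*parse_counts(SAMPLE_LOG)):
        print(f"context {ctx}: {b} -> {d} ({pct:+.2f}%)")
```

Sorting by percent change puts the contexts most affected by the change at the top, which is the table needed to judge whether the throughput regression is concentrated in methods with MD array accesses.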

@BruceForstall
Member Author

> One thing you can consider

Thanks for the suggestion. I was going to figure out which method contexts are affected (at all) by my change, extract them into a separate mch, and then use JitTimeLogCsv. A PerfView diff of an SPMI replay (of the full coreclr_tests collection) with and without my change shows a significant increase in fgMorphSmpOp and GenTreeVisitor::WalkTree, so I should look at total IR size before and after as well.

@jakobbotsch
Member

> JitTimeLogCsv

Didn't know about this, looks useful. Maybe I should add the precise instruction count data in this mechanism too.

Configuration variables:
1. `COMPlus_JitEarlyExpandMDArrays`. Set to zero to disable early MD expansion. Default is 1 (enabled).
2. `COMPlus_JitEarlyExpandMDArraysFilter`. If `COMPlus_JitEarlyExpandMDArrays=0`, use this to selectively
enable early MD expansion for a set of methods (using the same syntax as `JitDump`).
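
A small sketch of how the two variables above might be combined when launching a test process from a script. The helper name `jit_md_env` and the example filter value `Inner` are assumptions for illustration. Rejecting a filter when expansion is globally enabled is a choice made by this sketch, since the description only defines the filter for the disabled case.

```python
import os


def jit_md_env(early_expand=True, method_filter=None, base_env=None):
    """Build an environment dict with the early MD expansion variables set.

    early_expand=False sets COMPlus_JitEarlyExpandMDArrays=0; a method_filter
    then re-enables expansion only for the named methods.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["COMPlus_JitEarlyExpandMDArrays"] = "1" if early_expand else "0"
    if method_filter is not None:
        if early_expand:
            # Sketch-level choice: the filter is only described for the "=0" case.
            raise ValueError("method_filter only applies when early_expand=False")
        env["COMPlus_JitEarlyExpandMDArraysFilter"] = method_filter
    return env


if __name__ == "__main__":
    # "Inner" is an illustrative method name, not a recommended setting.
    env = jit_md_env(early_expand=False, method_filter="Inner", base_env={})
    for key in sorted(env):
        print(f"{key}={env[key]}")
```

The returned dict can be passed as the `env` argument of `subprocess.run` for whatever command runs the benchmark or test under a JIT build that reads these variables.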
@BruceForstall BruceForstall force-pushed the OptimizeMultiDimensionalArrays branch 2 times, most recently from b2203da to a0d0812 Compare July 2, 2022 00:32
@BruceForstall
Member Author

/azp run runtime-coreclr outerloop, runtime-coreclr jitstress, runtime-coreclr libraries-jitstress

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@EgorBo
Member

EgorBo commented Jul 12, 2022

Improvements on Linux-x64 dotnet/perf-autofiling-issues#6721

Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Development

Successfully merging this pull request may close these issues.

RyuJIT: Improving multi-dimensional array accesses