Optimize multi-dimensional array access #70271

BruceForstall · 2022-06-06T01:58:42Z

Currently, multi-dimensional (MD) array access operations are treated as opaque to most of
the JIT; they pass through the optimization pipeline untouched. Lowering expands the GT_ARR_ELEM
node (representing an a[i,j] operation, for example) to GT_ARR_OFFSET and GT_ARR_INDEX trees,
to expose the register requirements of the operation. These are then directly used to generate code.

This change moves the expansion of GT_ARR_ELEM to a new pass that follows loop optimization but precedes
Value Numbering, CSE, and the rest of the optimizer. This placement allows for future improvement to
loop cloning to support cloning loops with MD references, while still allowing the optimizer to kick in
on the new expansion. One nice feature of this change: there is no machine-dependent code required; all
the nodes get lowered to machine-independent nodes before code generation.

The MDBenchI and MDBenchF micro-benchmarks (very targeted to this work) improve about 10% to 60%.

GT_ARR_ELEM nodes are morphed to appropriate trees. Note that an MD array Get, Set, or Address
operation is imported as a call, and, if all required conditions are satisfied, is treated as an intrinsic
and replaced by IR nodes, especially GT_ARR_ELEM nodes, in impArrayAccessIntrinsic().

For example, a simple 2-dimensional array access like a[i,j] looks like:

\--*  ARR_ELEM[,] byref
   +--*  LCL_VAR   ref    V00 arg0
   +--*  LCL_VAR   int    V01 arg1
   \--*  LCL_VAR   int    V02 arg2

This is replaced by:

&a + offset + elemSize * ((i - a.GetLowerBound(0)) * a.GetLength(1) + (j - a.GetLowerBound(1)))

plus the appropriate i and j bounds checks.
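
As a rough illustration of the formula above, here is a minimal C++ sketch that computes the byte offset of a[i,j] from the array object. The `MDArray2DShape` struct and `ElementByteOffset` function are hypothetical names invented for this example, not JIT or runtime types. The single unsigned comparison per dimension is an assumption about how such a range check is commonly implemented, and the element size (4) and data offset (32) used in the usage note are simply the constants that appear in the IR dump in this description.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical description of a 2-D array; field names are illustrative only.
struct MDArray2DShape
{
    std::int32_t lowerBound[2]; // a.GetLowerBound(dim)
    std::int32_t length[2];     // a.GetLength(dim)
    std::size_t  elemSize;      // size of one element in bytes
    std::size_t  dataOffset;    // bytes from the array object to the first element
};

// Returns the byte offset of a[i, j] relative to the array object.
// Throws std::out_of_range if either effective index is outside its dimension.
std::size_t ElementByteOffset(const MDArray2DShape& a, std::int32_t i, std::int32_t j)
{
    // Effective index = index - lower bound. Unsigned subtraction wraps, so an
    // index below the lower bound becomes a huge value and fails the check below.
    const std::uint32_t effI =
        static_cast<std::uint32_t>(i) - static_cast<std::uint32_t>(a.lowerBound[0]);
    if (effI >= static_cast<std::uint32_t>(a.length[0]))
    {
        throw std::out_of_range("index out of range in dimension 0");
    }

    const std::uint32_t effJ =
        static_cast<std::uint32_t>(j) - static_cast<std::uint32_t>(a.lowerBound[1]);
    if (effJ >= static_cast<std::uint32_t>(a.length[1]))
    {
        throw std::out_of_range("index out of range in dimension 1");
    }

    // Linearized index, computed in 32 bits and then widened before scaling.
    const std::uint32_t linear = effI * static_cast<std::uint32_t>(a.length[1]) + effJ;

    return a.dataOffset + static_cast<std::size_t>(linear) * a.elemSize;
}
```

For a 3x4 array with zero lower bounds, 4-byte elements, and a 32-byte data offset, `ElementByteOffset(shape, 1, 2)` evaluates to 32 + 4 * (1 * 4 + 2).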

In IR, this is:

*  ADD       byref
+--*  ADD       long
|  +--*  MUL       long
|  |  +--*  CAST      long <- uint
|  |  |  \--*  ADD       int
|  |  |     +--*  MUL       int
|  |  |     |  +--*  COMMA     int
|  |  |     |  |  +--*  ASG       int
|  |  |     |  |  |  +--*  LCL_VAR   int    V04 tmp1
|  |  |     |  |  |  \--*  SUB       int
|  |  |     |  |  |     +--*  LCL_VAR   int    V01 arg1
|  |  |     |  |  |     \--*  MDARR_LOWER_BOUND int    (0)
|  |  |     |  |  |        \--*  LCL_VAR   ref    V00 arg0
|  |  |     |  |  \--*  COMMA     int
|  |  |     |  |     +--*  BOUNDS_CHECK_Rng void
|  |  |     |  |     |  +--*  LCL_VAR   int    V04 tmp1
|  |  |     |  |     |  \--*  MDARR_LENGTH int    (0)
|  |  |     |  |     |     \--*  LCL_VAR   ref    V00 arg0
|  |  |     |  |     \--*  LCL_VAR   int    V04 tmp1
|  |  |     |  \--*  MDARR_LENGTH int    (1)
|  |  |     |     \--*  LCL_VAR   ref    V00 arg0
|  |  |     \--*  COMMA     int
|  |  |        +--*  ASG       int
|  |  |        |  +--*  LCL_VAR   int    V05 tmp2
|  |  |        |  \--*  SUB       int
|  |  |        |     +--*  LCL_VAR   int    V02 arg2
|  |  |        |     \--*  MDARR_LOWER_BOUND int    (1)
|  |  |        |        \--*  LCL_VAR   ref    V00 arg0
|  |  |        \--*  COMMA     int
|  |  |           +--*  BOUNDS_CHECK_Rng void
|  |  |           |  +--*  LCL_VAR   int    V05 tmp2
|  |  |           |  \--*  MDARR_LENGTH int    (1)
|  |  |           |     \--*  LCL_VAR   ref    V00 arg0
|  |  |           \--*  LCL_VAR   int    V05 tmp2
|  |  \--*  CNS_INT   long   4
|  \--*  CNS_INT   long   32
\--*  LCL_VAR   ref    V00 arg0

before being morphed by the usual morph transformations.

Some things to consider:

- MD arrays have both a lower bound and a length for each dimension (even if very few MD arrays actually have a
  non-zero lower bound).
- GT_MDARR_LOWER_BOUND(dim) represents the lower-bound value for a particular array dimension. The "effective
  index" for a dimension is the index minus the lower bound.
- GT_MDARR_LENGTH(dim) represents the length value (number of elements in a dimension) for a particular
  array dimension.
- The effective index is bounds checked against the dimension length.
- The lower bound and length values are 32-bit signed integers (TYP_INT).
- After constructing a "linearized index", the index is scaled by the array element size, and the offset from
  the array object to the beginning of the array data is added.
- Much of the complexity above is simply to assign temps to the various values that are used subsequently.
- The index expressions are used exactly once. However, if they have side effects, they need to be copied, early,
  to preserve exception ordering.
- Only the top-level operation adds the array object to the scaled, linearized index, to create the final
  address byref. As usual, we need to be careful to not create an illegal byref by adding any partial index
  calculation.
- To avoid doing unnecessary work, the importer sets the global OMF_HAS_MDARRAYREF flag if there are any
  MD array expressions to expand. Also, the block flag BBF_HAS_MDARRAYREF is set on blocks where these exist,
  so only those blocks are processed.
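
The temp and exception-ordering points in the list above can be sketched for an arbitrary rank. This is a simplified model, not the JIT's code: `DimBounds` and `LinearizedIndex` are hypothetical names, and each index expression is modeled as a callable so that its side effects are observable. The sketch assumes the intended ordering is that every index expression is evaluated, left to right, before any bounds check can throw.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical per-dimension bounds; names are illustrative, not JIT types.
struct DimBounds
{
    std::int32_t lowerBound;
    std::int32_t length;
};

// Computes the linearized element index for a rank-N access.
std::uint32_t LinearizedIndex(const std::vector<DimBounds>&                     dims,
                              const std::vector<std::function<std::int32_t()>>& indexExprs)
{
    if (dims.size() != indexExprs.size())
    {
        throw std::invalid_argument("rank mismatch");
    }

    // Step 1: evaluate each index expression exactly once, left to right, into
    // temps, so all side effects happen before any range check can throw.
    std::vector<std::int32_t> temps;
    temps.reserve(indexExprs.size());
    for (const auto& expr : indexExprs)
    {
        temps.push_back(expr());
    }

    // Step 2: per dimension, form the effective index, check it against the
    // dimension length, and fold it into the running linearized index.
    std::uint32_t linear = 0;
    for (std::size_t d = 0; d < dims.size(); ++d)
    {
        const std::uint32_t effective =
            static_cast<std::uint32_t>(temps[d]) - static_cast<std::uint32_t>(dims[d].lowerBound);
        if (effective >= static_cast<std::uint32_t>(dims[d].length))
        {
            throw std::out_of_range("index out of range");
        }
        linear = linear * static_cast<std::uint32_t>(dims[d].length) + effective;
    }
    return linear;
}
```

The caller would still scale the result by the element size and add the data offset and the array object in a final step, matching the last-step byref creation described in the list.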

Remaining work:

- Implement optEarlyProp support for MD arrays.
- Implement loop cloning support for MD arrays.
- (optionally) Remove old GT_ARR_OFFSET and GT_ARR_INDEX nodes and related code, as well as GT_ARR_ELEM
  code used after the new expansion.

The new early expansion is enabled by default. It can be disabled (even in Release, currently), by setting
COMPlus_JitEarlyExpandMDArrays=0. If disabled, it can be selectively enabled using
COMPlus_JitEarlyExpandMDArraysFilter=<method_set> (e.g., as specified for JitDump).

Fixes #60785.

ghost · 2022-06-06T01:58:52Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	BruceForstall
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

BruceForstall · 2022-06-06T02:01:55Z

src/coreclr/jit/morph.cpp

+            // TODO: morph here? Or morph at the statement level if there are differences?
+
+            JITDUMP("fgMorphArrayOpsStmt (before remorph):\n");
+            DISPTREE(fullExpansion);
+
+            GenTree* morphedTree = m_compiler->fgMorphTree(fullExpansion);
+            DBEXEC(morphedTree != fullExpansion, morphedTree->gtDebugFlags &= ~GTF_DEBUG_NODE_MORPHED);
+
+            JITDUMP("fgMorphArrayOpsStmt (after remorph):\n");
+            DISPTREE(morphedTree);
+
+            *use = morphedTree;
+            JITDUMP("Morphing GT_ARR_ELEM (after)\n");
+            DISPTREE(*use);


I'm re-morphing the tree here, which seems like the most targeted thing to do. But I've introduced GT_ASG nodes, and the GTF_ASG flag needs to propagate to the root. As a result, I'm also (currently) re-morphing changed trees at the statement level, below. Should I just stop re-morphing here and let the statement-level re-morph do its thing? Or should I re-morph here, exactly what was changed, and then do something else to propagate flags up the tree?

I guess I'd just do it once, at the end, otherwise if there are multiple MD array accesses you are walking to the root multiple times.
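
To make the "do it once, at the end" suggestion concrete, here is a toy C++ sketch. It is not the JIT's GenTree API: `Node`, `kFlagAsg`, `ReplaceWithExpansion`, and `RecomputeFlags` are hypothetical stand-ins. The idea shown is that several subtrees can be replaced first, leaving ancestor flags stale, and a single bottom-up walk over the statement then brings every summary flag up to date.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical flag meaning "this subtree contains an assignment".
constexpr std::uint32_t kFlagAsg = 0x1;

// Hypothetical, heavily simplified IR node.
struct Node
{
    std::uint32_t ownFlags = 0; // effects produced by this node itself
    std::uint32_t flags    = 0; // ownFlags plus the flags of all descendants
    std::vector<std::unique_ptr<Node>> children;
};

// Stand-in for expanding one array access: swaps an operand for a new subtree
// that contains an assignment. Ancestor summary flags are left stale on purpose.
void ReplaceWithExpansion(Node& parent, std::size_t operandIndex)
{
    auto expansion      = std::make_unique<Node>();
    expansion->ownFlags = kFlagAsg;
    parent.children.at(operandIndex) = std::move(expansion);
}

// One post-order walk that recomputes every summary flag in the statement tree.
std::uint32_t RecomputeFlags(Node& node)
{
    std::uint32_t result = node.ownFlags;
    for (auto& child : node.children)
    {
        result |= RecomputeFlags(*child);
    }
    node.flags = result;
    return result;
}
```

With this shape, a statement containing several expansions pays for one tree walk instead of one root-ward walk per expansion.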

BruceForstall · 2022-06-06T02:05:30Z

[edit] Added a 2nd run to validate. MDRomber regression evaporated. MDMulMatrix regression validated.

Some perf results from the MDBenchI/MDBenchF suite

Run 1

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Gen 0	Gen 1	Gen 2	Allocated	Alloc Ratio
MDInProd	Job-FIBCFY	baseline	2.256 s	0.1038 s	0.1195 s	2.271 s	2.028 s	2.486 s	1.00	0.00	1000.0000	1000.0000	1000.0000	11.22 MB	1.00
MDInProd	Job-ZOKFAR	diff	1.835 s	0.0759 s	0.0874 s	1.864 s	1.644 s	1.934 s	0.81	0.05	1000.0000	1000.0000	1000.0000	11.22 MB	1.00

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	Gen 0	Gen 1	Gen 2	Allocated	Alloc Ratio
MDInvMt	Job-FIBCFY	baseline	6.603 ms	0.1268 ms	0.1245 ms	6.578 ms	6.476 ms	6.942 ms	1.00	20.8333	20.8333	20.8333	102.57 KB	1.00
MDInvMt	Job-ZOKFAR	diff	3.033 ms	0.0582 ms	0.0670 ms	3.007 ms	2.944 ms	3.170 ms	0.46	25.0000	25.0000	25.0000	102.57 KB	1.00

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
MDLLoops	Job-FIBCFY	baseline	830.6 ms	16.21 ms	15.92 ms	827.8 ms	812.4 ms	867.4 ms	1.00	0.00	3.39 MB	1.00
MDLLoops	Job-ZOKFAR	diff	783.8 ms	15.43 ms	16.51 ms	781.7 ms	758.4 ms	816.7 ms	0.94	0.02	3.39 MB	1.00

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
MDRomber	Job-FIBCFY	baseline	665.3 ms	12.61 ms	11.80 ms	665.5 ms	646.5 ms	690.7 ms	1.00	0.00	2.44 KB	1.00
MDRomber	Job-ZOKFAR	diff	693.1 ms	33.63 ms	38.73 ms	687.3 ms	649.6 ms	783.0 ms	1.05	0.06	2.44 KB	1.00

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
MDSqMtx	Job-FIBCFY	baseline	930.3 ms	27.35 ms	31.50 ms	917.3 ms	899.3 ms	1,008.3 ms	1.00	0.00	26.81 KB	1.00
MDSqMtx	Job-ZOKFAR	diff	846.0 ms	20.63 ms	23.76 ms	839.7 ms	827.2 ms	931.6 ms	0.91	0.03	26.81 KB	1.00

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
MDAddArray2	Job-FIBCFY	baseline	19.70 ms	0.418 ms	0.482 ms	19.70 ms	18.87 ms	20.70 ms	1.00	0.00	37 B	1.00
MDAddArray2	Job-ZOKFAR	diff	16.47 ms	0.328 ms	0.378 ms	16.37 ms	16.04 ms	17.38 ms	0.84	0.03	32 B	0.86

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
MDArray2	Job-FIBCFY	baseline	1.676 s	0.0307 s	0.0287 s	1.667 s	1.635 s	1.738 s	1.00	0.00	8.38 KB	1.00
MDArray2	Job-ZOKFAR	diff	1.314 s	0.0209 s	0.0195 s	1.313 s	1.284 s	1.345 s	0.78	0.02	8.38 KB	1.00

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
MDGeneralArray	Job-FIBCFY	baseline	15.83 ms	0.404 ms	0.465 ms	15.68 ms	15.38 ms	17.38 ms	1.00	0.00	7.94 KB	1.00
MDGeneralArray	Job-ZOKFAR	diff	11.68 ms	0.225 ms	0.250 ms	11.58 ms	11.42 ms	12.29 ms	0.74	0.03	7.93 KB	1.00

MDGeneralArray2	Job-FIBCFY	baseline	15.75 ms	0.317 ms	0.365 ms	15.65 ms	15.30 ms	16.71 ms	1.00	0.00	8.02 KB	1.00
MDGeneralArray2	Job-ZOKFAR	diff	11.70 ms	0.293 ms	0.337 ms	11.58 ms	11.27 ms	12.66 ms	0.74	0.03	8.01 KB	1.00

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
MDLogicArray	Job-FIBCFY	baseline	450.0 ms	11.01 ms	12.68 ms	444.0 ms	435.9 ms	474.9 ms	1.00	0.00	10.67 KB	1.00
MDLogicArray	Job-ZOKFAR	diff	357.4 ms	14.81 ms	17.05 ms	350.8 ms	340.9 ms	391.1 ms	0.79	0.05	10.67 KB	1.00

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
MDMidpoint	Job-FIBCFY	baseline	717.7 ms	18.63 ms	21.45 ms	709.1 ms	688.2 ms	752.7 ms	1.00	0.00	39.62 KB	1.00
MDMidpoint	Job-ZOKFAR	diff	530.3 ms	8.37 ms	7.83 ms	530.7 ms	519.1 ms	543.5 ms	0.74	0.03	39.34 KB	0.99

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
MDMulMatrix	Job-FIBCFY	baseline	731.5 ms	14.19 ms	16.35 ms	727.8 ms	713.1 ms	775.2 ms	1.00	0.00	66.52 KB	1.00
MDMulMatrix	Job-ZOKFAR	diff	1,146.3 ms	18.69 ms	17.49 ms	1,147.0 ms	1,120.9 ms	1,181.0 ms	1.56	0.03	66.52 KB	1.00

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Gen 0	Allocated	Alloc Ratio
MDNDhrystone	Job-FIBCFY	baseline	563.5 ms	12.96 ms	14.92 ms	561.6 ms	547.1 ms	593.0 ms	1.00	0.00	147000.0000	587.47 MB	1.00
MDNDhrystone	Job-ZOKFAR	diff	570.5 ms	20.20 ms	23.26 ms	559.9 ms	548.1 ms	646.2 ms	1.01	0.06	147000.0000	587.47 MB	1.00

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
MDPuzzle	Job-FIBCFY	baseline	535.7 ms	18.27 ms	21.04 ms	525.3 ms	519.2 ms	592.1 ms	1.00	0.00	7.01 KB	1.00
MDPuzzle	Job-ZOKFAR	diff	479.7 ms	9.25 ms	9.09 ms	477.4 ms	469.5 ms	504.0 ms	0.90	0.02	7.01 KB	1.00

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
MDXposMatrix	Job-FIBCFY	baseline	53.18 μs	1.403 μs	1.615 μs	52.33 μs	51.84 μs	56.82 μs	1.00	0.00	-	NA
MDXposMatrix	Job-ZOKFAR	diff	31.67 μs	0.709 μs	0.816 μs	31.40 μs	30.72 μs	33.58 μs	0.60	0.02	-	NA

Run 2

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Gen 0	Gen 1	Gen 2	Allocated	Alloc Ratio
MDInProd	Job-WTGEOH	baseline	2.307 s	0.1936 s	0.2230 s	2.257 s	2.066 s	2.972 s	1.00	0.00	1000.0000	1000.0000	1000.0000	11.22 MB	1.00
MDInProd	Job-AKRICW	diff	1.842 s	0.0634 s	0.0730 s	1.856 s	1.738 s	1.981 s	0.80	0.08	1000.0000	1000.0000	1000.0000	11.22 MB	1.00

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Gen 0	Gen 1	Gen 2	Allocated	Alloc Ratio
MDInvMt	Job-WTGEOH	baseline	6.920 ms	0.7105 ms	0.8182 ms	6.569 ms	6.473 ms	9.745 ms	1.00	0.00	31.2500	31.2500	31.2500	102.58 KB	1.00
MDInvMt	Job-AKRICW	diff	3.128 ms	0.2399 ms	0.2762 ms	3.029 ms	2.946 ms	3.915 ms	0.45	0.03	31.2500	31.2500	31.2500	102.57 KB	1.00

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
MDLLoops	Job-WTGEOH	baseline	831.2 ms	16.45 ms	16.89 ms	825.2 ms	809.5 ms	871.0 ms	1.00	0.00	3.39 MB	1.00
MDLLoops	Job-AKRICW	diff	800.9 ms	16.13 ms	18.58 ms	796.9 ms	773.7 ms	844.0 ms	0.96	0.02	3.39 MB	1.00

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
MDRomber	Job-WTGEOH	baseline	649.2 ms	12.92 ms	13.82 ms	646.6 ms	631.2 ms	673.7 ms	1.00	0.00	2.11 KB	1.00
MDRomber	Job-AKRICW	diff	640.6 ms	11.07 ms	10.35 ms	636.2 ms	625.6 ms	662.4 ms	0.99	0.03	2.44 KB	1.16

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
MDSqMtx	Job-WTGEOH	baseline	925.1 ms	19.25 ms	22.17 ms	917.0 ms	900.9 ms	967.4 ms	1.00	0.00	26.53 KB	1.00
MDSqMtx	Job-AKRICW	diff	786.3 ms	15.58 ms	17.94 ms	785.1 ms	762.6 ms	820.6 ms	0.85	0.03	26.81 KB	1.01

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
MDAddArray2	Job-WTGEOH	baseline	19.58 ms	0.685 ms	0.789 ms	19.31 ms	18.89 ms	22.06 ms	1.00	0.00	60 B	1.00
MDAddArray2	Job-AKRICW	diff	18.67 ms	0.762 ms	0.877 ms	18.63 ms	15.61 ms	20.26 ms	0.96	0.07	24 B	0.40

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
MDArray2	Job-WTGEOH	baseline	1.679 s	0.0239 s	0.0223 s	1.677 s	1.652 s	1.735 s	1.00	0.00	8.38 KB	1.00
MDArray2	Job-AKRICW	diff	1.321 s	0.0231 s	0.0216 s	1.326 s	1.289 s	1.365 s	0.79	0.02	8.05 KB	0.96

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
MDGeneralArray	Job-WTGEOH	baseline	15.98 ms	0.597 ms	0.688 ms	15.74 ms	15.23 ms	17.64 ms	1.00	0.00	7.93 KB	1.00
MDGeneralArray	Job-AKRICW	diff	12.15 ms	0.801 ms	0.922 ms	11.81 ms	11.38 ms	14.62 ms	0.76	0.05	7.93 KB	1.00

MDGeneralArray2	Job-WTGEOH	baseline	15.83 ms	0.420 ms	0.483 ms	15.63 ms	15.30 ms	16.89 ms	1.00	0.00	8 KB	1.00
MDGeneralArray2	Job-AKRICW	diff	17.14 ms	0.490 ms	0.564 ms	17.00 ms	16.18 ms	18.42 ms	1.08	0.05	8.01 KB	1.00

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
MDLogicArray	Job-WTGEOH	baseline	452.2 ms	14.84 ms	17.09 ms	445.5 ms	437.4 ms	511.8 ms	1.00	0.00	10.67 KB	1.00
MDLogicArray	Job-AKRICW	diff	351.9 ms	17.49 ms	20.14 ms	345.2 ms	338.6 ms	428.7 ms	0.78	0.05	10.67 KB	1.00

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
MDMidpoint	Job-WTGEOH	baseline	707.4 ms	17.40 ms	20.04 ms	704.0 ms	682.4 ms	774.7 ms	1.00	0.00	39.29 KB	1.00
MDMidpoint	Job-AKRICW	diff	521.7 ms	22.16 ms	25.52 ms	510.1 ms	496.2 ms	593.2 ms	0.74	0.04	39.9 KB	1.02

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
MDMulMatrix	Job-WTGEOH	baseline	734.9 ms	14.08 ms	13.17 ms	731.8 ms	717.5 ms	768.4 ms	1.00	0.00	66.52 KB	1.00
MDMulMatrix	Job-AKRICW	diff	1,170.9 ms	36.15 ms	41.63 ms	1,169.4 ms	1,120.8 ms	1,316.6 ms	1.60	0.08	66.52 KB	1.00

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Gen 0	Allocated	Alloc Ratio
MDNDhrystone	Job-WTGEOH	baseline	592.8 ms	18.75 ms	21.59 ms	590.6 ms	556.5 ms	633.2 ms	1.00	0.00	147000.0000	587.47 MB	1.00
MDNDhrystone	Job-AKRICW	diff	599.5 ms	27.49 ms	31.66 ms	589.8 ms	566.8 ms	677.3 ms	1.01	0.08	147000.0000	587.47 MB	1.00

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
MDPuzzle	Job-WTGEOH	baseline	563.6 ms	27.75 ms	31.95 ms	558.1 ms	528.7 ms	644.7 ms	1.00	0.00	7.01 KB	1.00
MDPuzzle	Job-AKRICW	diff	506.1 ms	32.32 ms	37.22 ms	499.8 ms	467.4 ms	626.2 ms	0.90	0.07	6.68 KB	0.95

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
MDXposMatrix	Job-WTGEOH	baseline	56.09 μs	2.774 μs	3.195 μs	55.35 μs	52.20 μs	65.46 μs	1.00	0.00	4 B	1.00
MDXposMatrix	Job-AKRICW	diff	31.77 μs	0.650 μs	0.749 μs	31.35 μs	31.08 μs	33.46 μs	0.57	0.04	-	0.00

BruceForstall · 2022-06-06T02:08:46Z

As seen above, MDRomber has a small regression that could be investigated.

Almost all spmi asmdiffs are improvements. There are a few outlier regressions that should be investigated, all in decimaldiv:TestEntryPoint and related test code. One effect of the new expansion is the use of more temps. In this case, we go from 2952 to 6076 temps, so it's possible we go beyond our tracked optimization limits in a bad way.

BruceForstall · 2022-06-06T02:09:10Z

@AndyAyersMS @dotnet/jit-contrib PTAL

BruceForstall · 2022-06-06T02:09:56Z

/azp run runtime-coreclr outerloop, runtime-coreclr jitstress, runtime-coreclr libraries-jitstress

azure-pipelines · 2022-06-06T02:10:20Z

Azure Pipelines successfully started running 3 pipeline(s).

BruceForstall · 2022-06-06T05:06:52Z

/azp run runtime-coreclr gcstress0x3-gcstress0xc

azure-pipelines · 2022-06-06T05:07:08Z

Azure Pipelines successfully started running 1 pipeline(s).

SingleAccretion · 2022-06-06T09:56:20Z

src/coreclr/jit/morph.cpp

+// GT_ARR_ELEM nodes are morphed to appropriate trees. Note that MD array `Get`, `Set`, or `Address`
+// is imported as a call, and, if all required conditions are satisfied, is treated as an intrinsic
+// and replaced by IR nodes, especially GT_ARR_ELEM nodes, in impArrayAccessIntrinsic().


Not an immediate concern with this change, but I wonder if you have some ideas on how to approach adding VN support for this early expansion.

The SZ case utilizes a parser utility that tries to reconstruct whatever morph left; the (significantly) more complex MD trees look less amenable to that.

That's an excellent question, and needs more thought.

jakobbotsch · 2022-06-06T10:53:52Z

Any idea why the coreclr_tests.pmi.windows.x64.checked.mch TP impact is so high? Do we have some particularly crazy MD array tests there?

kunalspathak · 2022-06-06T14:09:57Z

Any idea why the coreclr_tests.pmi.windows.x64.checked.mch TP impact is so high

Yeah, I was looking at those too. Seems we have cases like
https://github.com/dotnet/runtime/blob/4881a639e7c3f27b5a8d2d160e234d8055333cda/src/tests/JIT/Methodical/divrem/div/r4div.cs that has high code size increase too.

AndyAyersMS · 2022-06-06T15:29:58Z

As seen above, MDRomber has a small regression that could be investigated.

MDMulMatrix also has a (big?) regression

BruceForstall · 2022-06-06T16:08:51Z

Any idea why the coreclr_tests.pmi.windows.x64.checked.mch TP impact is so high? Do we have some particularly crazy MD array tests there?

I need to investigate. Maybe related to the large size regressions in the cases I mentioned and Kunal pointed out. Over 7% on win-x64 is pretty extreme given that I doubt many tests even have MD arrays.

BruceForstall · 2022-06-06T20:42:35Z

Test failures are, AFAICT, all in baseline, or infra:

runtime

Installer Build and Test coreclr windows_x86 Debug

##[error].packages\microsoft.dotnet.arcade.sdk\7.0.0-beta.22266.1\tools\VSTest.targets(55,5): error MSB3491: (NETCORE_ENGINEERING_TELEMETRY=Build) Could not write lines to file "D:\a\_work\1\s\artifacts\log\Debug\Microsoft.NET.HostModel.ComHost.Tests_net7.0_x86.log". The process cannot access the file 'D:\a\_work\1\s\artifacts\log\Debug\Microsoft.NET.HostModel.ComHost.Tests_net7.0_x86.log' because it is being used by another process.

runtime-coreclr gcstress0x3-gcstress0xc

runtime-coreclr jitstress

Assertion failed '(constIndexOffset % elemSize) == 0' #67870

runtime-coreclr libraries-jitstress

AndyAyersMS

Looks good overall.

You might consider keeping a temp cache in fgMorphArrayOps and marking all temps as not in use after each statement.

We do this in other places (e.g., for struct arg passing, and importer box temps) to try and keep the total number of temps reasonable.

SSA will see these recycled temps as having many distinct lifetimes so it should not inhibit opts.
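
A minimal sketch of the temp-cache idea, for illustration only. `TempCache` and its members are hypothetical names, not the code in fgMorphArrayOps, and the sketch assumes all cached temps have the same type so that any of them can be reused for any request.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical temp cache: hands out temp numbers and recycles them per statement.
class TempCache
{
public:
    // Returns a temp not yet handed out for the current statement, allocating
    // a brand-new temp number only when every cached temp is already in use.
    unsigned Grab()
    {
        if (m_nextUnused == m_temps.size())
        {
            m_temps.push_back(m_nextTempNum++);
        }
        return m_temps[m_nextUnused++];
    }

    // Marks every cached temp as not in use; intended to run after each statement.
    void Reset()
    {
        m_nextUnused = 0;
    }

    // Total number of distinct temps ever allocated.
    std::size_t TotalAllocated() const
    {
        return m_temps.size();
    }

private:
    std::vector<unsigned> m_temps;           // temp numbers owned by the cache
    std::size_t           m_nextUnused  = 0; // index of the first temp not in use
    unsigned              m_nextTempNum = 0; // next fresh temp number to allocate
};
```

With this scheme the total temp count is bounded by the largest number of temps any single statement needs, rather than growing with every expanded array access.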

AndyAyersMS · 2022-06-06T23:31:34Z

src/coreclr/jit/importer.cpp

+            // This is only enabled when early MD expansion is set because it causes small
+            // asm diffs (only in some test cases) otherwise. The GT_ARR_ELEM lowering code "accidentally" does
+            // this cast, but the new code requires it to be explicit.
+            argVal = impImplicitIorI4Cast(argVal, TYP_INT);


Seems dangerous to only add the cast under DEBUG.

Oops; I moved the "enabling" code from DEBUG to all-flavor last-minute but didn't update this.

AndyAyersMS · 2022-06-06T23:37:04Z

src/coreclr/jit/morph.cpp

+            // TODO: morph here? Or morph at the statement level if there are differences?
+
+            JITDUMP("fgMorphArrayOpsStmt (before remorph):\n");
+            DISPTREE(fullExpansion);
+
+            GenTree* morphedTree = m_compiler->fgMorphTree(fullExpansion);
+            DBEXEC(morphedTree != fullExpansion, morphedTree->gtDebugFlags &= ~GTF_DEBUG_NODE_MORPHED);
+
+            JITDUMP("fgMorphArrayOpsStmt (after remorph):\n");
+            DISPTREE(morphedTree);
+
+            *use = morphedTree;
+            JITDUMP("Morphing GT_ARR_ELEM (after)\n");
+            DISPTREE(*use);


I guess I'd just do it once, at the end, otherwise if there are multiple MD array accesses you are walking to the root multiple times.

AndyAyersMS · 2022-06-06T23:38:43Z

src/coreclr/jit/morph.cpp

+            for (unsigned i = 0; i < arrElem->gtArrRank; i++)
+            {
+                GenTree* idx = arrElem->gtArrInds[i];
+                if ((idx->gtFlags & GTF_ALL_EFFECT) == 0)


If idx is side-effect free but nontrivial you will want to use a temp too, otherwise you might duplicate a lot of stuff and force CSE to clean up after you.

In the case without a temp, I just use the idx tree directly, so there is no copy. Reminds me that I should DEBUG_DESTROY_NODE the GT_ARR_ELEM node.

Ah, they are single use, it's the effective index that is multiple use.

AndyAyersMS

Changes LGTM.

We should verify this temp cache fixes the TP issues. Not sure if you did this locally, but I can't tell this yet from CI as you have merge conflicts to resolve.

BruceForstall · 2022-06-15T20:45:00Z

We should verify this temp cache fixes the TP issues. Not sure if you did this locally, but I can't tell this yet from CI as you have merge conflicts to resolve.

The temp cache improved CQ of MDMulMatrix (and maybe others), but it's still slower than before. So, I still need to investigate MDMulMatrix CQ, check the big asm diff regressions for the TestEntryPoint cases (I'm guessing the temp cache helped), and check TP.

BruceForstall · 2022-06-16T01:43:07Z

MDMulMatrix:

This test has 1 doubly-nested and 6 triply-nested loops. If I split out all the loops to individual benchmarks, they are up to 25% better with this change (one has no perf change: the "lkj" loop). If I instead vary the number of loop nests kept in the function, this change is slower once there are 3 or more, up to 1.5x the baseline. It seems there must be issues with the quantity of IR/temps.

[Edit]

Various MDMulMatrix subset perf runs

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated	Alloc Ratio
MDMulMatrix	Job-SGUYGM	base	782.6 ms	17.72 ms	20.40 ms	775.1 ms	763.9 ms	836.6 ms	1.00	0.00	66.52 KB	1.00
MDMulMatrix	Job-KNFIHE	diff	954.6 ms	23.39 ms	26.93 ms	943.8 ms	925.1 ms	1,017.0 ms	1.22	0.04	66.52 KB	1.00
MDMulMatrix1	Job-SGUYGM	base	2.926 ms	0.0551 ms	0.0590 ms	2.911 ms	2.856 ms	3.064 ms	1.00	0.00	66.05 KB	1.00
MDMulMatrix1	Job-KNFIHE	diff	2.243 ms	0.0476 ms	0.0548 ms	2.227 ms	2.195 ms	2.427 ms	0.77	0.02	66.05 KB	1.00
MDMulMatrix2	Job-SGUYGM	base	147.9 ms	3.41 ms	3.93 ms	146.7 ms	144.6 ms	161.4 ms	1.00	0.00	66.28 KB	1.00
MDMulMatrix2	Job-KNFIHE	diff	136.4 ms	4.41 ms	5.08 ms	134.8 ms	131.8 ms	150.1 ms	0.92	0.04	66.28 KB	1.00
MDMulMatrix3	Job-SGUYGM	base	243.6 ms	6.67 ms	7.68 ms	242.6 ms	233.9 ms	266.5 ms	1.00	0.00	66.52 KB	1.00
MDMulMatrix3	Job-KNFIHE	diff	269.1 ms	4.31 ms	4.04 ms	269.7 ms	262.6 ms	276.2 ms	1.10	0.03	66.52 KB	1.00
MDMulMatrix4	Job-SGUYGM	base	351.2 ms	12.92 ms	14.88 ms	344.9 ms	337.5 ms	397.9 ms	1.00	0.00	66.52 KB	1.00
MDMulMatrix4	Job-KNFIHE	diff	537.5 ms	10.77 ms	12.40 ms	532.8 ms	521.3 ms	562.8 ms	1.53	0.08	66.52 KB	1.00
MDMulMatrix5	Job-SGUYGM	base	493.7 ms	7.41 ms	6.93 ms	494.3 ms	482.5 ms	504.8 ms	1.00	0.00	66.52 KB	1.00
MDMulMatrix5	Job-KNFIHE	diff	673.4 ms	13.15 ms	13.50 ms	673.4 ms	659.0 ms	704.0 ms	1.36	0.03	66.52 KB	1.00
MDMulMatrix6	Job-SGUYGM	base	641.1 ms	12.82 ms	13.16 ms	636.8 ms	623.9 ms	669.8 ms	1.00	0.00	66.52 KB	1.00
MDMulMatrix6	Job-KNFIHE	diff	911.3 ms	41.99 ms	48.35 ms	901.0 ms	885.4 ms	1,109.2 ms	1.42	0.09	66.52 KB	1.00
MDMulMatrix_jkl	Job-SGUYGM	base	156.1 ms	3.70 ms	4.27 ms	154.4 ms	152.5 ms	170.0 ms	1.00	0.00	66.28 KB	1.00
MDMulMatrix_jkl	Job-KNFIHE	diff	124.3 ms	3.73 ms	4.29 ms	122.2 ms	121.0 ms	137.5 ms	0.80	0.04	66.28 KB	1.00
MDMulMatrix_jlk	Job-SGUYGM	base	156.0 ms	3.02 ms	3.36 ms	154.1 ms	151.4 ms	163.7 ms	1.00	0.00	66.28 KB	1.00
MDMulMatrix_jlk	Job-KNFIHE	diff	122.7 ms	2.41 ms	2.25 ms	122.2 ms	119.7 ms	126.6 ms	0.79	0.03	66.28 KB	1.00
MDMulMatrix_kjl	Job-SGUYGM	base	145.9 ms	4.11 ms	4.74 ms	143.4 ms	140.9 ms	155.8 ms	1.00	0.00	66.28 KB	1.00
MDMulMatrix_kjl	Job-KNFIHE	diff	123.8 ms	2.73 ms	3.14 ms	121.9 ms	120.8 ms	129.2 ms	0.85	0.04	66.28 KB	1.00
MDMulMatrix_klj	Job-SGUYGM	base	145.7 ms	3.81 ms	4.39 ms	144.5 ms	140.5 ms	156.8 ms	1.00	0.00	66.28 KB	1.00
MDMulMatrix_klj	Job-KNFIHE	diff	134.5 ms	2.65 ms	2.94 ms	133.3 ms	130.8 ms	139.7 ms	0.93	0.04	66.28 KB	1.00
MDMulMatrix_ljk	Job-SGUYGM	base	145.0 ms	3.27 ms	3.76 ms	143.4 ms	141.5 ms	153.8 ms	1.00	0.00	66.28 KB	1.00
MDMulMatrix_ljk	Job-KNFIHE	diff	133.6 ms	2.59 ms	2.98 ms	132.2 ms	131.4 ms	142.8 ms	0.92	0.03	66.28 KB	1.00
MDMulMatrix_lkj	Job-SGUYGM	base	145.2 ms	2.78 ms	2.60 ms	145.4 ms	140.4 ms	151.8 ms	1.00	0.00	66.28 KB	1.00
MDMulMatrix_lkj	Job-KNFIHE	diff	146.0 ms	4.48 ms	5.15 ms	144.3 ms	141.0 ms	159.7 ms	1.01	0.03	66.28 KB	1.00

BruceForstall · 2022-06-16T01:58:36Z

Size regressions:

With the temp cache implemented, all the TestEntryPoint regressions noted above become improvements:

Top method improvements (bytes):
      -30527 (-14.25% of base) : 248173.dasm - decimaldiv:TestEntryPoint():int
      -30527 (-14.25% of base) : 248183.dasm - decimalrem:TestEntryPoint():int
      -28782 (-21.19% of base) : 7577.dasm - i4rem:TestEntryPoint():int
      -27451 (-20.25% of base) : 7597.dasm - i8rem:TestEntryPoint():int
      -27055 (-19.35% of base) : 7660.dasm - u8rem:TestEntryPoint():int
      -26583 (-20.57% of base) : 7587.dasm - i8div:TestEntryPoint():int
      -26188 (-20.50% of base) : 7567.dasm - i4div:TestEntryPoint():int
      -26144 (-19.83% of base) : 7650.dasm - u8div:TestEntryPoint():int
      -25783 (-21.23% of base) : 7628.dasm - r8div:TestEntryPoint():int
      -25599 (-20.84% of base) : 7607.dasm - r4div:TestEntryPoint():int
      -22808 (-18.07% of base) : 7638.dasm - r8rem:TestEntryPoint():int
      -22729 (-17.78% of base) : 7618.dasm - r4rem:TestEntryPoint():int
       -7138 (-26.43% of base) : 13480.dasm - r4NaNdiv:TestEntryPoint():int
       -6733 (-25.26% of base) : 13485.dasm - r8NaNdiv:TestEntryPoint():int
       -6733 (-25.26% of base) : 13206.dasm - r8NaNdiv:TestEntryPoint():int
       -6664 (-24.78% of base) : 13153.dasm - r4NaNadd:TestEntryPoint():int
       -6588 (-24.39% of base) : 13163.dasm - r4NaNdiv:TestEntryPoint():int
       -6549 (-24.92% of base) : 13483.dasm - r4NaNsub:TestEntryPoint():int
       -6407 (-24.56% of base) : 13481.dasm - r4NaNmul:TestEntryPoint():int
       -6378 (-23.41% of base) : 13171.dasm - r4NaNmul:TestEntryPoint():int

Also, the improvements far outweigh the regressions, e.g., for coreclr_tests: Total bytes of delta: -586084 (-0.48 % of base).

MDMulMatrix is an outlier for how large the size regression is: 652 (44.81% of base) : 32267.dasm - Benchstone.MDBenchI.MDMulMatrix:Inner(System.Int32[,],System.Int32[,],System.Int32[,])

BruceForstall · 2022-06-16T22:16:16Z

Diffs: https://dev.azure.com/dnceng/public/_build/results?buildId=1829149&view=ms.vss-build-web.run-extensions-tab

Throughput (TP) diffs still show a significant regression on the coreclr_tests SPMI run, but that's probably just because that's where we actually have MD array accesses and the most asm code diffs. E.g., for win-x64:

code size: Total bytes of delta: -586084 (-0.48 % of base)
TP: coreclr_tests.pmi.windows.x64.checked.mch +6.22%

jakobbotsch · 2022-06-17T22:42:54Z

One thing you can consider is hacking SPMI to produce a table of functions with the number of instructions executed for each context. That might help narrow down whether the regression is expected.
We already have the number of instructions executed on a per-method basis, so it should not be too hard. For example, a stupid and simple thing would be to just print baseMetrics.NumExecutedInstructions here:

runtime/src/coreclr/tools/superpmi/superpmi/superpmi.cpp

Line 374 in 0d0fd75

           LogDebug("Method %d compiled in %fms, result %d", reader->GetMethodContextIndex(), st3.GetMilliseconds(), res);

and diffMetrics.NumExecutedInstructions here:

runtime/src/coreclr/tools/superpmi/superpmi/superpmi.cpp

Lines 396 to 397 in 0d0fd75

           LogDebug("Method %d compiled by JIT2 in %fms, result %d", reader->GetMethodContextIndex(),
                    st4.GetMilliseconds(), res2);

and then post process this into something.
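
A rough sketch of what that post-processing could look like. It assumes the two LogDebug calls quoted above were extended to append `, instructions <N>` to each line; that suffix is hypothetical and not something superpmi prints today. The function names and the sample log lines are illustrative only.

```python
import re
from collections import defaultdict

# ASSUMPTION: each debug line ends with ", instructions <N>". This suffix is
# hypothetical; the stock format strings stop after "result %d".
LINE_RE = re.compile(
    r"Method (?P<ctx>\d+) compiled (?P<jit2>by JIT2 )?in [\d.]+ms, "
    r"result -?\d+, instructions (?P<insns>\d+)"
)


def parse_log(lines):
    """Collect {context index: {"base": N, "diff": N}} from debug log lines."""
    table = defaultdict(dict)
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # not one of the two per-method lines
        side = "diff" if m.group("jit2") else "base"
        table[int(m.group("ctx"))][side] = int(m.group("insns"))
    return table


def largest_regressions(table, top=10):
    """Return (ctx, base, diff, relative delta) rows, worst regression first."""
    rows = []
    for ctx, counts in table.items():
        if "base" in counts and "diff" in counts and counts["base"] > 0:
            delta = counts["diff"] - counts["base"]
            rows.append((ctx, counts["base"], counts["diff"], delta / counts["base"]))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows[:top]


# Made-up sample input, only to show the shape of the data.
sample = [
    "Method 1 compiled in 1.250000ms, result 0, instructions 1000",
    "Method 1 compiled by JIT2 in 1.300000ms, result 0, instructions 1100",
    "Method 2 compiled in 0.500000ms, result 0, instructions 2000",
    "Method 2 compiled by JIT2 in 0.480000ms, result 0, instructions 1900",
]
rows = largest_regressions(parse_log(sample))
for ctx, base, diff, rel in rows:
    print(f"{ctx}\t{base}\t{diff}\t{rel:+.2%}")
```

Sorting by relative rather than absolute delta is a judgment call; it surfaces small methods whose compile cost grew disproportionately, which is one plausible way to check whether the contexts with MD array accesses account for the TP regression.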

BruceForstall · 2022-06-17T22:49:48Z

> One thing you can consider

Thanks for the suggestion. I was going to figure out which method contexts are affected (at all) by my change, extract them into a separate mch, and then use JitTimeLogCsv. A PerfView diff of an SPMI replay (of the full coreclr_tests collection) with and without my change shows a significant increase in fgMorphSmpOp and GenTreeVisitor::WalkTree, so I should look at total IR size before and after as well.

jakobbotsch · 2022-06-18T06:49:13Z

> JitTimeLogCsv

Didn't know about this; it looks useful. Maybe I should add the precise instruction count data to this mechanism too.

Configuration variables:

1. `COMPlus_JitEarlyExpandMDArrays`. Set to zero to disable early MD expansion. Default is 1 (enabled).
2. If `COMPlus_JitEarlyExpandMDArrays=0`, use `COMPlus_JitEarlyExpandMDArraysFilter` to selectively enable early MD expansion for a method set (e.g., syntax like `JitDump`).
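
A small sketch of how a test script might build the environment for these two variables before launching a process. Only the two variable names come from the description above; the helper name, its parameters, and the `"Inner"` filter value are illustrative assumptions.

```python
import os


def md_expansion_env(enabled=True, method_filter=None, base_env=None):
    """Return a copy of the environment with the MD expansion knobs set.

    enabled=False disables early MD expansion globally. method_filter is
    only meaningful when enabled=False, matching the description above:
    it re-enables the expansion for the named method set.
    """
    if enabled and method_filter is not None:
        raise ValueError("the filter is described as applying only when expansion is disabled")
    env = dict(os.environ if base_env is None else base_env)
    env["COMPlus_JitEarlyExpandMDArrays"] = "1" if enabled else "0"
    if method_filter is not None:
        env["COMPlus_JitEarlyExpandMDArraysFilter"] = method_filter
    return env


# Hypothetical use: disable globally, then re-enable for a method named "Inner".
selective = md_expansion_env(enabled=False, method_filter="Inner", base_env={})
print(selective)
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when launching the program under test.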

BruceForstall · 2022-07-02T01:14:51Z

/azp run runtime-coreclr outerloop, runtime-coreclr jitstress, runtime-coreclr libraries-jitstress

azure-pipelines · 2022-07-02T01:15:14Z

Azure Pipelines successfully started running 3 pipeline(s).

EgorBo · 2022-07-12T18:40:18Z

Improvements on Linux-x64 dotnet/perf-autofiling-issues#6721

dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Jun 6, 2022

ghost assigned BruceForstall Jun 6, 2022

BruceForstall requested a review from AndyAyersMS June 6, 2022 02:09

SingleAccretion reviewed Jun 6, 2022

AndyAyersMS reviewed Jun 6, 2022

AndyAyersMS approved these changes Jun 15, 2022

BruceForstall force-pushed the OptimizeMultiDimensionalArrays branch from aca4585 to 2c63c4b Compare June 16, 2022 01:44

BruceForstall closed this Jun 16, 2022

BruceForstall reopened this Jun 16, 2022

BruceForstall added 5 commits July 1, 2022 17:14

Add comments

40019df

Formatting

92c73fe

Code review feedback; add temp cache

7a14656

Fix build/merge

a0d0812

BruceForstall force-pushed the OptimizeMultiDimensionalArrays branch 2 times, most recently from b2203da to a0d0812 Compare July 2, 2022 00:32

Comments

0f7836c

BruceForstall merged commit cc0ccbe into dotnet:main Jul 5, 2022

BruceForstall deleted the OptimizeMultiDimensionalArrays branch July 5, 2022 20:49

DrewScoggins mentioned this pull request Jul 12, 2022

[Perf] Changes at 7/5/2022 10:32:06 PM #72030

Closed

AndyAyersMS mentioned this pull request Jul 20, 2022

[Perf] Changes at 7/6/2022 12:55:46 AM dotnet/perf-autofiling-issues#6759

Closed

This was referenced Jul 20, 2022

[Perf] Changes at 7/6/2022 12:55:46 AM dotnet/perf-autofiling-issues#6630

Closed

[Perf] Changes at 7/6/2022 12:55:46 AM dotnet/perf-autofiling-issues#6632

Closed

[Perf] Changes at 7/6/2022 12:55:46 AM dotnet/perf-autofiling-issues#6636

Closed

This was referenced Jul 21, 2022

[Perf] Changes at 7/5/2022 10:32:06 PM dotnet/perf-autofiling-issues#6733

Closed

[Perf] Changes at 7/5/2022 10:32:06 PM dotnet/perf-autofiling-issues#6713

Closed

JulieLeeMSFT mentioned this pull request Jul 28, 2022

What's new in .NET 7 Preview 7 [WIP] dotnet/core#7455

Closed

kunalspathak mentioned this pull request Aug 1, 2022

Regression from hoisting out of nested loop #71059

Closed

ghost locked as resolved and limited conversation to collaborators Aug 11, 2022
