RyuJIT: Improving multi-dimensional array accesses #60785
Tagging subscribers to this area: @JulieLeeMSFT
cc @dotnet/jit-contrib
It should be possible to do an even better job by expanding them as JIT intrinsics when you know the actual array type. I guess they will need to be JIT intrinsics anyway so that they can be recognized by the optimizations.
If the JIT knows it's an MD array, then we can eliminate the code in ... Interestingly, `GetLowerBound`/`GetUpperBound` would still need the ... I actually presume that most coders don't use these functions, and assume zero-based dimensions. I would guess they still use ...
If the JIT knows the actual array type (which should always be the case in the situations we want to optimize), it can substitute the rank with a constant. It does not need to fetch it from the MethodTable.
To-do
Is an MD array stored as a single contiguous memory block? If so, can we use the col/row-major addressing formula (which takes care of non-zero lower bounds) to linearize the problem during LIH?
Yes, it's contiguous. The JIT does use essentially the referenced formula for constructing the index.
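For concreteness, the row-major formula in question (it also appears verbatim in the change notes quoted later in this thread) is, for a 2-d access `a[i, j]`; the names `dataOffset` and `elemSize` are used here for illustration:

```
linearIndex = (i - a.GetLowerBound(0)) * a.GetLength(1) + (j - a.GetLowerBound(1))
address     = &a + dataOffset + elemSize * linearIndex
```

The subtraction of the lower bound in each dimension is what takes care of non-zero lower bounds.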
Proposal
`GT_ARR_INDEX` and `GT_ARR_OFFSET` will be removed. New nodes to be added:
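Per the change description quoted later in this thread, the new nodes are the per-dimension bound and length accessors:

```
GT_MDARR_LOWER_BOUND(dim)  // lower bound of dimension 'dim' of an MD array object
GT_MDARR_LENGTH(dim)       // length (element count) of dimension 'dim'
```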
Nodes to be generalized: ...

For a 2-d access `a[i, j]`, we would generate:
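As a sketch (this mirrors the expansion given in the change description quoted below), the generated address computation would be:

```
&a + offset + elemSize * ((i - a.GetLowerBound(0)) * a.GetLength(1)
                        + (j - a.GetLowerBound(1)))
```

plus bounds checks of each effective index against the corresponding dimension length.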
Note that many things in this computation are invariant or could be optimized. E.g., we could hoist invariant effective index calculations and bounds checks.

Loop cloning should be augmented to support MD arrays. One additional cloning condition should be added that is specific to MD arrays: checking that the lower bound of each dimension is zero (which is expected to be the case most of the time; we will only optimize those cases via cloning). With that optimization, and if cloning removed bounds checks, we would see:
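With zero lower bounds and the bounds checks removed by cloning, the computation above reduces to something like:

```
&a + offset + elemSize * (i * a.GetLength(1) + j)
```

where the `a.GetLength(1)` fetch is invariant on the array object and hoistable out of the loop nest.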
This computation only computes the linear index. We also need to scale it by the array element size and add the array data offset to access the data itself. Thus, for ...
On xarch, this might take the form of an LEA; on arm64, there might be more computations. E.g., we might want to hoist ...

@AndyAyersMS @dotnet/jit-contrib Comments?
Seems reasonable. The full address expression is something like:
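(A sketch based on the expansion above, using `lb0`, `len1`, etc. as shorthand for the per-dimension lower bound and length fields:)

```
a + offset + elemSize * ((i - a.lb0) * a.len1 + (j - a.lb1))
```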
I wonder if we should instead try and produce something like:
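Presumably a distributed form, separating the `i`-dependent and `j`-dependent parts (again a sketch with the same shorthand names):

```
a + offset + (elemSize * a.len1) * (i - a.lb0) + elemSize * (j - a.lb1)
```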
with the expectation that hoisting will turn this into:
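That is (with a hypothetical temp `t` for the inner-loop-invariant part):

```
t = a + offset + (elemSize * a.len1) * (i - a.lb0)   // hoisted out of the inner loop
...
t + elemSize * (j - a.lb1)                           // computed per inner-loop iteration
```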
(we'd need to be careful to ensure that ...)
Expressing the addressing as suggested makes sense, to expose more loop-invariant parts. (Maybe it would be nice if we had an expression optimizer that could automatically do that distribution of multiplication across loop-invariant and non-invariant parts, but perhaps it would be hard, and the benefit difficult to measure.) Since in your example ... can't be split to ... because the definition of ...
In the example above, leaving byref creation to the inner loop and only hoisting the loop-invariant index/offset calculation:
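A sketch of that shape (hypothetical temp `u`; only integer index/offset arithmetic is hoisted, and the byref is formed by the final addition of `a` inside the inner loop):

```
u = offset + (elemSize * a.len1) * (i - a.lb0)   // hoisted: loop-invariant offset part
...
a + (u + elemSize * (j - a.lb1))                 // byref created in the inner loop
```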
This might be better than what my tree above might generate, probably:
We need to compute ... In that case, we could have:
Considering a 3-d access such as `a[i, j, k]`:
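Extending the same expansion to three dimensions gives the nested form (shorthand names as before):

```
a + offset + elemSize * (((i - lb0) * len1 + (j - lb1)) * len2 + (k - lb2))
```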
or:
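the fully distributed form:

```
a + offset + (elemSize * len1 * len2) * (i - lb0)
           + (elemSize * len2) * (j - lb1)
           +  elemSize * (k - lb2)
```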
Here, there are three multiplications by `elemSize`.
Note that all the expressions create legal byrefs, since the indexes are non-negative and within range (already bounds checked).
It will be interesting to see how this plays out with #68061.
Currently, multi-dimensional (MD) array access operations are treated as opaque to most of the JIT; they pass through the optimization pipeline untouched. Lowering expands the `GT_ARR_ELEM` node (representing an `a[i,j]` operation, for example) to `GT_ARR_OFFSET` and `GT_ARR_INDEX` trees, to expand the register requirements of the operation. These are then directly used to generate code.

This change moves the expansion of `GT_ARR_ELEM` to a new pass that follows loop optimization but precedes Value Numbering, CSE, and the rest of the optimizer. This placement allows for future improvement to loop cloning to support cloning loops with MD references, while still allowing the optimizer to kick in on the new expansion. One nice feature of this change: there is no machine-dependent code required; all the nodes get lowered to machine-independent nodes before code generation.

The MDBenchI and MDBenchF micro-benchmarks (very targeted to this work) improve about 10% to 60%, but there is one significant CQ regression in MDMulMatrix of over 20%. Future loop cloning, CSE, and/or LSRA work will be needed to get that back.

In this change, `GT_ARR_ELEM` nodes are morphed to appropriate trees. Note that an MD array `Get`, `Set`, or `Address` operation is imported as a call, and, if all required conditions are satisfied, is treated as an intrinsic and replaced by IR nodes, especially `GT_ARR_ELEM` nodes, in `impArrayAccessIntrinsic()`.

For example, a simple 2-dimensional array access like `a[i,j]` looks like:

```
\--* ARR_ELEM[,] byref
   +--* LCL_VAR ref V00 arg0
   +--* LCL_VAR int V01 arg1
   \--* LCL_VAR int V02 arg2
```

This is replaced by:

```
&a + offset + elemSize * ((i - a.GetLowerBound(0)) * a.GetLength(1) + (j - a.GetLowerBound(1)))
```

plus the appropriate `i` and `j` bounds checks. In IR, this is:

```
* ADD byref
+--* ADD long
| +--* MUL long
| | +--* CAST long <- uint
| | | \--* ADD int
| | | +--* MUL int
| | | | +--* COMMA int
| | | | | +--* ASG int
| | | | | | +--* LCL_VAR int V04 tmp1
| | | | | | \--* SUB int
| | | | | | +--* LCL_VAR int V01 arg1
| | | | | | \--* MDARR_LOWER_BOUND int (0)
| | | | | | \--* LCL_VAR ref V00 arg0
| | | | | \--* COMMA int
| | | | | +--* BOUNDS_CHECK_Rng void
| | | | | | +--* LCL_VAR int V04 tmp1
| | | | | | \--* MDARR_LENGTH int (0)
| | | | | | \--* LCL_VAR ref V00 arg0
| | | | | \--* LCL_VAR int V04 tmp1
| | | | \--* MDARR_LENGTH int (1)
| | | | \--* LCL_VAR ref V00 arg0
| | | \--* COMMA int
| | | +--* ASG int
| | | | +--* LCL_VAR int V05 tmp2
| | | | \--* SUB int
| | | | +--* LCL_VAR int V02 arg2
| | | | \--* MDARR_LOWER_BOUND int (1)
| | | | \--* LCL_VAR ref V00 arg0
| | | \--* COMMA int
| | | +--* BOUNDS_CHECK_Rng void
| | | | +--* LCL_VAR int V05 tmp2
| | | | \--* MDARR_LENGTH int (1)
| | | | \--* LCL_VAR ref V00 arg0
| | | \--* LCL_VAR int V05 tmp2
| | \--* CNS_INT long 4
| \--* CNS_INT long 32
\--* LCL_VAR ref V00 arg0
```

before being morphed by the usual morph transformations.

Some things to consider:

1. MD arrays have both a lower bound and length for each dimension (even if very few MD arrays actually have a non-zero lower bound).
2. The new `GT_MDARR_LOWER_BOUND(dim)` node represents the lower-bound value for a particular array dimension. The "effective index" for a dimension is the index minus the lower bound.
3. The new `GT_MDARR_LENGTH(dim)` node represents the length value (number of elements in a dimension) for a particular array dimension.
4. The effective index is bounds checked against the dimension length.
5. The lower bound and length values are 32-bit signed integers (`TYP_INT`).
6. After constructing a "linearized index", the index is scaled by the array element size, and the offset from the array object to the beginning of the array data is added.
7. Much of the complexity above is simply to assign temps to the various values that are used subsequently.
8. The index expressions are used exactly once. However, if they have side effects, they need to be copied, early, to preserve exception ordering.
9. Only the top-level operation adds the array object to the scaled, linearized index, to create the final address `byref`. As usual, we need to be careful to not create an illegal byref by adding any partial index calculation.
10. To avoid doing unnecessary work, the importer sets the global `OMF_HAS_MDARRAYREF` flag if there are any MD array expressions to expand. Also, the block flag `BBF_HAS_MDARRAYREF` is set on blocks where these exist, so only those blocks are processed.

Remaining work:

1. Implement `optEarlyProp` support for MD arrays.
2. Implement loop cloning support for MD arrays.
3. (optionally) Remove old `GT_ARR_OFFSET` and `GT_ARR_INDEX` nodes and related code, as well as `GT_ARR_ELEM` code used after the new expansion.
4. Implement improvements in CSE and LSRA to improve codegen for the MDMulMatrix benchmark.

The new early expansion is enabled by default. It can be disabled (even in Release, currently) by setting `COMPlus_JitEarlyExpandMDArrays=0`. If disabled, it can be selectively enabled using `COMPlus_JitEarlyExpandMDArraysFilter=<method_set>` (e.g., as specified for `JitDump`).

Fixes #60785.
It is well known that RyuJIT-generated multi-dimensional (hereafter, MD) array access code is inefficient. Specifically, true MD array access, not "jagged" array access. The properly `theme:md-arrays` tagged issues track this; for example, #8569 and #5481.
Here's a simple example of a function copying a 3-dimensional array, assuming zero-based indices and a length in each dimension of `n`:
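(The example function itself, reconstructed to match the description; the names `a` and `b` and the loop order are assumptions:)

```csharp
static void Copy3D(int[,,] a, int[,,] b, int n)
{
    // Copy a 3-dimensional array, assuming zero-based indices
    // and a length of n in each dimension.
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                b[i, j, k] = a[i, j, k];
}
```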
The x64 code generated for the inner loop is currently:
Things to note:

- The bounds checks for the outer indices (`i` and `j`) should be hoisted out of the inner `k` loop. The other 2 could be removed by loop cloning.
- The address of the `k` dimension data (a byref into the array) could be hoisted out of the inner loop.

Thus, we should be able to generate something like this in the inner loop:
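Conceptually (a hypothetical C#-level sketch rather than actual x64), the inner loop should reduce to a load, a store, and an induction-variable update, with the row base addresses hoisted:

```csharp
static unsafe void CopyRow(int* srcRow, int* dstRow, int n)
{
    // srcRow/dstRow: addresses of the current (i, j) row,
    // computed once, outside this loop.
    for (int k = 0; k < n; k++)
        dstRow[k] = srcRow[k];
}
```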
Currently, after importation, the array access looks like:
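That is, a `GT_ARR_ELEM` node taking the array object and one index operand per dimension; for the 3-d example, a dump along the lines of the 2-d one shown later in this thread:

```
\--* ARR_ELEM[,,] byref
   +--* LCL_VAR ref V00 arg0
   +--* LCL_VAR int V01 arg1
   +--* LCL_VAR int V02 arg2
   \--* LCL_VAR int V03 arg3
```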
This form persists until lowering, when it is expanded to:
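Roughly (a schematic, not a verbatim dump), lowering chains per-dimension `ARR_INDEX` and `ARR_OFFSET` nodes, where `ARR_OFFSET` computes `prevOffset * length(dim) + effectiveIndex`:

```
effIdx0 = ARR_INDEX(a, i, dim 0)                          // bounds-checked effective index
off1    = ARR_OFFSET(effIdx0, ARR_INDEX(a, j, dim 1), a)  // off1 = effIdx0 * len1 + effIdx1
off2    = ARR_OFFSET(off1, ARR_INDEX(a, k, dim 2), a)     // off2 = off1 * len2 + effIdx2
addr    = a + dataOffset + elemSize * off2
```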
There are several things to note here:

- The `ARR_INDEX` node both creates an "effective index" (the program index minus the dimension's lower bound) and performs a bounds check. This means the bounds check can't be eliminated or hoisted (assuming any optimization was done on this form), as the `ARR_INDEX` node is still required to create the effective index value.
- In addition, not shown here, `System.Array` has a number of methods used on multi-dimensional arrays that are not well optimized, e.g., are not all inlined or treated as invariant. Specifically: `Rank`, `GetLowerBound`, `GetUpperBound`, `GetLength`.
- Also, loop cloning does not currently implement support for cloning loops with MD indexing expressions.

It's interesting to note that MD arrays are simpler in one way compared to jagged arrays: the bounds for each dimension are invariant on an array object, whereas for jagged arrays the array bounds for "child" arrays might change. That is, for `a[i][j]`, an array bounds check on `a[i].Length` must be done for every varying `i` in a loop before accessing `a[i][j]`. However, for `a[i,j]` it may be possible to hoist the `a[i,]` dimension check.

For example, in the above example, cloning could generate a single cloned "slow path" loop with all the cloning condition checks (essentially, bounds checks) outside the outer loop. In pseudo-code, this would be (ignoring non-zero lower bounds):
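A sketch of that structure for the copy example (hypothetical shape; the precise cloning conditions would also cover whatever the bounds checks require):

```csharp
static void Copy3D(int[,,] a, int[,,] b, int n)
{
    // All cloning conditions (essentially bounds checks) tested once, here,
    // outside the outer loop:
    if (a != null && b != null
        && n <= a.GetLength(0) && n <= a.GetLength(1) && n <= a.GetLength(2)
        && n <= b.GetLength(0) && n <= b.GetLength(1) && n <= b.GetLength(2))
    {
        // Fast path: the full loop nest, with no bounds checks.
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    b[i, j, k] = a[i, j, k];
    }
    else
    {
        // Slow path: the original loop nest, with all bounds checks.
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    b[i, j, k] = a[i, j, k];
    }
}
```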
Note that the slow path might be slower than optimal if only some cloning conditions fail, but code expansion would be less, and we assume that the slow path is rarely if ever run.
Currently, with jagged arrays, loop cloning might clone at every level of the loop nest, which creates far too much code expansion for nested loops.
Thus, to improve code generation:

- Split the `ARR_INDEX` node into one node that does a bounds check and one that computes the effective index. To avoid duplicating the effective index calculation (needed by both parts), this might require a temp and comma, e.g., ...

You can see a description of the internal data layout format in the comments for the `RawArrayData` type and the `CORINFO_Array` type.
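As a rough sketch from memory (not a verbatim copy of those comments), the MD array layout places the per-dimension bounds between the standard length field and the element data:

```
// MethodTable*           : object type
// int32 numComponents    : total element count
// int32 length[rank]     : per-dimension lengths
// int32 lowerBound[rank] : per-dimension lower bounds
// ... element data ...
```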
Labels: category:cq, theme:md-arrays, skill-level:expert, cost:large