# StableHLO Bounded Dynamism

[StableHLO](https://github.com/openxla/stablehlo) is an operation set that
expresses ML computations. It was originally bootstrapped from the [MHLO
dialect](https://github.com/tensorflow/mlir-hlo#meta-hlo-dialect-mhlo),
inheriting its type system and some of its ops. This RFC describes the current
status of and rationale for the bounded dynamism constructs in StableHLO. In
particular, this RFC doesn't propose any further changes to the current state.

Bounded dynamism allows programs to represent the maximum runtime size that a
particular dynamic dimension of a tensor can have. This makes it possible to run
such programs on platforms that don't support dynamic tensors but can support
them if the upper bounds of tensor dimensions are known at compilation time.
Applications include:

* Real-time inference without having to wait for accumulation up to a particular
  batch size.
* Programs whose intermediate tensor shapes depend on the operands. For example,
  the [`stablehlo.dynamic_broadcast_in_dim`](https://github.com/openxla/stablehlo/blob/ff55f9346d54e9e38de807a79f8ae03faffda274/stablehlo/dialect/StablehloOps.td#L1838)
  op with statically known upper bounds on its `output_dimensions` operand (see
  the sketch after this list).
* Opening up performance optimization opportunities.
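
As a minimal sketch of the second application (this example is not part of the
original RFC text): a `stablehlo.dynamic_broadcast_in_dim` op whose
`output_dimensions` operand is computed at runtime but is assumed to never
exceed `8`, so the result can carry a bound:

```mlir
func.func @bounded_broadcast(%arg: tensor<1xf32>, %out_dims: tensor<1xi32>)
    -> tensor<?xf32, #stablehlo.type_extensions<bounds = [8]>> {
  // The runtime value of %out_dims picks the result size; the bound promises
  // that it never exceeds 8 (an assumption made for this sketch).
  %0 = "stablehlo.dynamic_broadcast_in_dim"(%arg, %out_dims) {
    broadcast_dimensions = dense<[0]> : tensor<1xi64>
  } : (tensor<1xf32>, tensor<1xi32>)
      -> tensor<?xf32, #stablehlo.type_extensions<bounds = [8]>>
  func.return %0 : tensor<?xf32, #stablehlo.type_extensions<bounds = [8]>>
}
```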
|
||
## Non Goals | ||

* Provide a [value inference](https://github.com/openxla/xla/blob/9e05932a2ceadea080dc9494cfe9d735f94c4e68/xla/client/value_inference.h)-like
  utility for producers that want to generate `set_dimension_size` ops. Value
  inference depends on constant folding for StableHLO ops, which is currently a
  work in progress. There will be a separate RFC for value inference, subject to
  separate approval. Note that producers that generate unbounded programs don't
  need this in StableHLO.
* Provide a transformation that converts StableHLO programs to bounded StableHLO
  programs. However, there is a plan to have such a conversion in MHLO, although
  the details are out of scope for this RFC.

## Detailed Proposal

### (P1) Bounded tensor type using the encoding field in RankedTensorType

Bounds of a dynamic tensor are represented using the `TypeExtensionsAttr` in the
`RankedTensorType` encoding field. The bounds in `TypeExtensionsAttr` are an
`int64_t` array whose size equals the rank of the tensor. Values corresponding
to static dimensions must be `ShapedType::kDynamicSize`, which is printed as `?`
in the IR. Disallowing a static bound value for static dimensions keeps the IR
canonical and makes it possible to infer that a dimension is dynamic whenever
its bound value is static.

The following type represents a 2D tensor, with the size of the 0th dimension
being up to 3 and the size of the 1st dimension being exactly 5:

```mlir
tensor<?x5xf32, #stablehlo.type_extensions<bounds = [3, ?]>>
```
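
As a counter-sketch (not from the RFC itself), the following encoding would be
disallowed, because it supplies a static bound value (`5`) for the already
static 1st dimension:

```mlir
tensor<?x5xf32, #stablehlo.type_extensions<bounds = [3, 5]>>
```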

Type compatibility in StableHLO also checks for compatibility of the bounds. For
example, the type above is compatible with `tensor<2x5xf32>` but not with
`tensor<7x5xf32>`, as the latter doesn't respect the bound `3` on the 0th
dimension.

```mlir
func.func @bounds_compatibility(%arg0: tensor<?xf32, #stablehlo.type_extensions<bounds = [3]>>,
                                %arg1: tensor<?xf32, #stablehlo.type_extensions<bounds = [2]>>,
                                %arg2: tensor<?xf32, #stablehlo.type_extensions<bounds = [4]>>,
                                %arg3: tensor<2xf32>,
                                %arg4: tensor<4xf32>) {
  // %arg0 is compatible with %arg1, %arg2 and %arg3 as all of these types
  // could be tensor<2xf32> at runtime.
  %0 = "stablehlo.add"(%arg0, %arg1) : (tensor<?xf32, #stablehlo.type_extensions<bounds = [3]>>, tensor<?xf32, #stablehlo.type_extensions<bounds = [2]>>) -> tensor<?xf32, #stablehlo.type_extensions<bounds = [2]>>
  %1 = "stablehlo.add"(%arg0, %arg2) : (tensor<?xf32, #stablehlo.type_extensions<bounds = [3]>>, tensor<?xf32, #stablehlo.type_extensions<bounds = [4]>>) -> tensor<?xf32, #stablehlo.type_extensions<bounds = [3]>>
  %2 = "stablehlo.add"(%arg0, %arg3) : (tensor<?xf32, #stablehlo.type_extensions<bounds = [3]>>, tensor<2xf32>) -> tensor<2xf32>

  // This is illegal as the operands have incompatible types. %arg0 can be
  // tensor<0xf32>, tensor<1xf32>, tensor<2xf32> or tensor<3xf32> at runtime,
  // none of which are compatible with tensor<4xf32>.
  %3 = "stablehlo.add"(%arg0, %arg4) : (tensor<?xf32, #stablehlo.type_extensions<bounds = [3]>>, tensor<4xf32>) -> tensor<*xf32>
  func.return
}
```

Currently, the StableHLO dialect uses the MLIR ranked tensor type to represent
bounds. In the future, we plan to introduce a StableHLO type which supports
bounds, along with a custom pretty-printing format; there will be a separate RFC
on this. That proposal will follow the StableHLO backward compatibility
policies, so it is safe to use `TypeExtensionsAttr` currently.

### (P2) StableHLO op semantics with bounded operands or results

All ops that support dynamic operands or results can have bounds specified for
them. However, the result types need to be compatible with the inferred result
types. This allows a result type to be more generic or more specific, as long as
it is compatible with the inferred type.
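
A minimal sketch (not from the RFC itself) of what this flexibility allows,
using `stablehlo.abs` on a hypothetical operand bounded by 4:

```mlir
func.func @compatible_result_types(%arg0: tensor<?xf32, #stablehlo.type_extensions<bounds = [4]>>) {
  // Result type exactly matches the inferred type.
  %0 = "stablehlo.abs"(%arg0) : (tensor<?xf32, #stablehlo.type_extensions<bounds = [4]>>) -> tensor<?xf32, #stablehlo.type_extensions<bounds = [4]>>

  // More generic than inferred: the bound is dropped.
  %1 = "stablehlo.abs"(%arg0) : (tensor<?xf32, #stablehlo.type_extensions<bounds = [4]>>) -> tensor<?xf32>

  // More specific than inferred: a static size that respects the bound.
  %2 = "stablehlo.abs"(%arg0) : (tensor<?xf32, #stablehlo.type_extensions<bounds = [4]>>) -> tensor<3xf32>
  func.return
}
```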

Separately, the StableHLO specification will be updated to cover bounded types
for all the relevant ops.

### (P3) `get_dimension_size` / `set_dimension_size` ops

The `get_dimension_size` op takes a tensor and a dimension index and returns the
runtime size of that dimension as a `tensor<i32>`.

The following example concatenates an input that has up to `16` elements with
itself and returns the runtime size of the concatenation result:

```mlir
func.func @self_concat_size(%data: tensor<?xi32, #stablehlo.type_extensions<bounds = [16]>>) -> tensor<i32> {
  %concat = "stablehlo.concatenate"(%data, %data) {dimension = 0 : i64}
      : (tensor<?xi32, #stablehlo.type_extensions<bounds = [16]>>,
         tensor<?xi32, #stablehlo.type_extensions<bounds = [16]>>)
      -> tensor<?xi32, #stablehlo.type_extensions<bounds = [32]>>

  %result = stablehlo.get_dimension_size %concat, dim = 0
      : (tensor<?xi32, #stablehlo.type_extensions<bounds = [32]>>) -> tensor<i32>

  func.return %result : tensor<i32>
}
```

The `set_dimension_size` op takes a static or bounded tensor, a runtime size and
a dimension index, and returns a tensor whose size in that dimension is set to
the specified runtime size. This size needs to be less than or equal to the
static size or bound of the dimension. The operation can be thought of as either
a slice or a pad, depending on whether the new dimension size is smaller or
larger than the previous one, respectively. In case the dimension size is
increased, the padded values are undefined.
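
To illustrate both directions, here is a hypothetical sketch (not from the RFC
itself) where the same op shrinks or grows a bounded dimension depending on the
runtime size argument:

```mlir
func.func @resize_dimension(%data: tensor<?xi32, #stablehlo.type_extensions<bounds = [4]>>,
                            %new_size: tensor<i32>) -> tensor<?xi32, #stablehlo.type_extensions<bounds = [4]>> {
  // If %new_size is smaller than the runtime size of %data, this acts like a
  // slice; if larger (but still <= the bound 4), it acts like a pad whose
  // padded values are undefined. The operand dimension is not static, so the
  // result has the same type as the operand.
  %0 = stablehlo.set_dimension_size %data, %new_size, dim = 0
      : (tensor<?xi32, #stablehlo.type_extensions<bounds = [4]>>, tensor<i32>)
      -> tensor<?xi32, #stablehlo.type_extensions<bounds = [4]>>
  func.return %0 : tensor<?xi32, #stablehlo.type_extensions<bounds = [4]>>
}
```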

In the following example, the `set_dimension_size` op is used to set the size of
the first dimension so that the function performs a sum reduction over the first
`batch_size` elements of the input. With data argument `[1, 2, 3, 4]` and
`batch_size` argument `2`, the following function returns `3`, but it returns
`6` for the same data argument when the `batch_size` is `3`. The
`set_dimension_size` op also sets the bound on the returned tensor. This bound
is the operand's static size if the operand shape is static; it is `4` in this
example. If the operand dimension is not static, then the returned tensor has
the same type as the operand.

```mlir
func.func @dynamic_sum(%data: tensor<4xi32>, %batch_size: tensor<i32>) -> tensor<i32> {
  %dynamic_data = stablehlo.set_dimension_size %data, %batch_size, dim = 0
      : (tensor<4xi32>, tensor<i32>) -> tensor<?xi32, #stablehlo.type_extensions<bounds = [4]>>

  %zero = stablehlo.constant dense<0> : tensor<i32>
  %sum = "stablehlo.reduce"(%dynamic_data, %zero) ({
  ^bb0(%arg1: tensor<i32>, %arg2: tensor<i32>):
    %add = stablehlo.add %arg1, %arg2 : tensor<i32>
    stablehlo.return %add : tensor<i32>
  }) {dimensions = dense<[0]> : tensor<1xi64>}
      : (tensor<?xi32, #stablehlo.type_extensions<bounds = [4]>>,
         tensor<i32>) -> tensor<i32>
  func.return %sum : tensor<i32>
}
```

### (P4) Aspirational: Migration to unbounded dynamism

In addition to the `set_dimension_size` and `get_dimension_size` ops, StableHLO
producers may also use unbounded dynamic ops like `real_dynamic_slice` and
`dynamic_pad` to perform operations on dynamically shaped tensors. For example,
the above `@dynamic_sum` computation can be performed using the
`real_dynamic_slice` op instead of the `set_dimension_size` op, as follows:

```mlir
func.func @dynamic_sum(%data: tensor<4xi32>, %batch_size: tensor<i32>) -> tensor<i32> {
  %start = stablehlo.constant dense<0> : tensor<1xi32>
  %limit = stablehlo.reshape %batch_size : (tensor<i32>) -> tensor<1xi32>
  %strides = stablehlo.constant dense<1> : tensor<1xi32>
  %dynamic_data = stablehlo.real_dynamic_slice %data, %start, %limit, %strides
      : (tensor<4xi32>, tensor<1xi32>, tensor<1xi32>, tensor<1xi32>)
      -> tensor<?xi32>

  %zero = stablehlo.constant dense<0> : tensor<i32>
  %sum = "stablehlo.reduce"(%dynamic_data, %zero) ({
  ^bb0(%arg1: tensor<i32>, %arg2: tensor<i32>):
    %add = stablehlo.add %arg1, %arg2 : tensor<i32>
    stablehlo.return %add : tensor<i32>
  }) {dimensions = dense<[0]> : tensor<1xi64>}
      : (tensor<?xi32>, tensor<i32>) -> tensor<i32>
  func.return %sum : tensor<i32>
}
```

Originally, HLO introduced the `set_dimension_size` op because it had neither
dynamic types nor dynamic ops. The StableHLO dialect doesn't have these
limitations, so new users don't need to make use of this low-level op unless
they are moving from HLO or MHLO to StableHLO; the dynamic operations
`real_dynamic_slice` and `dynamic_pad` can be used instead. The TensorFlow and
JAX teams believe this hypothesis is correct based on their experience so far.

The following examples demonstrate the differences between programs using
unbounded dynamism and bounded dynamism.

```mlir
func.func @slice_with_unbounded_dynamism(%data: tensor<7xf32>, %start: tensor<1xi32>, %limit: tensor<1xi32>) -> tensor<?xf32> {
  %strides = stablehlo.constant dense<1> : tensor<1xi32>
  %result = stablehlo.real_dynamic_slice %data, %start, %limit, %strides
      : (tensor<7xf32>, tensor<1xi32>, tensor<1xi32>, tensor<1xi32>)
      -> tensor<?xf32>
  func.return %result : tensor<?xf32>
}
```

```mlir
func.func @slice_with_bounded_dynamism(%data: tensor<7xf32>, %start: tensor<1xi32>, %limit: tensor<1xi32>) -> tensor<?xf32, #stablehlo.type_extensions<bounds = [7]>> {
  // Add padding to avoid out-of-bounds access in the following slice op.
  %pad_value = stablehlo.constant dense<0.0> : tensor<f32>
  %padded_data = stablehlo.pad %data, %pad_value, low = [0], high = [7], interior = [0]
      : (tensor<7xf32>, tensor<f32>) -> tensor<14xf32>

  // Extract the largest possible slice starting at the start index.
  %scalar_start = stablehlo.reshape %start : (tensor<1xi32>) -> tensor<i32>
  %padded_result = stablehlo.dynamic_slice %padded_data, %scalar_start, sizes = [7]
      : (tensor<14xf32>, tensor<i32>) -> tensor<7xf32>

  // Remove the extra elements extracted beyond the limit.
  %slice_size = stablehlo.subtract %limit, %start : tensor<1xi32>
  %scalar_size = stablehlo.reshape %slice_size : (tensor<1xi32>) -> tensor<i32>
  %result = stablehlo.set_dimension_size %padded_result, %scalar_size, dim = 0
      : (tensor<7xf32>, tensor<i32>)
      -> tensor<?xf32, #stablehlo.type_extensions<bounds = [7]>>

  func.return %result : tensor<?xf32, #stablehlo.type_extensions<bounds = [7]>>
}
```

Using unbounded dynamic ops instead of the `set_dimension_size` op has a couple
of benefits:

* It greatly simplifies lowerings from higher-level frameworks to StableHLO, as
  they don't need to compute upper bounds for dynamic dimensions.
* It makes lowerings to StableHLO hardware-agnostic, as they don't depend on
  whether the hardware requires unbounded or bounded programs.

The benefit of the `set_dimension_size` op:

Given that the runtime size argument of the `set_dimension_size` op is required
to be less than or equal to the static size or bound, a compiler can separately
track the runtime size of the tensor and keep a buffer of fixed size according
to the bound. This avoids any data movement for the `set_dimension_size` op at
the cost of extra memory. However, compilers should make the trade-off between
copying and additional memory based on the hardware capabilities, not the
frameworks. It is easy to lower a slice op to the `set_dimension_size` op, but
going in the other direction is tricky: that would require program analysis to
make sure that the size of the buffer is not increased later on.
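
A hypothetical sketch (not from the RFC itself) of the easy direction: a slice
that starts at `0` with stride `1` only trims trailing elements, so it can be
rewritten as a `set_dimension_size` that avoids data movement. The `%limit`
operand is assumed to never exceed the static size `8`:

```mlir
func.func @slice_as_set_dimension_size(%data: tensor<8xf32>, %limit: tensor<1xi32>)
    -> tensor<?xf32, #stablehlo.type_extensions<bounds = [8]>> {
  // Keep the first %limit elements; no bytes are copied, only the tracked
  // runtime size changes.
  %size = stablehlo.reshape %limit : (tensor<1xi32>) -> tensor<i32>
  %0 = stablehlo.set_dimension_size %data, %size, dim = 0
      : (tensor<8xf32>, tensor<i32>)
      -> tensor<?xf32, #stablehlo.type_extensions<bounds = [8]>>
  func.return %0 : tensor<?xf32, #stablehlo.type_extensions<bounds = [8]>>
}
```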

## Alternatives Considered

### Not having a bounded type and/or the `set_dimension_size` op

Given the availability of unbounded dynamism, could StableHLO just have a
function attribute to store the input bounds, instead of having a bounded type
and the `set_dimension_size` op? This might be possible, but it would pose
significant challenges for existing users generating bounded programs. The
current proposal allows users to incrementally move to unbounded dynamism for
new implementations while immediately making use of StableHLO, without
generating a mix of StableHLO and MHLO programs.