From e264452f555a3c9373890848a72837ccaa557385 Mon Sep 17 00:00:00 2001 From: Smit Hinsu Date: Sun, 25 Sep 2022 23:12:51 -0700 Subject: [PATCH 01/17] Bounded Dynamism RFC --- docs/bounded_dynamism.md | 85 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 85 insertions(+) create mode 100644 docs/bounded_dynamism.md diff --git a/docs/bounded_dynamism.md b/docs/bounded_dynamism.md new file mode 100644 index 00000000000..60e139b514a --- /dev/null +++ b/docs/bounded_dynamism.md @@ -0,0 +1,85 @@ +# StableHLO Bounded Dynamism + +[StableHLO](https://github.com/openxla/stablehlo) is an operation set that expresses ML computations. It has been originally bootstrapped from [the MHLO dialect](https://github.com/tensorflow/mlir-hlo#meta-hlo-dialect-mhlo), including inheriting the type and some of the ops. This RFC aims to describe the current status and future plans for bounded dynamism in StableHLO. In particular, this RFC doesn’t propose any further major changes to the current state. The only minor change proposed is allowing `i64` typed result for `get_dimension_size` op and `i64` type for `set_dimension_size` `size` operand. This is further described in P3. + +Bounded dynamism allows computations to encode the maximum runtime size of dynamic dimensions of tensors. This makes it possible to run particular computations on hardware that doesn’t support dynamic tensors and needs hints on the largest possible sizes. For example, real time inference might require support for dynamic batch so that inference doesn’t need to wait for accumulation up to the static batch size. Similarly, during training the last batch in a dataset might be smaller than the max batch size if the number of examples in the dataset is not divisible by the batch size. + + # Proposal + + ## (P1) Use RankedTensorType encoding field for bounds + Bounds for a dynamic tensor are represented using the `TypeExtensionsAttr` in the encoding field of the `RankedTensorType`. 
Bounds in `TypeExtensionsAttr` is an `int64_t` array of size equal to rank of the tensor. Values corresponding to static dimensions must be `ShapedType::kDynamicSize` which is `-1`. + +For example, the following type represents a 2D tensor with up to `3` rows and exactly `5` columns. + + ``` +tensor<?x5xf32, #stablehlo.type_extensions<bounds = [3, -1]>> +``` + + Type compatibility in StableHLO also checks for compatibility of the bounds. Two types are compatible if there exists a runtime tensor that could match both the types. So, two types with different bounds are compatible but a type with bounds that is lower than the static size in the other type are not compatible. The above example type is compatible with `tensor<2x5xf32>` but not with `tensor<7x5xf32>` as it doesn’t respect the bound `3` on the first dimension. + +Currently, StableHLO dialect is using the MLIR core ranked tensor type to encode bounds. It should be noted that there is a plan to introduce a custom StableHLO type in the future that could natively support bounds along with custom pretty printing format. There will be a separate RFC describing all the details. Also, the proposal will follow StableHLO backward compatibility policies so it is safe to use this type currently. + + ## (P2) StableHLO op semantics with bounded operands or results + + All ops that support dynamic operands or results can have bounds specified for them. However, the result types needs to be compatible with the inferrable result types using the operands and attributes. This allows result types to be more generic or specific as long as it is compatible with the inferred type. + + ## (P3) get\_dimension\_size / set\_dimension\_size ops + + The `get_dimension_size` op takes a tensor and dimension index as operands and returns the actual size of the dimension at runtime as `i32` type scalar. + + The following example returns the size of the result after concatenating input that has up to `16` elements with self and returns the actual runtime size of the concatenation result. 
+ + ``` +func.func @self_concat_size(%data: tensor<?xf32, #stablehlo.type_extensions<bounds = [16]>>) -> tensor<i32> { + %concat = "stablehlo.concatenate"(%data, %data) {dimension = 0 : i64} + : (tensor<?xf32, #stablehlo.type_extensions<bounds = [16]>>, + tensor<?xf32, #stablehlo.type_extensions<bounds = [16]>>) + -> tensor<?xf32, #stablehlo.type_extensions<bounds = [32]>> + + %result = "stablehlo.get_dimension_size"(%concat) {dimension = 0 : i64} + : (tensor<?xf32, #stablehlo.type_extensions<bounds = [32]>>) -> tensor<i32> + + func.return %result : tensor<i32> +} +``` + + The `set_dimension_size` op takes a tensor, runtime size and dimension index as operands and returns a tensor whose logical size of the particular dimension is set to the specified value. This size needs to be less than or equal to the static size, if available. + +In the following example, set dimension size is used to set the logical size of the first dimension. With input `[1, 2, 3, 4]`, it returns `3` with batch\_size equal to `2` but returns `6` with batch\_size `3`. + + ``` +func.func @dynamic_sum(%data: tensor<4xi32>, %batch_size: tensor<i32>) -> tensor<i32> { + %dim = stablehlo.constant dense<0> : tensor<i32> + %dynamic_data = "stablehlo.set_dimension_size"(%data, %batch_size, %dim) + : (tensor<4xi32>, tensor<i32>, tensor<i32>) + -> tensor<?xi32, #stablehlo.type_extensions<bounds = [4]>> + + %zero = stablehlo.constant dense<0> : tensor<i32> + %sum = "stablehlo.reduce"(%dynamic_data, %zero) ({ + ^bb0(%arg1: tensor<i32>, %arg2: tensor<i32>): + %add = stablehlo.add %arg1, %arg2 : tensor<i32> + "stablehlo.return"(%add) : (tensor<i32>) -> () + }) {dimensions = dense<[0]> : tensor<1xi64>} + : (tensor<?xi32, #stablehlo.type_extensions<bounds = [4]>>, + tensor<i32>) -> tensor<i32> + func.return %sum : tensor<i32> +} +``` + + Currently, the `get_dimension_size` result and the `set_dimension_size` `size` operand only support `i32` type for legacy reasons. This is not consistent with other types for dimension sizes so it should be expanded to also allow `i64` typed operand for `size`. The default result type for `get_dimension_size` should be `i64` in the shape inference functions but `i32` types should be allowed as well. 
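For readers less familiar with these two ops, their runtime behavior can be sketched with a small Python model (the `BoundedTensor` class and the helper functions below are illustrative stand-ins for this RFC's examples, not part of StableHLO or any real API):

```python
# Illustrative model of a rank-1 bounded tensor: a capacity (the bound)
# plus a logical size, here represented by the length of `data`.
class BoundedTensor:
    def __init__(self, data, bound):
        assert len(data) <= bound, "logical size must not exceed the bound"
        self.data = list(data)
        self.bound = bound

def get_dimension_size(t):
    # Returns the runtime (logical) size of the dimension.
    return len(t.data)

def set_dimension_size(t, size):
    # The new logical size must be <= the bound (the static size).
    assert size <= t.bound
    if size <= len(t.data):
        return BoundedTensor(t.data[:size], t.bound)
    # Growing the logical size pads with undefined values (None here).
    return BoundedTensor(t.data + [None] * (size - len(t.data)), t.bound)

# Mirrors @dynamic_sum above: reduce only the first batch_size elements.
def dynamic_sum(data, batch_size):
    return sum(set_dimension_size(data, batch_size).data)

t = BoundedTensor([1, 2, 3, 4], bound=4)
assert dynamic_sum(t, 2) == 3
assert dynamic_sum(t, 3) == 6
```

In this model, `get_dimension_size` reports the runtime length rather than the bound, which is exactly the distinction these ops exist to expose.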
+ + +## (P4) Prefer generic dynamic ops over set\_dimension\_size op + +Note that in the above example `@dynamic_sum`, the same computation can be done by using the `slice` op instead of the `set_dimension_size` op. Note that the `slice` op is not a drop in replacement for the `set_dimension_size` op as it doesn’t allow setting the dimension size to higher than the existing logical size. However, extending size is not generally required in real world programs. Therefore, it is preferable to use relevant dynamic ops over restricting the StableHLO program to have all static or bounded types. This applies to hardware like XLA:TPU as well that requires programs to be either static or bounded. But, this need not happen at the StableHLO level and it could have dynamic types that are then refined to bounded types in later stages of the compilation. This way StableHLO producers can be hardware and compiler agnostics and semantics are obvious to anyone not familiar with bounded dynamism ops. However, this approach won’t work if the intended consumer of StableHLO doesn’t support dynamic ops by converting inputs to bounded versions internally. In that case, `set_dimension_size` op would be the only way to represent bounded dynamism. 
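To make the trade-off just described concrete, here is a plain-Python sketch (lists stand in for tensors; the function name is illustrative) of why a slice covers the common case but cannot replace `set_dimension_size` when a dimension must grow:

```python
# A slice can only shrink the logical size; it covers the common
# "process the first n elements" pattern such as @dynamic_sum above.
def dynamic_sum_via_slice(data, batch_size):
    return sum(data[:batch_size])

assert dynamic_sum_via_slice([1, 2, 3, 4], 2) == 3
assert dynamic_sum_via_slice([1, 2, 3, 4], 3) == 6

# Extending a dimension is the one thing a slice cannot express:
# slicing a 4-element buffer can never yield 6 elements.
assert len([1, 2, 3, 4][:6]) == 4
```

Because extending a logical size is rarely needed in real-world programs, the slice form loses little expressiveness while staying obvious to readers unfamiliar with bounded dynamism.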
From 43f028a676ec0acb2b907c6ead25f0327e3f4833 Mon Sep 17 00:00:00 2001 From: smit-hinsu Date: Thu, 8 Dec 2022 04:31:09 -0800 Subject: [PATCH 02/17] Move bounded dynamism RFC to rfcs directory --- {docs => rfcs}/bounded_dynamism.md | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename {docs => rfcs}/bounded_dynamism.md (100%) diff --git a/docs/bounded_dynamism.md b/rfcs/bounded_dynamism.md similarity index 100% rename from docs/bounded_dynamism.md rename to rfcs/bounded_dynamism.md From c70c6a93fca1c991e3f9383a5e7304362bdba507 Mon Sep 17 00:00:00 2001 From: smit-hinsu Date: Thu, 8 Dec 2022 04:44:01 -0800 Subject: [PATCH 03/17] Expand section recommending unbounded dynamism and other updates --- rfcs/bounded_dynamism.md | 224 ++++++++++++++++++++++++++++++++------- 1 file changed, 186 insertions(+), 38 deletions(-) diff --git a/rfcs/bounded_dynamism.md b/rfcs/bounded_dynamism.md index 60e139b514a..ae6642a08d6 100644 --- a/rfcs/bounded_dynamism.md +++ b/rfcs/bounded_dynamism.md @@ -1,85 +1,233 @@ # StableHLO Bounded Dynamism -[StableHLO](https://github.com/openxla/stablehlo) is an operation set that expresses ML computations. It has been originally bootstrapped from [the MHLO dialect](https://github.com/tensorflow/mlir-hlo#meta-hlo-dialect-mhlo), including inheriting the type and some of the ops. This RFC aims to describe the current status and future plans for bounded dynamism in StableHLO. In particular, this RFC doesn’t propose any further major changes to the current state. The only minor change proposed is allowing `i64` typed result for `get_dimension_size` op and `i64` type for `set_dimension_size` `size` operand. This is further described in P3. - -Bounded dynamism allows computations to encode the maximum runtime size of dynamic dimensions of tensors. This makes it possible to run particular computations on hardware that doesn’t support dynamic tensors and needs hints on the largest possible. 
For example, real time inference might require support for dynamic batch so that inference doesn’t need to wait for accumulation up to the static batch size. Similarly, during training the last batch in a dataset might be smaller than the max batch size if the number of examples in the dataset is not divisible by the batch size. - - -# Proposal - - -## (P1) Use RankedTensorType encoding field for bounds - -Bounds for a dynamic tensor are represented using the `TypeExtensionsAttr` using the encoding field in the `RankedTensorType`. Bounds in `TypeExtensionsAttr` is an `int64_t` array of size equal to rank of the tensor. Values corresponding to static dimensions must be `ShapedType::kDynamicSize` which is `-1`. - -For example, the following type represents a 2D tensor with up to `3` rows and exactly `5` columns. - +[StableHLO](https://github.com/openxla/stablehlo) is an operation set that +expresses ML computations. It has been originally bootstrapped from the [MHLO +dialect](https://github.com/tensorflow/mlir-hlo#meta-hlo-dialect-mhlo), +including inheriting the type and some ops. This RFC aims to describe the +current status and rationale for bounded dynamism constructs in StableHLO and +provides recommendations to StableHLO producers and consumers. In particular, +this RFC doesn’t propose any further changes to the current state but this RFC +should still be used to revisit those decisions given that those weren't +reviewed. + +Bounded dynamism allows programs to represent the maximum runtime size that a +particular dynamic dimensions of tensor can have. This makes it possible to run +such programs on hardware that don't support dynamic tensors but could support +it if the upper bounds of tensor dimensions are known at the compilation time. +With bounded dynamism, real time inference systems don't need to wait for +accumulation up to a particular batch size on these hardware. 
Bounded dynamism +also makes it possible to support programs whose intermediate tensor shapes +depend on the inputs. For example, +[`stablehlo.dynamic_broadcast_in_dim`](https://github.com/openxla/stablehlo/blob/ff55f9346d54e9e38de807a79f8ae03faffda274/stablehlo/dialect/StablehloOps.td#L1838) +op but with statically known upper bounds of `output_dimensions` operand. Even +on hardware that supports dynamic tensors, bounded dynamism can open up +opportunities of performance optimizations. + +# Recommendations + +## StableHLO Producers + +* Producers should use bounded tensor type representation as described in P1. +* Producers are encouraged to use unbounded dynamic operations for reasons + described in P4. +* Producers can still use `get_dimension_size` and `set_dimension_size` ops + described in P3 for the ease of transition to StableHLO and faster adoption of + StableHLO. + +## StableHLO Consumers + +* Consumers should aim to support unbounded programs and can optionally make use + of bounds on tensors for optimizations. +* Consumers that support unbounded programs can safely ignore the bounds + completely without affecting the correctness. +* Consumers that only support bounded programs could first transform the given + program to a bounded one through program analysis. +* Consumers can choose to not support `get_dimension_size` and + `set_dimension_size` ops until there is a motivating use-case. + +# Non Goals + +* Provide [value inference](https://github.com/openxla/xla/blob/9e05932a2ceadea080dc9494cfe9d735f94c4e68/xla/client/value_inference.h) + like utility for producers that want to generate the `set_dimension_size` op. + Value inference depends on constant folding for StableHLO ops which is a work + in progress currently. Value inference will be designed and provided + separately in the future. Note that producers that generate unbounded programs + don't need this in StableHLO. 
+* Provide a transformation that converts StableHLO programs to bounded StableHLO + programs, if possible. There is a plan to have such a conversion in MHLO and + StableHLO users can utilize that by round tripping to MHLO. Details of this + are outside the scope of this RFC. + +# Detailed Proposal + + +## (P1) Bounded tensor type using the encoding field in the RankedTensorType + +Bounds for a dynamic tensor are represented using the `TypeExtensionsAttr` in +the `RankedTensorType` encoding field. Bounds in `TypeExtensionsAttr` is an +`int64_t` array of size equal to rank of the tensor. Values corresponding to +static dimensions must be `ShapedType::kDynamicSize` which is printed as `?` in +the IR. Disallowing a static bound value for static dimensions makes the IR +canonical and makes it possible to infer that the dimension is dynamic if the +bound value is static. + +The following type represents a 2D tensor with up to `3` rows and exactly `5` +columns. ``` -tensor> +tensor> ``` +Type compatibility in StableHLO also checks for compatibility of the bounds. Two +types are compatible if there exists a runtime tensor that could match both the +types. So, two types with different bounds are compatible but a type with bounds +that is lower than the static size in the other type are not compatible. The +above example type is compatible with `tensor<2x5xf32>` but not with +`tensor<7x5xf32>` as it doesn’t respect the bound `3` on the first dimension. -Type compatibility in StableHLO also checks for compatibility of the bounds. Two types are compatible if there exists a runtime tensor that could match both the types. So, two types with different bounds are compatible but a type with bounds that is lower than the static size in the other type are not compatible. The above example type is compatible with `tensor<2x5xf32>` but not with `tensor<7x5xf32>` as it doesn’t respect the bound `3` on the first dimension. 
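The compatibility rule above can be sketched as a small Python predicate (the `(size, bound)` pair encoding is illustrative, with `None` standing for `?`/no bound; this is not the actual StableHLO implementation):

```python
def dims_compatible(a, b):
    """Each dim is a (size, bound) pair; None means ? / unbounded.
    Two dims are compatible if some runtime size satisfies both."""
    (size_a, bound_a), (size_b, bound_b) = a, b
    if size_a is not None and size_b is not None:
        return size_a == size_b  # both static: must match exactly
    if size_a is None and size_b is None:
        return True  # two dynamic dims: differing bounds never conflict
    # One static, one dynamic: the static size must respect the bound.
    static = size_a if size_a is not None else size_b
    bound = bound_a if size_a is None else bound_b
    return bound is None or static <= bound

# tensor<?x5xf32, bounds = [3, ?]> vs tensor<2x5xf32>: compatible.
assert dims_compatible((None, 3), (2, None))
# vs tensor<7x5xf32>: 7 exceeds the bound 3, so incompatible.
assert not dims_compatible((None, 3), (7, None))
# Two dynamic dims with different bounds are still compatible.
assert dims_compatible((None, 3), (None, 8))
```

A full type-compatibility check would simply apply this predicate dimension by dimension after comparing element types and ranks.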
- -Currently, StableHLO dialect is using the MLIR core ranked tensor type to encode bounds. It should be noted that there is a plan to introduce a custom StableHLO type in the future that could natively support bounds along with custom pretty printing format. There will be a separate RFC describing all the details. Also, the proposal will follow StableHLO backward compatibility policies so it is safe to use this type currently. +Currently, the StableHLO dialect is using the MLIR core ranked tensor type to +represent bounds. It should be noted that there is a plan to introduce a custom +StableHLO type in the future that could natively support bounds along with +custom pretty printing format. There will be a separate RFC on this. Also, the +proposal will follow StableHLO backward compatibility policies so it is safe to +use this type currently. ## (P2) StableHLO op semantics with bounded operands or results -All ops that support dynamic operands or results can have bounds specified for them. However, the result types needs to be compatible with the inferrable result types using the operands and attributes. This allows result types to be more generic or specific as long as it is compatible with the inferred type. +All ops that support dynamic operands or results can have bounds specified for +them. However, the result types need to be compatible with the inferred result +types. This allows result types to be more generic or specific as long as it is +compatible with the inferred type. +Separately, StableHLO specification will be updated to cover bounded types for +all the relevant ops. ## (P3) get\_dimension\_size / set\_dimension\_size ops -The `get_dimension_size` op takes a tensor and dimension index as operands and returns the actual size of the dimension at runtime as `i32` type scalar. 
- -The following example returns the size of the result after concatenating input that has up to `16` elements with self and returns the actual runtime size of the concatenation result. +The `get_dimension_size` op takes a tensor and dimension index as operands and +returns the actual size of the dimension at runtime as an `i32` type scalar. +The following example returns the size of the result after concatenating input +that has up to `16` elements with self and returns the actual runtime size of +the concatenation result. ``` func.func @self_concat_size(%data: tensor>) -> tensor { %concat = "stablehlo.concatenate"(%data, %data) {dimension = 0 : i64} - : (tensor>, + : (tensor>, tensor>) -> tensor> - %result = "stablehlo.get_dimension_size"(%concat) {dimension = 0 : i64} + %result = stablehlo.get_dimension_size %concat, dim = 0 : (tensor>) -> tensor - + func.return %result : tensor } ``` +The `set_dimension_size` op takes a bounded tensor, runtime size and dimension +index as operands and returns a tensor whose logical size of the particular +dimension is set to the specified size. This size needs to be less than or equal +to the static size or bound for the dimension. This operation can be thought as +either a slice or pad operation depending on if the earlier logical dimension +size is larger or smaller, respectively. In case the dimension size is +increased, the padded values are undefined. -The `set_dimension_size` op takes a tensor, runtime size and dimension index as operands and returns a tensor whose logical size of the particular dimension is set to the specified value. This size needs to be less than or equal to the static size, if available. - -In the following example, set dimension size is used to set the logical size of the first dimension. With input `[1, 2, 3, 4]`, it returns `3` with batch\_size equal to `2` but returns `6` with batch\_size `3`. 
- +In the following example, `set_dimension_size` op is used to set the logical +size of the first dimension so that it performs a sum reduction on the first +`batch_size` elements in the input. With data argument `[1, 2, 3, 4]` and +batch\_size argument `2`, the following function returns `3` but it returns `6` +for the same data argument when the batch\_size is `3`. ``` func.func @dynamic_sum(%data: tensor<4xi32>, %batch_size: tensor) -> tensor { - %dim = stablehlo.constant dense<0> : tensor - %dynamic_data = "stablehlo.set_dimension_size"(%data, %batch_size, %dim) - : (tensor<4xi32>, tensor, tensor) - -> tensor> - + %dynamic_data = stablehlo.set_dimension_size %data, %batch_size, dim = 0 + : (tensor<4xi32>, tensor) -> tensor> + %zero = stablehlo.constant dense<0> : tensor %sum = "stablehlo.reduce"(%dynamic_data, %zero) ({ ^bb0(%arg1: tensor, %arg2: tensor): %add = stablehlo.add %arg1, %arg2 : tensor "stablehlo.return"(%add) : (tensor) -> () }) {dimensions = dense<[0]> : tensor<1xi64>} - : (tensor>, + : (tensor>, tensor) -> tensor func.return %sum : tensor } ``` +## (P4) Prefer generic dynamic ops over set\_dimension\_size op -Currently, `get_dimension_size` result` and set_dimension_size `size operand only supports `i32` type for legacy reasons. This is not consistent with other types for dimension sizes so it should be expanded to also allow `i64` typed operand for `size`. The default result type for `get_dimension_size` should be `i64` in the shape inference functions but allows `i32` types as well. +Note that in the above example of `@dynamic_sum` function, the same computation +can be done by using the `real_dynamic_slice` op instead of the +`set_dimension_size` op. The following example demonstrates this. 
+``` +func.func @dynamic_sum(%data: tensor<4xi32>, %batch_size: tensor) -> tensor { + %start = stablehlo.constant dense<0> : tensor<1xi32> + %limit = stablehlo.reshape %batch_size : (tensor) -> tensor<1xi32> + %strides = stablehlo.constant dense<0> : tensor<1xi32> + %dynamic_data = stablehlo.real_dynamic_slice %data, %start, %limit, %strides + : (tensor<4xi32>, tensor<1xi32>, tensor<1xi32>, tensor<1xi32>) + -> tensor -## (P4) Prefer generic dynamic ops over set\_dimension\_size op + %zero = stablehlo.constant dense<0> : tensor + %sum = "stablehlo.reduce"(%dynamic_data, %zero) ({ + ^bb0(%arg1: tensor, %arg2: tensor): + %add = stablehlo.add %arg1, %arg2 : tensor + "stablehlo.return"(%add) : (tensor) -> () + }) {dimensions = dense<[0]> : tensor<1xi64>} + : (tensor, tensor) -> tensor + func.return %sum : tensor +} +``` -Note that in the above example `@dynamic_sum`, the same computation can be done by using the `slice` op instead of the `set_dimension_size` op. Note that the `slice` op is not a drop in replacement for the `set_dimension_size` op as it doesn’t allow setting the dimension size to higher than the existing logical size. However, extending size is not generally required in real world programs. Therefore, it is preferable to use relevant dynamic ops over restricting the StableHLO program to have all static or bounded types. This applies to hardware like XLA:TPU as well that requires programs to be either static or bounded. But, this need not happen at the StableHLO level and it could have dynamic types that are then refined to bounded types in later stages of the compilation. This way StableHLO producers can be hardware and compiler agnostics and semantics are obvious to anyone not familiar with bounded dynamism ops. However, this approach won’t work if the intended consumer of StableHLO doesn’t support dynamic ops by converting inputs to bounded versions internally. In that case, `set_dimension_size` op would be the only way to represent bounded dynamism. 
+Originally, XLA HLO introduced `set_dimension_size` op as it neither had dynamic +types nor dynamic ops. StableHLO dialect doesn't have these limitations and +therefore new users don't need to make use of this low level op. The TensorFlow +and JAX teams believe belives that this hypothesis should be correct based on +their experiences so far. Dynamic operations `real_dynamic_slice` and +`dynamic_pad` can be used instead. + +Use of dynamic ops over `set_dimension_size` op has various benefits: + +* Greatly simplifies lowering from higher level frameworks to StableHLO as they + don't need to make use value inference to compute bounds or generate low level + ops. +* Opens up the opportunity to share conversion to bounded programs between + frameworks and compilers. Therefore, frameworks can immediately target new + hardware requiring bounded programs even if they didn't already support that. + Data dependent bounded dynamism won't require any changes and input dependent + bounded dynamism can be supported by just specifying the bounds on the inputs. +* Makes lowerings to StableHLO hardware agnostic and they don't depend on if the + compiler requires unbounded or bounded programs. +* Reduces the potential confusion in making use of `set_dimension_size` as the + users are generally not familiar with this op and also the semantics are also + not intuitive. + +It is true that the `set_dimension_size` semantics allows making in-place +updates. However, compilers should be making the trade-off between copy and +additional memory based on the hardware capabilities. It is also possible to +lower slice op to `set_dimension_size` op easily but going in the other +direction is tricky. That would require program analysis to make sure that the +logical size of the buffer is not increased later on. 
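The asymmetry in the previous paragraph can be illustrated with a toy Python rewrite over dictionary-encoded ops (purely illustrative; not a real IR or API):

```python
# Lowering a "keep the first n elements" slice to set_dimension_size is
# a local, syntactic rewrite because the logical size only ever shrinks.
def lower_slice_to_sds(op):
    if (op["name"] == "real_dynamic_slice"
            and op["start"] == 0 and op["stride"] == 1):
        return {"name": "set_dimension_size", "operand": op["operand"],
                "size": op["limit"], "dim": op["dim"]}
    return op  # other slices need a different lowering

sliced = {"name": "real_dynamic_slice", "operand": "%data",
          "start": 0, "limit": "%batch_size", "stride": 1, "dim": 0}
assert lower_slice_to_sds(sliced)["name"] == "set_dimension_size"

# The reverse rewrite is not local: a later set_dimension_size may grow
# the dimension again, which no slice can express, so every use of the
# value would have to be analyzed first.
```

This is why the conversion to bounded programs is better left to a compiler pass that can see the whole program, rather than to each producer.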
+ +# Alternatives Considered + +## Not having bounded type and/or set\_dimension\_size op + +Given the recommendation of using unbounded dynamism, could StableHLO just have +a function attribute to store the input bounds instead of having bounded type +and `set_dimension_size` op? This might be possible but this will pose +significant challenges for existing users generating bounded programs. Current +proposal allows users to incrementally move to the recommended approach for new +implementations while immediately making use of StableHLO without generating a +mix of StableHLO and MHLO programs. + +It is true that having the bounded type and `set_dimension_size` op introduces +some complexity but given that the bounds are optional, users that don't care +about bounded dynamism don't need to worry about these. All the code complexity +is limited to the StableHLO shape functions. These also affects the op +specifications but these should be intuitive to users based on the op semantics +and making use of StableHLO shape functions should hide that as well. From 4e65d6d462b8d415a353418b6e0b642bee5d7697 Mon Sep 17 00:00:00 2001 From: smit-hinsu Date: Thu, 8 Dec 2022 04:56:41 -0800 Subject: [PATCH 04/17] Fix typos --- rfcs/bounded_dynamism.md | 72 ++++++++++++++++++++-------------------- 1 file changed, 36 insertions(+), 36 deletions(-) diff --git a/rfcs/bounded_dynamism.md b/rfcs/bounded_dynamism.md index ae6642a08d6..fa598543f39 100644 --- a/rfcs/bounded_dynamism.md +++ b/rfcs/bounded_dynamism.md @@ -8,20 +8,20 @@ current status and rationale for bounded dynamism constructs in StableHLO and provides recommendations to StableHLO producers and consumers. In particular, this RFC doesn’t propose any further changes to the current state but this RFC should still be used to revisit those decisions given that those weren't -reviewed. +discussed publically. 
Bounded dynamism allows programs to represent the maximum runtime size that a -particular dynamic dimensions of tensor can have. This makes it possible to run +particular dynamic dimension of a tensor can have. This makes it possible to run such programs on hardware that don't support dynamic tensors but could support it if the upper bounds of tensor dimensions are known at the compilation time. With bounded dynamism, real time inference systems don't need to wait for -accumulation up to a particular batch size on these hardware. Bounded dynamism +accumulation up to a particular batch size on such hardware. Bounded dynamism also makes it possible to support programs whose intermediate tensor shapes depend on the inputs. For example, [`stablehlo.dynamic_broadcast_in_dim`](https://github.com/openxla/stablehlo/blob/ff55f9346d54e9e38de807a79f8ae03faffda274/stablehlo/dialect/StablehloOps.td#L1838) op but with statically known upper bounds of `output_dimensions` operand. Even on hardware that supports dynamic tensors, bounded dynamism can open up -opportunities of performance optimizations. +performance optimizations opportunities. # Recommendations @@ -43,16 +43,16 @@ opportunities of performance optimizations. * Consumers that only support bounded programs could first transform the given program to a bounded one through program analysis. * Consumers can choose to not support `get_dimension_size` and - `set_dimension_size` ops until there is a motivating use-case. + `set_dimension_size` ops until they have a motivating use-case. # Non Goals * Provide [value inference](https://github.com/openxla/xla/blob/9e05932a2ceadea080dc9494cfe9d735f94c4e68/xla/client/value_inference.h) - like utility for producers that want to generate the `set_dimension_size` op. + like utility for producers that want to generate the `set_dimension_size` ops. Value inference depends on constant folding for StableHLO ops which is a work - in progress currently. 
Value inference will be designed and provided - separately in the future. Note that producers that generate unbounded programs - don't need this in StableHLO. + in progress currently. Value inference will be designed separately in the + future. Note that producers that generate unbounded programs don't need this + in StableHLO. * Provide a transformation that converts StableHLO programs to bounded StableHLO programs, if possible. There is a plan to have such a conversion in MHLO and StableHLO users can utilize that by round tripping to MHLO. Details of this @@ -63,8 +63,8 @@ opportunities of performance optimizations. ## (P1) Bounded tensor type using the encoding field in the RankedTensorType -Bounds for a dynamic tensor are represented using the `TypeExtensionsAttr` in -the `RankedTensorType` encoding field. Bounds in `TypeExtensionsAttr` is an +Bounds of a dynamic tensor are represented using the `TypeExtensionsAttr` in the +`RankedTensorType` encoding field. Bounds in `TypeExtensionsAttr` is an `int64_t` array of size equal to rank of the tensor. Values corresponding to static dimensions must be `ShapedType::kDynamicSize` which is printed as `?` in the IR. Disallowing a static bound value for static dimensions makes the IR @@ -105,8 +105,8 @@ all the relevant ops. ## (P3) get\_dimension\_size / set\_dimension\_size ops -The `get_dimension_size` op takes a tensor and dimension index as operands and -returns the actual size of the dimension at runtime as an `i32` type scalar. +The `get_dimension_size` op takes a tensor and a dimension index and returns the +actual size of the dimension at runtime as an `i32` type scalar. 
The following example returns the size of the result after concatenating input that has up to `16` elements with self and returns the actual runtime size of @@ -126,13 +126,13 @@ func.func @self_concat_size(%data: tensor, %batch_size: tensor) -> tensor ## (P4) Prefer generic dynamic ops over set\_dimension\_size op Note that in the above example of `@dynamic_sum` function, the same computation -can be done by using the `real_dynamic_slice` op instead of the +can be performed by using the `real_dynamic_slice` op instead of the `set_dimension_size` op. The following example demonstrates this. ``` @@ -186,32 +186,32 @@ func.func @dynamic_sum(%data: tensor<4xi32>, %batch_size: tensor) -> tensor Originally, XLA HLO introduced `set_dimension_size` op as it neither had dynamic types nor dynamic ops. StableHLO dialect doesn't have these limitations and therefore new users don't need to make use of this low level op. The TensorFlow -and JAX teams believe belives that this hypothesis should be correct based on -their experiences so far. Dynamic operations `real_dynamic_slice` and -`dynamic_pad` can be used instead. +and JAX teams believe that this hypothesis should be correct based on their +experiences so far. Dynamic operations `real_dynamic_slice` and `dynamic_pad` +can be used instead. Use of dynamic ops over `set_dimension_size` op has various benefits: -* Greatly simplifies lowering from higher level frameworks to StableHLO as they - don't need to make use value inference to compute bounds or generate low level - ops. +* Greatly simplifies the lowering from higher level frameworks to StableHLO as + they don't need to make use of the value inference to compute bounds or + generate low level ops. * Opens up the opportunity to share conversion to bounded programs between frameworks and compilers. Therefore, frameworks can immediately target new hardware requiring bounded programs even if they didn't already support that. 
Data dependent bounded dynamism won't require any changes and input dependent bounded dynamism can be supported by just specifying the bounds on the inputs. -* Makes lowerings to StableHLO hardware agnostic and they don't depend on if the - compiler requires unbounded or bounded programs. +* Makes lowerings to StableHLO hardware agnostic as they don't depend on if the + hardware requires unbounded or bounded programs. * Reduces the potential confusion in making use of `set_dimension_size` as the - users are generally not familiar with this op and also the semantics are also - not intuitive. + users are generally not familiar with this op and also the semantics are not + intuitive. It is true that the `set_dimension_size` semantics allows making in-place updates. However, compilers should be making the trade-off between copy and -additional memory based on the hardware capabilities. It is also possible to -lower slice op to `set_dimension_size` op easily but going in the other -direction is tricky. That would require program analysis to make sure that the -logical size of the buffer is not increased later on. +additional memory based on the hardware capabilities and not the frameworks. It +is possible to lower slice op to `set_dimension_size` op easily but going in the +other direction is tricky. That would require program analysis to make sure that +the logical size of the buffer is not increased later on. # Alternatives Considered @@ -227,7 +227,7 @@ mix of StableHLO and MHLO programs. It is true that having the bounded type and `set_dimension_size` op introduces some complexity but given that the bounds are optional, users that don't care -about bounded dynamism don't need to worry about these. All the code complexity +about bounded dynamism don't encounter the complexity. All the code complexity is limited to the StableHLO shape functions. 
These also affects the op specifications but these should be intuitive to users based on the op semantics and making use of StableHLO shape functions should hide that as well. From e8bf5cdbe7c8518087b36371350e22f5fb563264 Mon Sep 17 00:00:00 2001 From: smit-hinsu Date: Thu, 8 Dec 2022 05:09:15 -0800 Subject: [PATCH 05/17] Describe set_dimension_size shape function --- rfcs/bounded_dynamism.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/rfcs/bounded_dynamism.md b/rfcs/bounded_dynamism.md index fa598543f39..e5b71dea4d3 100644 --- a/rfcs/bounded_dynamism.md +++ b/rfcs/bounded_dynamism.md @@ -138,7 +138,9 @@ In the following example, `set_dimension_size` op is used to set the logical size of the first dimension so that it performs a sum reduction on the first `batch_size` elements in the input. With data argument `[1, 2, 3, 4]` and batch\_size argument `2`, the following function returns `3` but it returns `6` -for the same data argument when the batch\_size is `3`. +for the same data argument when the batch\_size is `3`. Operand static dimension +or bound becomes bound of the `set_dimension_size` op's result. It is `4` in +this example. 
``` func.func @dynamic_sum(%data: tensor<4xi32>, %batch_size: tensor) -> tensor { From 421a5533c49036c359db3d0cb7152d9575489a38 Mon Sep 17 00:00:00 2001 From: smit-hinsu Date: Thu, 8 Dec 2022 18:20:52 -0800 Subject: [PATCH 06/17] Rename RFC to 20220925-bounded-dynamism.md --- rfcs/{bounded_dynamism.md => 20220925-bounded-dynamism.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename rfcs/{bounded_dynamism.md => 20220925-bounded-dynamism.md} (100%) diff --git a/rfcs/bounded_dynamism.md b/rfcs/20220925-bounded-dynamism.md similarity index 100% rename from rfcs/bounded_dynamism.md rename to rfcs/20220925-bounded-dynamism.md From 5526dc21475ba689018b5dddb0e7d9513be25cbb Mon Sep 17 00:00:00 2001 From: smit-hinsu Date: Thu, 8 Dec 2022 18:28:39 -0800 Subject: [PATCH 07/17] Fix strides in the example --- rfcs/20220925-bounded-dynamism.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/20220925-bounded-dynamism.md b/rfcs/20220925-bounded-dynamism.md index e5b71dea4d3..8c3e53dfc75 100644 --- a/rfcs/20220925-bounded-dynamism.md +++ b/rfcs/20220925-bounded-dynamism.md @@ -169,7 +169,7 @@ can be performed by using the `real_dynamic_slice` op instead of the func.func @dynamic_sum(%data: tensor<4xi32>, %batch_size: tensor) -> tensor { %start = stablehlo.constant dense<0> : tensor<1xi32> %limit = stablehlo.reshape %batch_size : (tensor) -> tensor<1xi32> - %strides = stablehlo.constant dense<0> : tensor<1xi32> + %strides = stablehlo.constant dense<1> : tensor<1xi32> %dynamic_data = stablehlo.real_dynamic_slice %data, %start, %limit, %strides : (tensor<4xi32>, tensor<1xi32>, tensor<1xi32>, tensor<1xi32>) -> tensor From 570d28cbff1045c30967e9cd654b847bbe2e4e0e Mon Sep 17 00:00:00 2001 From: smit-hinsu Date: Thu, 8 Dec 2022 19:09:50 -0800 Subject: [PATCH 08/17] Add example demonstrating differences between unbounded and bounded dynamism --- rfcs/20220925-bounded-dynamism.md | 61 +++++++++++++++++++++++++++++++ 1 file changed, 61 insertions(+) diff --git 
a/rfcs/20220925-bounded-dynamism.md b/rfcs/20220925-bounded-dynamism.md index 8c3e53dfc75..30674108dd3 100644 --- a/rfcs/20220925-bounded-dynamism.md +++ b/rfcs/20220925-bounded-dynamism.md @@ -192,6 +192,67 @@ and JAX teams believe that this hypothesis should be correct based on their experiences so far. Dynamic operations `real_dynamic_slice` and `dynamic_pad` can be used instead. +The following example demonstrates the differences between programs using +unbounded dynamism and bounded dynamism. + +``` +func.func @slice_with_unbounded_dynamism(%data: tensor<7xf32>, %start: tensor<1xi32>, %limit: tensor<1xi32>) -> tensor { + %strides = stablehlo.constant dense<1> : tensor<1xi32> + %result = stablehlo.real_dynamic_slice %data, %start, %limit, %strides + : (tensor<7xf32>, tensor<1xi32>, tensor<1xi32>, tensor<1xi32>) + -> tensor + func.return %result : tensor +} +``` + +``` +func.func @slice_with_bounded_dynamism(%data: tensor<7xf32>, %start: tensor<1xi32>, %limit: tensor<1xi32>) -> tensor> { + %zero = stablehlo.constant dense<0> : tensor<1xi32> + %size = stablehlo.constant dense<7> : tensor<1xi32> + + // Conditionally add size of the input data if the indices are negative. + %is_start_neg = stablehlo.compare LT, %start, %zero + : (tensor<1xi32>, tensor<1xi32>) -> tensor<1xi1> + %non_negative_start = "stablehlo.if"(%is_start_neg) ({ + %1 = stablehlo.add %start, %size : tensor<1xi32> + "stablehlo.return"(%1) : (tensor<1xi32>) -> () + }, { + "stablehlo.return"(%start) : (tensor<1xi32>) -> () + }) : (tensor<1xi1>) -> tensor<1xi32> + + %is_limit_neg = stablehlo.compare LT, %limit, %zero + : (tensor<1xi32>, tensor<1xi32>) -> tensor<1xi1> + %non_negative_limit = "stablehlo.if"(%is_limit_neg) ({ + %1 = stablehlo.add %limit, %size : tensor<1xi32> + "stablehlo.return"(%1) : (tensor<1xi32>) -> () + }, { + "stablehlo.return"(%limit) : (tensor<1xi32>) -> () + }) : (tensor<1xi1>) -> tensor<1xi32> + + // Add padding to avoid OOM access in the following slice op. 
+ %pad_value = stablehlo.constant dense<0.0> : tensor + %padded_data = stablehlo.pad %data, %pad_value, low = [0], high = [7], interior = [0] + : (tensor<7xf32>, tensor) -> tensor<14xf32> + + // Extract the largest possible slice starting at the start index. + %scalar_start = stablehlo.reshape %non_negative_start + : (tensor<1xi32>) -> tensor + %padded_result = stablehlo.dynamic_slice %padded_data, %scalar_start, sizes = [7] + : (tensor<14xf32>, tensor) -> tensor<7xf32> + + // Remove the extra elements extracted beyond the limit. + %slice_size = stablehlo.subtract %non_negative_limit, %non_negative_start + : tensor<1xi32> + %scalar_size = stablehlo.reshape %slice_size : (tensor<1xi32>) -> tensor + %result = stablehlo.set_dimension_size %padded_result, %scalar_size, dim = 0 + : (tensor<7xf32>, tensor) + -> tensor> + + func.return %result : tensor> +} + +``` + Use of dynamic ops over `set_dimension_size` op has various benefits: * Greatly simplifies the lowering from higher level frameworks to StableHLO as From 4a2fe75d9934cd8f01e594ab185ccd3d2e6b4e46 Mon Sep 17 00:00:00 2001 From: smit-hinsu Date: Wed, 14 Dec 2022 16:55:13 -0800 Subject: [PATCH 09/17] Fix typos --- rfcs/20220925-bounded-dynamism.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/rfcs/20220925-bounded-dynamism.md b/rfcs/20220925-bounded-dynamism.md index 30674108dd3..a9450bd2183 100644 --- a/rfcs/20220925-bounded-dynamism.md +++ b/rfcs/20220925-bounded-dynamism.md @@ -8,7 +8,7 @@ current status and rationale for bounded dynamism constructs in StableHLO and provides recommendations to StableHLO producers and consumers. In particular, this RFC doesn’t propose any further changes to the current state but this RFC should still be used to revisit those decisions given that those weren't -discussed publically. +discussed publicly. Bounded dynamism allows programs to represent the maximum runtime size that a particular dynamic dimension of a tensor can have. 
This makes it possible to run @@ -290,7 +290,7 @@ mix of StableHLO and MHLO programs. It is true that having the bounded type and `set_dimension_size` op introduces some complexity but given that the bounds are optional, users that don't care -about bounded dynamism don't encounter the complexity. All the code complexity -is limited to the StableHLO shape functions. These also affects the op +about bounded dynamism don't encounter complexity. All the code complexity is +limited to the StableHLO shape functions. These also affect the op specifications but these should be intuitive to users based on the op semantics and making use of StableHLO shape functions should hide that as well. From a8addb2ee7f029b9b9b379812880aef3659483f5 Mon Sep 17 00:00:00 2001 From: smit-hinsu Date: Thu, 15 Dec 2022 15:30:20 -0800 Subject: [PATCH 10/17] Rephrase set_dimension_size shape inference --- rfcs/20220925-bounded-dynamism.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/rfcs/20220925-bounded-dynamism.md b/rfcs/20220925-bounded-dynamism.md index a9450bd2183..42ae4db1ab4 100644 --- a/rfcs/20220925-bounded-dynamism.md +++ b/rfcs/20220925-bounded-dynamism.md @@ -138,9 +138,11 @@ In the following example, `set_dimension_size` op is used to set the logical size of the first dimension so that it performs a sum reduction on the first `batch_size` elements in the input. With data argument `[1, 2, 3, 4]` and batch\_size argument `2`, the following function returns `3` but it returns `6` -for the same data argument when the batch\_size is `3`. Operand static dimension -or bound becomes bound of the `set_dimension_size` op's result. It is `4` in -this example. +for the same data argument when the batch\_size is `3`. The `set_dimension_size` +op also sets the bound on the returned tensor. This bound depends on the operand's +static size if the operand shape is static. It is `4` in this example.
If the +operand dimension is not static, then the returned tensor has same type as the +operand. ``` func.func @dynamic_sum(%data: tensor<4xi32>, %batch_size: tensor) -> tensor { From 71d01753bf79a126eb1a8d5ead03d456689b8d8e Mon Sep 17 00:00:00 2001 From: smit-hinsu Date: Fri, 16 Dec 2022 12:54:49 -0800 Subject: [PATCH 11/17] Make requirements on types explicit based on jingpu@'s feedback --- rfcs/20220925-bounded-dynamism.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/rfcs/20220925-bounded-dynamism.md b/rfcs/20220925-bounded-dynamism.md index 42ae4db1ab4..fe2a33adb69 100644 --- a/rfcs/20220925-bounded-dynamism.md +++ b/rfcs/20220925-bounded-dynamism.md @@ -28,8 +28,10 @@ performance optimizations opportunities. ## StableHLO Producers * Producers should use bounded tensor type representation as described in P1. + Function arguments of public functions must be bounded. * Producers are encouraged to use unbounded dynamic operations for reasons - described in P4. + described in P4. Result types of these ops aren't required to have a bounded + type. * Producers can still use `get_dimension_size` and `set_dimension_size` ops described in P3 for the ease of transition to StableHLO and faster adoption of StableHLO. From 67ee8506fe03d199fded20a4f0563c054f550141 Mon Sep 17 00:00:00 2001 From: smit-hinsu Date: Fri, 16 Dec 2022 19:05:25 -0800 Subject: [PATCH 12/17] Improve wording and code examples --- rfcs/20220925-bounded-dynamism.md | 75 ++++++++++++++++++++----------- 1 file changed, 48 insertions(+), 27 deletions(-) diff --git a/rfcs/20220925-bounded-dynamism.md b/rfcs/20220925-bounded-dynamism.md index fe2a33adb69..ff33e8568d6 100644 --- a/rfcs/20220925-bounded-dynamism.md +++ b/rfcs/20220925-bounded-dynamism.md @@ -27,12 +27,11 @@ performance optimizations opportunities. ## StableHLO Producers -* Producers should use bounded tensor type representation as described in P1. - Function arguments of public functions must be bounded. 
-* Producers are encouraged to use unbounded dynamic operations for reasons - described in P4. Result types of these ops aren't required to have a bounded - type. -* Producers can still use `get_dimension_size` and `set_dimension_size` ops +* Producers should use the bounded tensor type representation as described in + P1. Function arguments of public functions should be bounded. +* Producers are encouraged to use the unbounded dynamic operations as described + in P4. Result types of these ops aren't required to have a bounded type. +* Proucers can still use `get_dimension_size` and `set_dimension_size` ops described in P3 for the ease of transition to StableHLO and faster adoption of StableHLO. @@ -56,9 +55,9 @@ performance optimizations opportunities. future. Note that producers that generate unbounded programs don't need this in StableHLO. * Provide a transformation that converts StableHLO programs to bounded StableHLO - programs, if possible. There is a plan to have such a conversion in MHLO and - StableHLO users can utilize that by round tripping to MHLO. Details of this - are outside the scope of this RFC. + programs. There is a plan to have such a conversion in MHLO and StableHLO + users can utilize that by round tripping to MHLO. Details of this are outside + the scope of this RFC. # Detailed Proposal @@ -76,7 +75,7 @@ bound value is static. The following type represents a 2D tensor with up to `3` rows and exactly `5` columns. -``` +```mlir tensor> ``` @@ -87,12 +86,31 @@ that is lower than the static size in the other type are not compatible. The above example type is compatible with `tensor<2x5xf32>` but not with `tensor<7x5xf32>` as it doesn’t respect the bound `3` on the first dimension. -Currently, the StableHLO dialect is using the MLIR core ranked tensor type to -represent bounds. It should be noted that there is a plan to introduce a custom -StableHLO type in the future that could natively support bounds along with -custom pretty printing format. 
There will be a separate RFC on this. Also, the -proposal will follow StableHLO backward compatibility policies so it is safe to -use this type currently. +```mlir +func.func @bounds_compatability(%arg0: tensor>, + %arg1: tensor>, + %arg2: tensor>, + %arg3: tensor<2xf32>, + %arg4: tensor<4xf32>) { + // %arg0 is compatible with %arg1, %arg2 and %arg3 as bounded types could have + // tensor<2xf32> type during runtime. + %0 = "stablehlo.add"(%arg0, %arg1) : (tensor>, tensor>) -> tensor> + %1 = "stablehlo.add"(%arg0, %arg2) : (tensor>, tensor>) -> tensor> + %2 = "stablehlo.add"(%arg0, %arg3) : (tensor>, tensor<2xf32>) -> tensor<2xf32> + + // This is illegal as operands have incompatible types. %arg0 can either be + // tensor<0xf32>, tensor<1xf32>, tensor<2xf32> or tensor<3xf32> at runtime, + // none of these are compatible with tensor<4xf32> + %3 = "stablehlo.add"(%arg0, %arg4) : (tensor>, tensor<4xf32>) -> tensor<*xf32> + func.return +} +``` + +Currently, the StableHLO dialect uses the MLIR ranked tensor type to represent +bounds. In the future we plan to introduce StableHLO type which supports bounds, +along with a custom pretty printing format. There will be a separate RFC on +this. Also, the proposal will follow StableHLO backward compatibility policies +so it is safe to use this type currently. ## (P2) StableHLO op semantics with bounded operands or results @@ -102,8 +120,8 @@ them. However, the result types need to be compatible with the inferred result types. This allows result types to be more generic or specific as long as it is compatible with the inferred type. -Separately, StableHLO specification will be updated to cover bounded types for -all the relevant ops. +Separately, the StableHLO specification will be updated to cover bounded types +for all the relevant ops. 
## (P3) get\_dimension\_size / set\_dimension\_size ops @@ -114,7 +132,7 @@ The following example returns the size of the result after concatenating input that has up to `16` elements with self and returns the actual runtime size of the concatenation result. -``` +```mlir func.func @self_concat_size(%data: tensor>) -> tensor { %concat = "stablehlo.concatenate"(%data, %data) {dimension = 0 : i64} : (tensor>, @@ -146,7 +164,7 @@ static size if the operand shape is static. It is `4` in this example. If the operand dimension is not static, then the returned tensor has same type as the operand. -``` +```mlir func.func @dynamic_sum(%data: tensor<4xi32>, %batch_size: tensor) -> tensor { %dynamic_data = stablehlo.set_dimension_size %data, %batch_size, dim = 0 : (tensor<4xi32>, tensor) -> tensor> @@ -163,13 +181,16 @@ func.func @dynamic_sum(%data: tensor<4xi32>, %batch_size: tensor) -> tensor } ``` -## (P4) Prefer generic dynamic ops over set\_dimension\_size op +## (P4) Comparison with unbounded dynamic ops -Note that in the above example of `@dynamic_sum` function, the same computation -can be performed by using the `real_dynamic_slice` op instead of the -`set_dimension_size` op. The following example demonstrates this. +In addition to `set_dimension_size` and `get_dimension_size` ops, StableHLO +producers may also use unbounded dynamic ops like `real_dynamic_slice` and +`dynamic_pad` to perform operations on dynamically shaped tensors. For example, +the above `@dynamic_sum` computation can be performed by using the +`real_dynamic_slice` op instead of the `set_dimension_size` op. With that, the +above example can be rewritten as, -``` +```mlir func.func @dynamic_sum(%data: tensor<4xi32>, %batch_size: tensor) -> tensor { %start = stablehlo.constant dense<0> : tensor<1xi32> %limit = stablehlo.reshape %batch_size : (tensor) -> tensor<1xi32> @@ -199,7 +220,7 @@ can be used instead. 
The following example demonstrates the differences between programs using unbounded dynamism and bounded dynamism. -``` +```mlir func.func @slice_with_unbounded_dynamism(%data: tensor<7xf32>, %start: tensor<1xi32>, %limit: tensor<1xi32>) -> tensor { %strides = stablehlo.constant dense<1> : tensor<1xi32> %result = stablehlo.real_dynamic_slice %data, %start, %limit, %strides @@ -209,7 +230,7 @@ func.func @slice_with_unbounded_dynamism(%data: tensor<7xf32>, %start: tensor<1x } ``` -``` +```mlir func.func @slice_with_bounded_dynamism(%data: tensor<7xf32>, %start: tensor<1xi32>, %limit: tensor<1xi32>) -> tensor> { %zero = stablehlo.constant dense<0> : tensor<1xi32> %size = stablehlo.constant dense<7> : tensor<1xi32> From 71ad97693bf18f9a09e4240475e803907ee5bb64 Mon Sep 17 00:00:00 2001 From: smit-hinsu Date: Mon, 19 Dec 2022 01:19:31 -0800 Subject: [PATCH 13/17] Address review comments from Eugene --- rfcs/20220925-bounded-dynamism.md | 176 ++++++++++++------------------ 1 file changed, 69 insertions(+), 107 deletions(-) diff --git a/rfcs/20220925-bounded-dynamism.md b/rfcs/20220925-bounded-dynamism.md index ff33e8568d6..5fb4628917a 100644 --- a/rfcs/20220925-bounded-dynamism.md +++ b/rfcs/20220925-bounded-dynamism.md @@ -6,36 +6,34 @@ dialect](https://github.com/tensorflow/mlir-hlo#meta-hlo-dialect-mhlo), including inheriting the type and some ops. This RFC aims to describe the current status and rationale for bounded dynamism constructs in StableHLO and provides recommendations to StableHLO producers and consumers. In particular, -this RFC doesn’t propose any further changes to the current state but this RFC -should still be used to revisit those decisions given that those weren't -discussed publicly. +this RFC doesn’t propose any further changes to the current state. Bounded dynamism allows programs to represent the maximum runtime size that a particular dynamic dimension of a tensor can have. 
This makes it possible to run -such programs on hardware that don't support dynamic tensors but could support +such programs on hardware that doesn't support dynamic tensors but could support it if the upper bounds of tensor dimensions are known at the compilation time. -With bounded dynamism, real time inference systems don't need to wait for -accumulation up to a particular batch size on such hardware. Bounded dynamism -also makes it possible to support programs whose intermediate tensor shapes -depend on the inputs. For example, -[`stablehlo.dynamic_broadcast_in_dim`](https://github.com/openxla/stablehlo/blob/ff55f9346d54e9e38de807a79f8ae03faffda274/stablehlo/dialect/StablehloOps.td#L1838) -op but with statically known upper bounds of `output_dimensions` operand. Even -on hardware that supports dynamic tensors, bounded dynamism can open up -performance optimizations opportunities. +Applications include: -# Recommendations +* Real time inference without having to wait for accumulation up to a particular + batch size. +* Programs whose intermediate tensor shapes depend on the operands. For example, + [`stablehlo.dynamic_broadcast_in_dim`](https://github.com/openxla/stablehlo/blob/ff55f9346d54e9e38de807a79f8ae03faffda274/stablehlo/dialect/StablehloOps.td#L1838) + op but with statically known upper bounds of `output_dimensions` operand. +* Bounded dynamism can also open up performance optimizations opportunities. -## StableHLO Producers +## Recommendations -* Producers should use the bounded tensor type representation as described in - P1. Function arguments of public functions should be bounded. +### StableHLO Producers + +* Producers can use the bounded tensor type representation as described in + P1. * Producers are encouraged to use the unbounded dynamic operations as described in P4. Result types of these ops aren't required to have a bounded type. 
-* Proucers can still use `get_dimension_size` and `set_dimension_size` ops +* Producers can still use `get_dimension_size` and `set_dimension_size` ops described in P3 for the ease of transition to StableHLO and faster adoption of StableHLO. -## StableHLO Consumers +### StableHLO Consumers * Consumers should aim to support unbounded programs and can optionally make use of bounds on tensors for optimizations. @@ -46,23 +44,22 @@ performance optimizations opportunities. * Consumers can choose to not support `get_dimension_size` and `set_dimension_size` ops until they have a motivating use-case. -# Non Goals +## Non Goals * Provide [value inference](https://github.com/openxla/xla/blob/9e05932a2ceadea080dc9494cfe9d735f94c4e68/xla/client/value_inference.h) like utility for producers that want to generate the `set_dimension_size` ops. Value inference depends on constant folding for StableHLO ops which is a work - in progress currently. Value inference will be designed separately in the - future. Note that producers that generate unbounded programs don't need this + in progress currently. There will be separate RFC for value inference subject + to separate approval. Note that producers that generate unbounded programs + don't need this in StableHLO. * Provide a transformation that converts StableHLO programs to bounded StableHLO - programs. There is a plan to have such a conversion in MHLO and StableHLO - users can utilize that by round tripping to MHLO. Details of this are outside - the scope of this RFC. - -# Detailed Proposal + programs. However, there is a plan to have such a conversion in MHLO, although + the details of this are out of scope of this RFC. +## Detailed Proposal -## (P1) Bounded tensor type using the encoding field in the RankedTensorType +### (P1) Bounded tensor type using the encoding field in the RankedTensorType Bounds of a dynamic tensor are represented using the `TypeExtensionsAttr` in the `RankedTensorType` encoding field. 
Bounds in `TypeExtensionsAttr` is an @@ -72,36 +69,33 @@ the IR. Disallowing a static bound value for static dimensions makes the IR canonical and makes it possible to infer that the dimension is dynamic if the bound value is static. -The following type represents a 2D tensor with up to `3` rows and exactly `5` -columns. +The following type represents a 2D tensor, with the size of the 0th dimension +being up to 3 and the size of the 1th dimension being exactly 5 ```mlir -tensor> +tensor> ``` -Type compatibility in StableHLO also checks for compatibility of the bounds. Two -types are compatible if there exists a runtime tensor that could match both the -types. So, two types with different bounds are compatible but a type with bounds -that is lower than the static size in the other type are not compatible. The -above example type is compatible with `tensor<2x5xf32>` but not with +Type compatibility in StableHLO also checks for compatibility of the bounds. +The example type above is compatible with `tensor<2x5xf32>` but not with `tensor<7x5xf32>` as it doesn’t respect the bound `3` on the first dimension. ```mlir -func.func @bounds_compatability(%arg0: tensor>, - %arg1: tensor>, - %arg2: tensor>, +func.func @bounds_compatibility(%arg0: tensor>, + %arg1: tensor>, + %arg2: tensor>, %arg3: tensor<2xf32>, %arg4: tensor<4xf32>) { // %arg0 is compatible with %arg1, %arg2 and %arg3 as bounded types could have // tensor<2xf32> type during runtime. - %0 = "stablehlo.add"(%arg0, %arg1) : (tensor>, tensor>) -> tensor> - %1 = "stablehlo.add"(%arg0, %arg2) : (tensor>, tensor>) -> tensor> - %2 = "stablehlo.add"(%arg0, %arg3) : (tensor>, tensor<2xf32>) -> tensor<2xf32> + %0 = "stablehlo.add"(%arg0, %arg1) : (tensor>, tensor>) -> tensor> + %1 = "stablehlo.add"(%arg0, %arg2) : (tensor>, tensor>) -> tensor> + %2 = "stablehlo.add"(%arg0, %arg3) : (tensor>, tensor<2xf32>) -> tensor<2xf32> // This is illegal as operands have incompatible types. 
%arg0 can either be // tensor<0xf32>, tensor<1xf32>, tensor<2xf32> or tensor<3xf32> at runtime, // none of these are compatible with tensor<4xf32> - %3 = "stablehlo.add"(%arg0, %arg4) : (tensor>, tensor<4xf32>) -> tensor<*xf32> + %3 = "stablehlo.add"(%arg0, %arg4) : (tensor>, tensor<4xf32>) -> tensor<*xf32> func.return } ``` @@ -110,10 +104,9 @@ Currently, the StableHLO dialect uses the MLIR ranked tensor type to represent bounds. In the future we plan to introduce StableHLO type which supports bounds, along with a custom pretty printing format. There will be a separate RFC on this. Also, the proposal will follow StableHLO backward compatibility policies -so it is safe to use this type currently. +so it is safe to use `TypeExtensionsAttr` currently. - -## (P2) StableHLO op semantics with bounded operands or results +### (P2) StableHLO op semantics with bounded operands or results All ops that support dynamic operands or results can have bounds specified for them. However, the result types need to be compatible with the inferred result @@ -123,14 +116,14 @@ compatible with the inferred type. Separately, the StableHLO specification will be updated to cover bounded types for all the relevant ops. -## (P3) get\_dimension\_size / set\_dimension\_size ops +### (P3) get\_dimension\_size / set\_dimension\_size ops The `get_dimension_size` op takes a tensor and a dimension index and returns the -actual size of the dimension at runtime as an `i32` type scalar. +runtime size as `tensor`. The following example returns the size of the result after concatenating input -that has up to `16` elements with self and returns the actual runtime size of -the concatenation result. +that has up to `16` elements with self and returns the runtime size of the +concatenation result. 
```mlir func.func @self_concat_size(%data: tensor>) -> tensor { @@ -146,23 +139,22 @@ func.func @self_concat_size(%data: tensor, %batch_size: tensor) -> tensor { @@ -173,7 +165,7 @@ func.func @dynamic_sum(%data: tensor<4xi32>, %batch_size: tensor) -> tensor %sum = "stablehlo.reduce"(%dynamic_data, %zero) ({ ^bb0(%arg1: tensor, %arg2: tensor): %add = stablehlo.add %arg1, %arg2 : tensor - "stablehlo.return"(%add) : (tensor) -> () + stablehlo.return %add : tensor }) {dimensions = dense<[0]> : tensor<1xi64>} : (tensor>, tensor) -> tensor @@ -181,7 +173,7 @@ func.func @dynamic_sum(%data: tensor<4xi32>, %batch_size: tensor) -> tensor } ``` -## (P4) Comparison with unbounded dynamic ops +### (P4) Aspirational: Migration to unbounded dynamism In addition to `set_dimension_size` and `get_dimension_size` ops, StableHLO producers may also use unbounded dynamic ops like `real_dynamic_slice` and @@ -203,19 +195,19 @@ func.func @dynamic_sum(%data: tensor<4xi32>, %batch_size: tensor) -> tensor %sum = "stablehlo.reduce"(%dynamic_data, %zero) ({ ^bb0(%arg1: tensor, %arg2: tensor): %add = stablehlo.add %arg1, %arg2 : tensor - "stablehlo.return"(%add) : (tensor) -> () + stablehlo.return %add : tensor }) {dimensions = dense<[0]> : tensor<1xi64>} : (tensor, tensor) -> tensor func.return %sum : tensor } ``` -Originally, XLA HLO introduced `set_dimension_size` op as it neither had dynamic +Originally, HLO introduced `set_dimension_size` op as it neither had dynamic types nor dynamic ops. StableHLO dialect doesn't have these limitations and -therefore new users don't need to make use of this low level op. The TensorFlow -and JAX teams believe that this hypothesis should be correct based on their -experiences so far. Dynamic operations `real_dynamic_slice` and `dynamic_pad` -can be used instead. +therefore new users don't need to make use of this low level op unless they are +moving from HLO or MHLO to StableHLO. 
The TensorFlow and JAX teams believe that +this hypothesis should be correct based on their experiences so far. Dynamic +operations `real_dynamic_slice` and `dynamic_pad` can be used instead. The following example demonstrates the differences between programs using unbounded dynamism and bounded dynamism. @@ -232,42 +224,18 @@ func.func @slice_with_unbounded_dynamism(%data: tensor<7xf32>, %start: tensor<1x ```mlir func.func @slice_with_bounded_dynamism(%data: tensor<7xf32>, %start: tensor<1xi32>, %limit: tensor<1xi32>) -> tensor> { - %zero = stablehlo.constant dense<0> : tensor<1xi32> - %size = stablehlo.constant dense<7> : tensor<1xi32> - - // Conditionally add size of the input data if the indices are negative. - %is_start_neg = stablehlo.compare LT, %start, %zero - : (tensor<1xi32>, tensor<1xi32>) -> tensor<1xi1> - %non_negative_start = "stablehlo.if"(%is_start_neg) ({ - %1 = stablehlo.add %start, %size : tensor<1xi32> - "stablehlo.return"(%1) : (tensor<1xi32>) -> () - }, { - "stablehlo.return"(%start) : (tensor<1xi32>) -> () - }) : (tensor<1xi1>) -> tensor<1xi32> - - %is_limit_neg = stablehlo.compare LT, %limit, %zero - : (tensor<1xi32>, tensor<1xi32>) -> tensor<1xi1> - %non_negative_limit = "stablehlo.if"(%is_limit_neg) ({ - %1 = stablehlo.add %limit, %size : tensor<1xi32> - "stablehlo.return"(%1) : (tensor<1xi32>) -> () - }, { - "stablehlo.return"(%limit) : (tensor<1xi32>) -> () - }) : (tensor<1xi1>) -> tensor<1xi32> - // Add padding to avoid OOM access in the following slice op. %pad_value = stablehlo.constant dense<0.0> : tensor %padded_data = stablehlo.pad %data, %pad_value, low = [0], high = [7], interior = [0] : (tensor<7xf32>, tensor) -> tensor<14xf32> // Extract the largest possible slice starting at the start index. 
- %scalar_start = stablehlo.reshape %non_negative_start - : (tensor<1xi32>) -> tensor + %scalar_start = stablehlo.reshape %start : (tensor<1xi32>) -> tensor %padded_result = stablehlo.dynamic_slice %padded_data, %scalar_start, sizes = [7] : (tensor<14xf32>, tensor) -> tensor<7xf32> // Remove the extra elements extracted beyond the limit. - %slice_size = stablehlo.subtract %non_negative_limit, %non_negative_start - : tensor<1xi32> + %slice_size = stablehlo.subtract %limit, %start : tensor<1xi32> %scalar_size = stablehlo.reshape %slice_size : (tensor<1xi32>) -> tensor %result = stablehlo.set_dimension_size %padded_result, %scalar_size, dim = 0 : (tensor<7xf32>, tensor) @@ -281,13 +249,7 @@ func.func @slice_with_bounded_dynamism(%data: tensor<7xf32>, %start: tensor<1xi3 Use of dynamic ops over `set_dimension_size` op has various benefits: * Greatly simplifies the lowering from higher level frameworks to StableHLO as - they don't need to make use of the value inference to compute bounds or - generate low level ops. -* Opens up the opportunity to share conversion to bounded programs between - frameworks and compilers. Therefore, frameworks can immediately target new - hardware requiring bounded programs even if they didn't already support that. - Data dependent bounded dynamism won't require any changes and input dependent - bounded dynamism can be supported by just specifying the bounds on the inputs. + they don't need to compute upper bounds of dynamic dimensions. * Makes lowerings to StableHLO hardware agnostic as they don't depend on if the hardware requires unbounded or bounded programs. * Reduces the potential confusion in making use of `set_dimension_size` as the @@ -299,11 +261,11 @@ updates. However, compilers should be making the trade-off between copy and additional memory based on the hardware capabilities and not the frameworks. It is possible to lower slice op to `set_dimension_size` op easily but going in the other direction is tricky. 
That would require program analysis to make sure that -the logical size of the buffer is not increased later on. +the size of the buffer is not increased later on. -# Alternatives Considered +## Alternatives Considered -## Not having bounded type and/or set\_dimension\_size op +### Not having bounded type and/or set\_dimension\_size op Given the recommendation of using unbounded dynamism, could StableHLO just have a function attribute to store the input bounds instead of having bounded type From db307908728c51addc3967f8c04ab021a444d13f Mon Sep 17 00:00:00 2001 From: smit-hinsu Date: Mon, 19 Dec 2022 17:44:01 -0800 Subject: [PATCH 14/17] Address review comments --- rfcs/20220925-bounded-dynamism.md | 73 ++++++++++--------------------- 1 file changed, 24 insertions(+), 49 deletions(-) diff --git a/rfcs/20220925-bounded-dynamism.md b/rfcs/20220925-bounded-dynamism.md index 5fb4628917a..24dafc896ec 100644 --- a/rfcs/20220925-bounded-dynamism.md +++ b/rfcs/20220925-bounded-dynamism.md @@ -4,9 +4,8 @@ expresses ML computations. It has been originally bootstrapped from the [MHLO dialect](https://github.com/tensorflow/mlir-hlo#meta-hlo-dialect-mhlo), including inheriting the type and some ops. This RFC aims to describe the -current status and rationale for bounded dynamism constructs in StableHLO and -provides recommendations to StableHLO producers and consumers. In particular, -this RFC doesn’t propose any further changes to the current state. +current status and rationale for bounded dynamism constructs in StableHLO. In +particular, this RFC doesn’t propose any further changes to the current state. Bounded dynamism allows programs to represent the maximum runtime size that a particular dynamic dimension of a tensor can have. This makes it possible to run @@ -21,29 +20,6 @@ Applications include: op but with statically known upper bounds of `output_dimensions` operand. * Bounded dynamism can also open up performance optimizations opportunities. 
-## Recommendations
-
-### StableHLO Producers
-
-*   Producers can use the bounded tensor type representation as described in
-    P1.
-*   Producers are encouraged to use the unbounded dynamic operations as
-    described in P4. Result types of these ops aren't required to have a
-    bounded type.
-*   Producers can still use `get_dimension_size` and `set_dimension_size` ops
-    described in P3 for the ease of transition to StableHLO and faster adoption
-    of StableHLO.
-
-### StableHLO Consumers
-
-*   Consumers should aim to support unbounded programs and can optionally make
-    use of bounds on tensors for optimizations.
-*   Consumers that support unbounded programs can safely ignore the bounds
-    completely without affecting the correctness.
-*   Consumers that only support bounded programs could first transform the given
-    program to a bounded one through program analysis.
-*   Consumers can choose to not support `get_dimension_size` and
-    `set_dimension_size` ops until they have a motivating use-case.
-
 ## Non Goals

 *   Provide [value inference](https://github.com/openxla/xla/blob/9e05932a2ceadea080dc9494cfe9d735f94c4e68/xla/client/value_inference.h)
@@ -70,15 +46,16 @@
 canonical and makes it possible to infer that the dimension is dynamic if the
 bound value is static.

 The following type represents a 2D tensor, with the size of the 0th dimension
-being up to 3 and the size of the 1th dimension being exactly 5
+being up to 3 and the size of the 1st dimension being exactly 5:

 ```mlir
 tensor<?x5xf32, #stablehlo.type_extensions<bounds = [3, -1]>>
 ```

 Type compatibility in StableHLO also checks for compatibility of the bounds.
-The example type above is compatible with `tensor<2x5xf32>` but not with
-`tensor<7x5xf32>` as it doesn’t respect the bound `3` on the first dimension.
+For example, the example type above is compatible with `tensor<2x5xf32>` but not
+with `tensor<7x5xf32>` as it doesn’t respect the bound `3` on the first
+dimension.
 ```mlir
 func.func @bounds_compatibility(%arg0: tensor<?x5xf32, #stablehlo.type_extensions<bounds = [3, -1]>>,
@@ -256,28 +233,26 @@ Use of dynamic ops over `set_dimension_size` op has various benefits:
     users are generally not familiar with this op and also the semantics are not
     intuitive.

-It is true that the `set_dimension_size` semantics allows making in-place
-updates. However, compilers should be making the trade-off between copy and
-additional memory based on the hardware capabilities and not the frameworks. It
-is possible to lower slice op to `set_dimension_size` op easily but going in the
-other direction is tricky. That would require program analysis to make sure that
-the size of the buffer is not increased later on.
+Benefit of the `set_dimension_size` op:
+
+Given that the runtime size argument of `set_dimension_size` op is required to
+be less than or equal to the static size or bound, compiler could separately
+track runtime size of the tensor and keep a buffer of fixed size according to
+the bound. This helps avoid any data movements for the `set_dimension_size` op
+at the cost of extra memory. However, compilers should be making the trade-off
+between copy and additional memory based on the hardware capabilities and not
+the frameworks. It is possible to lower slice op to `set_dimension_size` op
+easily but going in the other direction is tricky. That would require program
+analysis to make sure that the size of the buffer is not increased later on.

 ## Alternatives Considered

 ### Not having bounded type and/or set\_dimension\_size op

 Given the recommendation of using unbounded dynamism, could StableHLO just have
-a function attribute to store the input bounds instead of having bounded type
-and `set_dimension_size` op? This might be possible but this will pose
-significant challenges for existing users generating bounded programs.
Current
-proposal allows users to incrementally move to the recommended approach for new
-implementations while immediately making use of StableHLO without generating a
-mix of StableHLO and MHLO programs.
-
-It is true that having the bounded type and `set_dimension_size` op introduces
-some complexity but given that the bounds are optional, users that don't care
-about bounded dynamism don't encounter complexity. All the code complexity is
-limited to the StableHLO shape functions. These also affects the op
-specifications but these should be intuitive to users based on the op semantics
-and making use of StableHLO shape functions should hide that as well.
+Given the use of unbounded dynamism, could StableHLO just have a function
+attribute to store the input bounds instead of having bounded type and
+`set_dimension_size` op? This might be possible but this will pose significant
+challenges for existing users generating bounded programs. Current proposal
+allows users to incrementally move to unbounded dynamism for new implementations
+while immediately making use of StableHLO without generating a mix of StableHLO
+and MHLO programs.
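The bound-compatibility rule discussed in the patch above — two types are compatible if some runtime tensor could match both — can be sketched as follows. This is an illustrative model only (`dims_compatible` and `types_compatible` are invented names, not StableHLO code); dimensions are modeled as `(size, bound)` pairs where `-1` means dynamic/unbounded, matching the `TypeExtensionsAttr` convention:

```python
# Illustrative sketch of the StableHLO bound-compatibility rule; not real
# StableHLO code. A dimension is (size, bound): size -1 means dynamic,
# bound -1 means no bound recorded.

def dims_compatible(a, b):
    """Two dimensions are compatible if some runtime size matches both."""
    size_a, bound_a = a
    size_b, bound_b = b
    # Two static sizes must match exactly.
    if size_a != -1 and size_b != -1:
        return size_a == size_b
    # A static size must not exceed the other dimension's bound, if any.
    if size_a != -1:
        return bound_b == -1 or size_a <= bound_b
    if size_b != -1:
        return bound_a == -1 or size_b <= bound_a
    # Two dynamic dimensions are always compatible, even with different bounds.
    return True

def types_compatible(t1, t2):
    """Ranked types are compatible if ranks match and every dim pair does."""
    return len(t1) == len(t2) and all(
        dims_compatible(a, b) for a, b in zip(t1, t2))

# tensor<?x5xf32, bounds = [3, -1]> vs tensor<2x5xf32>: compatible.
bounded = [(-1, 3), (5, -1)]
assert types_compatible(bounded, [(2, -1), (5, -1)])
# tensor<?x5xf32, bounds = [3, -1]> vs tensor<7x5xf32>: 7 exceeds bound 3.
assert not types_compatible(bounded, [(7, -1), (5, -1)])
```

This reproduces the example from the patch: the bounded type accepts `tensor<2x5xf32>` but rejects `tensor<7x5xf32>`.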
From d10d65457b721936912a765ca68654ba6dea2330 Mon Sep 17 00:00:00 2001
From: smit-hinsu
Date: Tue, 20 Dec 2022 03:48:08 -0800
Subject: [PATCH 15/17] Address Milad's comments

---
 rfcs/20220925-bounded-dynamism.md | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/rfcs/20220925-bounded-dynamism.md b/rfcs/20220925-bounded-dynamism.md
index 24dafc896ec..657ac583da9 100644
--- a/rfcs/20220925-bounded-dynamism.md
+++ b/rfcs/20220925-bounded-dynamism.md
@@ -223,15 +223,12 @@ func.func @slice_with_bounded_dynamism(%data: tensor<7xf32>, %start: tensor<1xi3
 ```

-Use of dynamic ops over `set_dimension_size` op has various benefits:
+Use of dynamic ops over `set_dimension_size` op has a couple of benefits:

 *   Greatly simplifies the lowering from higher level frameworks to StableHLO as
     they don't need to compute upper bounds of dynamic dimensions.
 *   Makes lowerings to StableHLO hardware agnostic as they don't depend on if the
     hardware requires unbounded or bounded programs.
-*   Reduces the potential confusion in making use of `set_dimension_size` as the
-    users are generally not familiar with this op and also the semantics are not
-    intuitive.
 Benefit of the `set_dimension_size` op:

From c01f56848f896f9ff207127979605e492dc160fd Mon Sep 17 00:00:00 2001
From: smit-hinsu
Date: Tue, 20 Dec 2022 14:52:43 -0800
Subject: [PATCH 16/17] Clarify P4 benefits section is about unbounded dynamism

---
 rfcs/20220925-bounded-dynamism.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/rfcs/20220925-bounded-dynamism.md b/rfcs/20220925-bounded-dynamism.md
index 657ac583da9..3137f04a5c2 100644
--- a/rfcs/20220925-bounded-dynamism.md
+++ b/rfcs/20220925-bounded-dynamism.md
@@ -223,7 +223,8 @@ func.func @slice_with_bounded_dynamism(%data: tensor<7xf32>, %start: tensor<1xi3
 ```

-Use of dynamic ops over `set_dimension_size` op has a couple of benefits:
+Use of unbounded dynamic ops over `set_dimension_size` op has a couple of
+benefits:

 *   Greatly simplifies the lowering from higher level frameworks to StableHLO as
     they don't need to compute upper bounds of dynamic dimensions.

From 120449b0242c75ef47f7c6327261abb47d1e1004 Mon Sep 17 00:00:00 2001
From: smit-hinsu
Date: Wed, 21 Dec 2022 18:45:40 -0800
Subject: [PATCH 17/17] Replace hardware with platform for bounded dynamism
 motivation

---
 rfcs/20220925-bounded-dynamism.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/rfcs/20220925-bounded-dynamism.md b/rfcs/20220925-bounded-dynamism.md
index 3137f04a5c2..f49643a94be 100644
--- a/rfcs/20220925-bounded-dynamism.md
+++ b/rfcs/20220925-bounded-dynamism.md
@@ -9,7 +9,7 @@
 particular, this RFC doesn’t propose any further changes to the current state.

 Bounded dynamism allows programs to represent the maximum runtime size that a
 particular dynamic dimension of a tensor can have. This makes it possible to run
-such programs on hardware that doesn't support dynamic tensors but could support
+such programs on platforms that don't support dynamic tensors but could support
 it if the upper bounds of tensor dimensions are known at the compilation time.
 Applications include:
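The motivation the final patch rewords — running dynamically sized data on platforms that require static shapes — can be made concrete with a small sketch. All names here (`BATCH_BOUND`, `to_bounded`, `bounded_sum`) are invented for illustration, not StableHLO or XLA APIs: the buffer is always allocated at the compile-time bound, and the actual runtime size travels alongside it, which is the same information the bounded type and `set_dimension_size` encode at the IR level:

```python
# Illustrative sketch (invented names; not StableHLO code): emulating a
# dynamic batch dimension on a platform that only supports static shapes.

BATCH_BOUND = 8  # static upper bound, known at compilation time

def to_bounded(batch):
    """Pad a batch up to the bound and remember its real runtime size."""
    size = len(batch)
    assert size <= BATCH_BOUND, "runtime size must respect the bound"
    return batch + [0.0] * (BATCH_BOUND - size), size

def bounded_sum(padded, size):
    """A reduction that only reads the logically valid prefix."""
    return sum(padded[:size])

# The last batch of a dataset is smaller than the bound.
padded, size = to_bounded([1.0, 2.0, 3.0, 4.0])
assert len(padded) == BATCH_BOUND  # statically shaped for the platform
assert bounded_sum(padded, size) == 10.0
```

The padding values never influence the result because every consumer reads only the `size`-element prefix; a compiler for a bounded platform makes the same guarantee when it lowers `set_dimension_size`.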