[RFC] Custom ops support #1462

silvasean · 2022-10-05T11:43:45Z

Hey folks, custom ops has been a common feature request, and I'd like to share a design that we've been thinking about.

Let's talk first about the functionality we want to unlock for users. Generally speaking, custom ops are a way to "tunnel" information from the frontend to the backend. This can be something like a custom fusion, or something more advanced, like a representation of quantization or sparsity. Note that ops can also carry various different types of information, such as !torch.int, !torch.string, etc. (which we would generally need to be constrained to statically known values to be useful in the compiler). So an op like my.custom_quantize %foo, %quantization_info, ... etc. could actually carry a fairly elaborate representation of quantization/sparsity/etc. in the %quantization_info string.

The key requirement here is the desire to support new ops without recompiling Torch-MLIR. We've had some initial efforts in the custom op direction, but they always required recompiling Torch-MLIR. Put another way, the goal is to generalize the current way of handling ops so that there is no "hardcoded C++" for each op. Here by "support new ops" we really mean lowering to the backend contract (see architecture.md). Lowering to Linalg/etc. generally is out of scope of this effort, since a different set of technology is involved.

In our longer-term roadmap (see slide) of aligning with TorchDynamo and upstream PyTorch, we will directly import into the backend contract, so all of this work on custom ops will be obsoleted. But the indication is that this longer-term roadmap will take 1-2 years to fully pan out, while we want to unlock this user requirement in a shorter timeframe (ideally months).

When lowering to the backend contract, there are 4 main responsibilities (see architecture.md):

All tensors have been converted to value semantics.
All tensors have at least a known number of dimensions (i.e. rank), and ideally also have a precise size for each dimension.
All tensors have a known dtype.
Certain ops have been decomposed to make them easier to handle (this is configurable).

Let's go over these one by one:

Conversion to value semantics. This is a very tricky conversion (see MaximizeValueSemantics pass, test). From a design perspective, I don't see any option other than to require all custom ops be value-semantic. This seems like it will be enough for the use cases that the community has presented so far.
Shape inference. We already do this with the Shape Library. The shape functions are written in a Python script that is then used to generate the shape functions as MLIR. Since the shape functions are represented as MLIR snippets, we can (conceptually) already load new shape functions from a .mlir file on the side without recompilation. So there is no big architectural change here.
Dtype inference. This currently happens in a hardcoded C++ pass RefineTypes. This pass needs to be rewritten to not require recompiling to support new ops. We can follow the same design as the shape library -- reify, abstractly interpret, drop. This will be a total rewrite of this component.
Decompositions. We currently have thousands of lines of decompositions in DecomposeComplexOps. We should be able to follow a similar approach as the shape library, and basically just replace aten.foo ... with call @aten.foo$decomposition .... PyTorch already maintains many such decompositions (here and here), so this will also allow unifying efforts with those upstream, which is a big win by itself.

The overall design that we are left with is that we need to generalize the shape library infrastructure to encompass dtype refinement rules and decompositions. The system could look something like this, which is a generalization of shape_lib_gen.py:

def aten〇tanh〡shape(self: List[int]) -> List[int]:
    return upstream_shape_functions.unary(self)
def aten〇tanh〡dtype(self_shape: List[int], self_dtype: int) -> int:
    return dtype_functions.unary(self_shape, self_dtype)
# This is just an example -- we would probably not want this decomposition for tanh.
# Decompositions would be optional, but if provided would be available for use.
def aten〇tanh〡decomposition(self: torch.Tensor) -> torch.Tensor:
    exp, expn = torch.exp(self), torch.exp(-self)
    return (exp - expn) / (exp + expn)

We would process this and generate a .mlir file similar to the current shape library, which would be usable to service the goals listed above. When users want to support custom ops, they will write such functions, and use tools provided by Torch-MLIR to import that into a .mlir file and then include those for the passes that need them.

The main work items here are:

Generalize shape library testing framework to work with dtype functions and decompositions
Migrate PyTorch’s type promotion logic (currently hardcoded in refine types) to Python so that dtype functions can use it
Generalize (and potentially improve) the abstract interpretation code from the shape library to also work with dtype calculations and decompositions
Create an interface for users to load their custom ops: $ torch-mlir-opt -pass-pipeline='torchscript-module-to-torch-backend-pipeline{extra-library=my_custom_ops.mlir}'

Thoughts or feedback?

The text was updated successfully, but these errors were encountered:

powderluv · 2022-10-07T13:55:23Z

I assume we will also add the python wrapper to point to the custom ops from python ?

ramiro050 · 2022-10-08T00:51:12Z

I assume we will also add the python wrapper to point to the custom ops from python ?

Yeah, the Torch-MLIR Python API will have a way to specify custom ops file.

The original design for the dtype functions outlined in llvm#1462 was unable to properly handle ops that take optional tensors as an input when the optional tensor has a value of None. By the time the op gets imported into torch-mlir, if an optional value is None, all information about the original type is lost from the op type signature, preventing torch-mlir from knowing if a value of None was from an optional tensor or not, which was crucial in the original design since each tensor argument must be turned into two separate arguments for the dtype function. This commit changes the interface to dtype functions such that each tensor turns into a tuple of two ints, the first representing the rank of the tensor and the second the dtype of the tensor. Since now there is a one-to-one correspondence between the operands of an op and the operands of its dtype function, there is no ambiguity about which operand of the op corresponds with which operand of the dtype function. To test the implementation, this commit defines dtype functions for the convolution ops, all of which take one optional tensor as an argument.

…1865) The original design for the dtype functions outlined in #1462 was unable to properly handle ops that take optional tensors as an input when the optional tensor has a value of None. By the time the op gets imported into torch-mlir, if an optional value is None, all information about the original type is lost from the op type signature, preventing torch-mlir from knowing if a value of None was from an optional tensor or not, which was crucial in the original design since each tensor argument must be turned into two separate arguments for the dtype function. This commit changes the interface to dtype functions such that each tensor turns into a tuple of two ints, the first representing the rank of the tensor and the second the dtype of the tensor. Since now there is a one-to-one correspondence between the operands of an op and the operands of its dtype function, there is no ambiguity about which operand of the op corresponds with which operand of the dtype function. To test the implementation, this commit defines dtype functions for the convolution ops, all of which take one optional tensor as an argument.

makslevental · 2023-03-22T15:44:32Z

Create an interface for users to load their custom ops: $ torch-mlir-opt -pass-pipeline='torchscript-module-to-torch-backend-pipeline{extra-library=my_custom_ops.mlir}'

Another way to implement this is to enable/allow users to provide these shape and dtype functions in the same module that they provide their func.func @forward(.... I show a minimal implementation in #1959.

ramiro050 · 2023-03-22T16:18:56Z

Another way to implement this is to enable/allow users to provide these shape and dtype functions in the same module that they provide their func.func @forward(.... I show a minimal implementation in #1959.

I have a couple of questions. The first is regarding the user experience. How would the interface for users work to add their dtype functions into their module? Ideally we want to avoid users having to directly modify an MLIR module at the Python level to attach more MLIR.

The second question is regarding how things are expected to work internally in torch-mlir. Carrying around the shape+dtype functions in the module means that now every pass in torch-mlir will also be spending time applying to shape+dtype functions, which would increase lowering time (might not be a concern in practice, but it's something to think about). The second issue I foresee is that shape+dtype functions are not required to satisfy the backend contract. Carrying around the functions in the module would mean that we now have to have special casing in the LowerToBackendContract pass to not consider those functions. Lastly, carrying around many shape+dtype functions would make reading MLIR dumps harder, since passes that have nothing to do with type refinement would be also dumping the shape+dtype functions, worsening the debugging experience.

What are the downsides from the current proposed design you're trying to solve with this new design?

makslevental · 2023-03-22T16:43:33Z

What are the downsides from the current proposed design you're trying to solve with this new design?

There is no current design - there is only an idea for an API.

Create an interface for users to load their custom ops: $ torch-mlir-opt -pass-pipeline='torchscript-module-to-torch-backend-pipeline{extra-library=my_custom_ops.mlir}'

I can certainly envisage how that would be implemented but it's not here yet.

Next;

Ideally we want to avoid users having to directly modify an MLIR module at the Python level to attach more MLIR

There's some premise/assumption implicit here that I don't understand/agree with - there's simultaneously mentions of the python API:

Torch-MLIR's only supported "stable" interface is the torch_mlir.compile

and the proposal here (re: going through torch-mlir-opt). So which is it? Is torch-mlir Python first or CLI first? Regardless, my answer is: we have IR so you can pass textual source between passes (instead requiring transforming one single AST), i.e., to decouple passes for exactly this kind of thing - run a pass, fiddle, run another pass, etc. Whether that fiddling happens through the Python bindings or by handwriting a my_custom_ops.mlir is irrelevant (how would people create this my_custom_ops.mlir if not through Python anyway...). Indeed, for custom ops (wrt shape/dtype), my PR would be moot if the implementation of this API existed today because I would just generate those shape functions using Python bindings and print(module) and then pass them that to extra-library.

Next;

The second issue I foresee is that shape+dtype functions are not required to satisfy the backend contract.

Bu they are through TensorStaticInfoCastOp; because even though torch.operator carries a non-valsem type (and thus doesn't trip LowerToBackendContract.cpp#L91), tensor_static_info_casts are generated around it and those thwart passing the backend contract.

Finally;

Carrying around the shape+dtype functions in the module means that now every pass in torch-mlir will also be spending time applying to shape+dtype functions, which would increase lowering time (might not be a concern in practice, but it's something to think about)... Lastly, carrying around many shape+dtype functions would make reading MLIR dumps harder...

Sure, but we should weigh the costs/benefits of increased compile times and decreased legibility against today's users' needs wrt lowering models which include currently unsupported ops.

silvasean · 2023-03-22T19:09:35Z

Hey folks, it looks like the main stakeholders here are happy with pushing this feature through as we have specced here. I created another issue to help shepherd this work to completion at a detailed work level. Let's move further discussion there: #1963

The original design for the dtype functions outlined in llvm#1462 was unable to properly handle ops that take optional tensors as an input when the optional tensor has a value of None. By the time the op gets imported into torch-mlir, if an optional value is None, all information about the original type is lost from the op type signature, preventing torch-mlir from knowing if a value of None was from an optional tensor or not, which was crucial in the original design since each tensor argument must be turned into two separate arguments for the dtype function. This commit changes the interface to dtype functions such that each tensor turns into a tuple of two ints, the first representing the rank of the tensor and the second the dtype of the tensor. Since now there is a one-to-one correspondence between the operands of an op and the operands of its dtype function, there is no ambiguity about which operand of the op corresponds with which operand of the dtype function. To test the implementation, this commit defines dtype function for convolution op, which takes one optional tensor as an argument.

…1965) The original design for the dtype functions outlined in #1462 was unable to properly handle ops that take optional tensors as an input when the optional tensor has a value of None. By the time the op gets imported into torch-mlir, if an optional value is None, all information about the original type is lost from the op type signature, preventing torch-mlir from knowing if a value of None was from an optional tensor or not, which was crucial in the original design since each tensor argument must be turned into two separate arguments for the dtype function. This commit changes the interface to dtype functions such that each tensor turns into a tuple of two ints, the first representing the rank of the tensor and the second the dtype of the tensor. Since now there is a one-to-one correspondence between the operands of an op and the operands of its dtype function, there is no ambiguity about which operand of the op corresponds with which operand of the dtype function. To test the implementation, this commit defines dtype function for convolution op, which takes one optional tensor as an argument.

commit bafe339 Author: Ramiro Leal-Cavazos <[email protected]> Date: Mon May 8 21:26:56 2023 +0000 Add dtype functions for aten.atan and prims.squeeze commit bebf695 Author: Ramiro Leal-Cavazos <[email protected]> Date: Mon May 8 21:26:10 2023 +0000 Remove duplicate code from merge with main commit 0d11895 Author: Ramiro Leal-Cavazos <[email protected]> Date: Fri May 5 21:39:02 2023 +0000 Update LLVM tag commit 73d5c07 Merge: 899d8bc eaaaeb6 Author: Ramiro Leal-Cavazos <[email protected]> Date: Fri May 5 21:30:09 2023 +0000 Merge remote-tracking branch 'upstream/main' into merge-main commit 899d8bc Author: Ramiro Leal-Cavazos <[email protected]> Date: Mon Mar 13 21:39:14 2023 +0000 Add dtype functions for `aten.ge.Tensor` and `aten.le.Tensor` commit f58f9c2 Merge: ce7abf4 4912c39 Author: Ramiro Leal-Cavazos <[email protected]> Date: Mon Mar 13 21:32:00 2023 +0000 Merge branch 'main' into merge-main commit ce7abf4 Author: Jiahao Li <[email protected]> Date: Wed Feb 22 06:54:41 2023 +0800 Add dtype functions for ops that take dtype from 2nd operand (llvm#1891) commit 63945a2 Author: Ramiro Leal-Cavazos <[email protected]> Date: Mon Feb 13 17:56:09 2023 -0800 Change dtype functions interface to take ints tuple for each tensor (llvm#1865) The original design for the dtype functions outlined in llvm#1462 was unable to properly handle ops that take optional tensors as an input when the optional tensor has a value of None. By the time the op gets imported into torch-mlir, if an optional value is None, all information about the original type is lost from the op type signature, preventing torch-mlir from knowing if a value of None was from an optional tensor or not, which was crucial in the original design since each tensor argument must be turned into two separate arguments for the dtype function. This commit changes the interface to dtype functions such that each tensor turns into a tuple of two ints, the first representing the rank of the tensor and the second the dtype of the tensor. Since now there is a one-to-one correspondence between the operands of an op and the operands of its dtype function, there is no ambiguity about which operand of the op corresponds with which operand of the dtype function. To test the implementation, this commit defines dtype functions for the convolution ops, all of which take one optional tensor as an argument. commit 981ac88 Author: Ramiro Leal-Cavazos <[email protected]> Date: Wed Feb 1 22:30:27 2023 +0000 Add dtype functions for two tensor promotion ops (llvm#1831) This commit adds dtype functions for ops in RefineTypes under the category of "Promote the two dtypes". The only ops not added here are convolution ops, since they take an optional tensor argument, and the dtype pipeline currently does not correctly handle that case. I will add a follow up patch fixing this. This commit also adds two helper functions that perform a very thorough testing of dtype functions. The helper function `_check_two_tensor_op` is able to independently test invalid input dtypes and invalid output dtypes. Lastly, this commit also XFAILs "MobilenetV3Module_basic". commit 83d4e89 Author: Jiahao Li <[email protected]> Date: Sat Jan 21 02:39:41 2023 +0800 Add dtype functions for floating point ops (llvm#1813) commit 8cae5ba Author: Ramiro Leal-Cavazos <[email protected]> Date: Mon Jan 16 14:32:23 2023 -0800 Add dtype functions for comparison ops (llvm#1806) This commit adds dtype functions for comparison ops that always return a tensor of dtype `i1`. commit 5b77c15 Author: Ramiro Leal-Cavazos <[email protected]> Date: Mon Jan 16 20:27:49 2023 +0000 Add CI to `dtype-functions-staging` branch commit ac94ba2 Author: Ramiro Leal-Cavazos <[email protected]> Date: Thu Jan 12 22:41:04 2023 +0000 Move dtype functions into their own section in lib gen file In order to easily keep track of the dtype functions that have been moved to `abstract_interp_lib_gen.py` and make it easier to add new ones, this commit groups all the dtype functions together, rather than having them interspersed between the shape functions.

…lvm#1965) The original design for the dtype functions outlined in llvm#1462 was unable to properly handle ops that take optional tensors as an input when the optional tensor has a value of None. By the time the op gets imported into torch-mlir, if an optional value is None, all information about the original type is lost from the op type signature, preventing torch-mlir from knowing if a value of None was from an optional tensor or not, which was crucial in the original design since each tensor argument must be turned into two separate arguments for the dtype function. This commit changes the interface to dtype functions such that each tensor turns into a tuple of two ints, the first representing the rank of the tensor and the second the dtype of the tensor. Since now there is a one-to-one correspondence between the operands of an op and the operands of its dtype function, there is no ambiguity about which operand of the op corresponds with which operand of the dtype function. To test the implementation, this commit defines dtype function for convolution op, which takes one optional tensor as an argument.

igozali · 2023-06-28T21:53:32Z

Apologies for the very silly question, but wondering where I can find the rationale/RFC/discussion for using the unicode character 〇 in the function names above?

powderluv · 2023-06-29T02:00:25Z

Its a cust〇m 〇p :)

powderluv · 2023-06-29T02:00:56Z

And I think can be closed now that it is implemented.

ramiro050 · 2023-06-29T14:30:51Z

@igozali, there is some information about why use the character here:

torch-mlir/docs/adding_abstract_interpretation_functions.md

Lines 26 to 27 in 449cfb8

    
           Note the use of `〇` as a separator since `.` or `::` aren't legal 
        
           in a Python identifier.

and the rationale for using the entire operator name for each function here.

In short, we wanted to keep the full operator name with namespace information in a way that would then be easily parsable by the Python script that generates the MLIR snippets, so that in MLIR it is easy to know which abstract interpretation function goes with which op. If we had resorted to using normal characters (letters and _), there would be several ambiguous cases, because some ops have _ in their names.

silvasean mentioned this issue Oct 7, 2022

How to support self-define torch op? #1212

Closed

Svoch mentioned this issue Oct 7, 2022

QPyTorch Support (BFP quantization) #910

Open

This was referenced Oct 7, 2022

Brainstorm on how to add custom op authoring #462

Closed

A prototype of jit.ignore support. #504

Closed

Svoch mentioned this issue Oct 24, 2022

[RFC] Selective Op decomposition: Propagate down backend-legal Ops as Custom Ops #1519

Open

ramiro050 mentioned this issue Jan 16, 2023

Push for finishing transition to Python dtype functions #1807

Closed

23 tasks

ramiro050 mentioned this issue Feb 9, 2023

Change dtype functions interface to take ints tuple for each tensor #1865

Merged

ramiro050 mentioned this issue Mar 17, 2023

LowerToBackendContract: Explicitly error out on unimplemented operator #1947

Merged

silvasean mentioned this issue Mar 21, 2023

RFC: Support pushing custom ops through backend-contract using torch.operator #1959

Closed

silvasean mentioned this issue Mar 22, 2023

Custom Ops Sprint #1963

Closed

4 tasks

ramiro050 mentioned this issue Mar 22, 2023

Change dtype functions interface to take ints tuple for each tensor #1965

Merged

powderluv closed this as completed Jun 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Custom ops support #1462

[RFC] Custom ops support #1462

silvasean commented Oct 5, 2022

powderluv commented Oct 7, 2022

ramiro050 commented Oct 8, 2022

makslevental commented Mar 22, 2023

ramiro050 commented Mar 22, 2023

makslevental commented Mar 22, 2023 •

edited

Loading

silvasean commented Mar 22, 2023

igozali commented Jun 28, 2023

powderluv commented Jun 29, 2023

powderluv commented Jun 29, 2023

ramiro050 commented Jun 29, 2023

[RFC] Custom ops support #1462

[RFC] Custom ops support #1462

Comments

silvasean commented Oct 5, 2022

powderluv commented Oct 7, 2022

ramiro050 commented Oct 8, 2022

makslevental commented Mar 22, 2023

ramiro050 commented Mar 22, 2023

makslevental commented Mar 22, 2023 • edited Loading

silvasean commented Mar 22, 2023

igozali commented Jun 28, 2023

powderluv commented Jun 29, 2023

powderluv commented Jun 29, 2023

ramiro050 commented Jun 29, 2023

makslevental commented Mar 22, 2023 •

edited

Loading