
[Feature Request] Support packed i1 storage through encodings #19220

Open
hanhanW opened this issue Nov 20, 2024 · 5 comments · May be fixed by #19354

hanhanW commented Nov 20, 2024

c80fa3b adds support for packing i1 values back-to-back in storage behind an experimental flag. We've been struggling with how to switch between the two storage modes, and recently we found that encodings can serve this need.

There are two places that need to take the encoding into account. One is the EncodeHostTensors pass, where the stream.tensor.sizeof op takes a tensor type with an optional encoding and gets lowered to a storage size calculation (in bytes). The other is the TypePropagation pass, which converts types in dispatches for codegen.

I think we can either introduce a new op_type (e.g., i1_packed_storage) to EncodingAttr, or introduce a new attribute (e.g., an EncodingStorageAttr) that describes what the storage layout is. Either way works.

For the host part, we'll need to update the calculateStorageElementCountInBytes method to take the encoding into account. When the encoding is present, it should behave as if the flag is turned on. We may also need to update the calculateStorageElementOffsetInBytes method; I'm not sure at this moment.
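As a minimal sketch of the intended host-side computation (the function name and the hasPackedStorageEncoding helper are hypothetical, not the actual IREE signatures, and a static shape is assumed):

#include "llvm/Support/MathExtras.h"
#include "mlir/IR/BuiltinTypes.h"

// Hypothetical helper: returns true if the tensor type carries the
// packed-storage encoding.
bool hasPackedStorageEncoding(mlir::RankedTensorType type);

// Sketch only: how the storage-size calculation could honor the encoding.
int64_t calculateStorageSizeInBytesSketch(mlir::RankedTensorType type) {
  int64_t elementCount = type.getNumElements(); // assumes a static shape
  unsigned bitWidth = type.getElementTypeBitWidth();
  if (hasPackedStorageEncoding(type)) {
    // Packed mode: sub-byte elements are stored back-to-back, rounded up to
    // whole bytes, e.g. tensor<8xi1, #packed> -> ceil(8 * 1 / 8) = 1 byte.
    return llvm::divideCeil(elementCount * bitWidth, 8);
  }
  // Default mode: each element occupies a whole-byte container, so each i1
  // lives in an i8, e.g. tensor<8xi1> -> 8 bytes.
  return elementCount * llvm::divideCeil(bitWidth, 8);
}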

For the device side, we need to update the TypePropagation pass. We can either rework the pass or update the legalizeStorageElementType method that it uses.
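For reference, the kind of widening that legalizeStorageElementType performs looks roughly like this (a sketch under assumed signatures; the exemption for packed tensors would be wired in wherever the pass inspects the full tensor type):

#include "llvm/Support/MathExtras.h"
#include "mlir/IR/BuiltinTypes.h"

// Sketch only: widen sub-byte integer element types to their byte-aligned
// container (i1 -> i8). Tensors carrying the packed-storage encoding would be
// exempted from this conversion.
mlir::Type legalizeStorageElementTypeSketch(mlir::Type elementType) {
  auto intType = mlir::dyn_cast<mlir::IntegerType>(elementType);
  if (!intType)
    return elementType;
  unsigned width = intType.getWidth();
  if (width % 8 == 0)
    return elementType; // already byte-aligned
  // Round up to the next whole-byte width, e.g. i1 -> i8, i4 -> i8.
  return mlir::IntegerType::get(elementType.getContext(),
                                llvm::divideCeil(width, 8) * 8);
}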

Note that we do not need an attribute interface, so this is not blocked by my encoding prototype (i.e., data-tiling + multi-device). @lialan let me know if something is not clear. We can chat on VC.

hanhanW added the codegen (Shared code generation infrastructure and dialects) label on Nov 20, 2024
@benvanik (Collaborator) commented:

I suggest an entirely new EncodingStorageAttr - we don't want to put too much on the existing one - we have a whole dialect and can use it. And because this is something that will be user-facing all the way up to the framework, it's important to keep it simple and self-contained so people can construct it in Python.


hanhanW commented Nov 20, 2024

> I suggest an entirely new EncodingStorageAttr - we don't want to put too much on the existing one - we have a whole dialect and can use it. And because this is something that will be user-facing all the way up to the framework, it's important to keep it simple and self-contained so people can construct it in Python.

SGTM. I'm also concerned that we have already put so much burden on the existing one, so having a separate attribute sounds good to me.


hanhanW commented Dec 10, 2024

@lialan is making some progress on https://github.com/iree-org/iree/tree/lialan/i1_attr. I haven't looked at all the details, but there are some issues on the host side.

Let me highlight the requirements of the work:

  • The i1 type without an encoding should be treated as i8. I.e., each i1 value is stored in an i8 container.
  • The i1 type with the encoding should be treated as true i1. I.e., the values are packed back-to-back in memory.

Firstly, it could fail in the MaterializeHomogeneousEncodings pass. We can just disable data-tiling for now, but the fix should be trivial: we just need to update the type-converter target logic. Instead of checking whether any encoding is present, it should check whether the encoding is an EncodingAttr. Then the PackedStorageAttr should work (see the sketch after the current check below).

markUnknownOpDynamicallyLegal([](Operation *op) {
  // An op is legal only if none of its operands or results carry an encoding.
  auto typeHasEncoding = [](Type t) -> bool {
    auto tensorType = dyn_cast<RankedTensorType>(t);
    return tensorType && tensorType.getEncoding();
  };
  auto valueHasEncoding = [=](Value v) -> bool {
    return typeHasEncoding(v.getType());
  };
  bool hasOperandOrResultsWithEncoding =
      llvm::any_of(op->getOperands(), valueHasEncoding) ||
      llvm::any_of(op->getResultTypes(), typeHasEncoding);
  return !hasOperandOrResultsWithEncoding;
});
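A minimal sketch of that adjustment (assuming PackedStorageAttr lands next to EncodingAttr in the iree_encoding dialect; the lambda name is illustrative):

// Sketch only: restrict the legality test to EncodingAttr so tensors carrying
// PackedStorageAttr remain legal and pass through this pass untouched.
auto typeHasDataTilingEncoding = [](Type t) -> bool {
  auto tensorType = dyn_cast<RankedTensorType>(t);
  return tensorType && llvm::isa_and_nonnull<IREE::Encoding::EncodingAttr>(
                           tensorType.getEncoding());
};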

Anyway, there are more issues at the Stream level. @lialan I suggest that we start with the MLIR file below. It should give us a POC of how packed i1 should work in IREE. It is similar to what you shared with me. Context: constants always store each i1 in an i8 container; this is just a workaround for the prototype.

Compilation command: iree-compile --output-format=vm-bytecode --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu=znver4 --iree-llvmcpu-target-triple=x86_64-unknown-linux-gnu ~/repro.mlir -o /tmp/z.vmfb. (You can use whatever CPU target.)

#packed = #iree_encoding.packed_storage
func.func @i1_type() -> tensor<8xi8> {
  %c0 = arith.constant 0 : index
  %c255 = arith.constant 255 : i8
  %input1 = util.unfoldable_constant dense<[85]> : tensor<1xi8>  // b01010101
  %input2 = util.unfoldable_constant dense<[170]> : tensor<1xi8> // b10101010
  %lhs = flow.tensor.bitcast %input1 : tensor<1xi8> -> tensor<8xi1, #packed>
  %rhs = flow.tensor.bitcast %input2 : tensor<1xi8> -> tensor<8xi1, #packed>
  %empty = tensor.empty() : tensor<8xi8>
  %res = linalg.generic
        {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>], iterator_types = ["parallel"]}
        ins(%lhs, %rhs : tensor<8xi1, #packed>, tensor<8xi1, #packed>) outs(%empty: tensor<8xi8>) {
  ^bb0(%inlhs: i1, %inrhs: i1, %out: i8):
    %inres = arith.xori %inlhs, %inrhs: i1
    %ext = arith.extui %inres : i1 to i8
    linalg.yield %ext : i8
  } -> tensor<8xi8>
  return %res : tensor<8xi8>
}

The first issue is that some stream.tensor.clone ops are created in the ConvertToStream pass, and the clone op's verifier does not allow tensor types with encodings at this moment. The workaround is disabling the check in the prototype. We should reach out to Ben to see how to fix it properly.

With the patch below, we can bypass the error.

diff --git a/compiler/src/iree/compiler/Dialect/Stream/IR/StreamOps.cpp b/compiler/src/iree/compiler/Dialect/Stream/IR/StreamOps.cpp
index 3c898954ea..8896590e34 100644
--- a/compiler/src/iree/compiler/Dialect/Stream/IR/StreamOps.cpp
+++ b/compiler/src/iree/compiler/Dialect/Stream/IR/StreamOps.cpp
@@ -1508,6 +1508,7 @@ bool TensorSplatOp::preferCloneToConsumers() { return true; }
 
 LogicalResult TensorCloneOp::verify() {
   TensorCloneOp op = *this;
+#if 0
   // Clones can't change encodings but they can change shape and element type
   // information.
   auto sourceEncoding = llvm::cast<RankedTensorType>(op.getSourceEncoding());
@@ -1517,6 +1518,7 @@ LogicalResult TensorCloneOp::verify() {
                             << sourceEncoding.getEncoding() << " to "
                             << resultEncoding.getEncoding() << "; not allowed";
   }
+#endif
   if (failed(verifyOpDynamicDims(op, op.getSourceEncoding(),
                                  op.getSourceEncodingDims())) ||
       failed(verifyOpDynamicDims(op, op.getResultEncoding(),

The next step is figuring out how to generate "good IR" in the EncodeDeviceTensors pass. I thought the device-side update could be done in the TypePropagation pass, but I forgot about the EncodeDeviceTensors pass, which still runs during the Stream phase.

Looking at the requirements, I found a difficulty here: i1 types get converted to i8 types in this pass. If we legalize i1 types with the encoding to plain i1 tensor types, they will be further legalized to i8. Ideally, we should set up a proper dialect conversion for this: we could dynamically mark the already-legalized i1 types as legal so they are not pushed back into the worklist and stay as i1. However, the pass does not currently use the dialect conversion driver, so it could take some time to set that up.

The other solution is to legalize regular i1 types to i8 in this pass and leave the i1 types with the encoding as they are. Then, in the codegen TypePropagation pass, we do not touch regular i1 types, and we legalize i1 types with the encoding to regular i1 types, i.e., we drop the encoding in TypePropagation. This way, the host resolves the encodings during the Stream phase, and the device resolves the i1 storage problem in two steps: legalizing regular i1 types to i8 in EncodeDeviceTensors, and legalizing i1 types with the encoding to regular i1 types in TypePropagation. Then we get what we want on both the host and device side.

For the first version, I think we can go with the latter solution and work towards the ideal solution later on. This gives us the POC, and the frontend side can start their work; meanwhile, we can work on the former solution (a sketch of the encoding-dropping step is below). @lialan does that sound good?
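Concretely, the encoding-dropping step in TypePropagation could look something like this (a sketch; the helper is the same hypothetical one as above):

// Sketch only: drop the packed-storage encoding so the tensor becomes a plain
// i1 tensor for codegen. Regular i1 tensors were already widened to i8 by
// EncodeDeviceTensors on the Stream side.
mlir::RankedTensorType dropPackedStorageEncoding(mlir::RankedTensorType type) {
  if (!hasPackedStorageEncoding(type)) // hypothetical helper, as above
    return type;
  return mlir::RankedTensorType::get(type.getShape(), type.getElementType());
}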

=====

To be more specific, here is the IR before the EncodeDeviceTensors pass:

#map = affine_map<(d0) -> (d0)>
stream.executable private @i1_type_dispatch_0 {
  stream.executable.export public @i1_type_dispatch_0_elementwise_8_i1xi1xi8 workgroups() -> (index, index, index) {
    %x, %y, %z = flow.dispatch.workgroup_count_from_slice
    stream.return %x, %y, %z : index, index, index
  }
  builtin.module {
    func.func @i1_type_dispatch_0_elementwise_8_i1xi1xi8(%arg0: !stream.binding, %arg1: !stream.binding, %arg2: !stream.binding) {
      %c0 = arith.constant 0 : index
      %0 = stream.binding.subspan %arg0[%c0] : !stream.binding -> !flow.dispatch.tensor<readonly:tensor<8xi1, #iree_encoding.packed_storage>>
      %1 = stream.binding.subspan %arg1[%c0] : !stream.binding -> !flow.dispatch.tensor<readonly:tensor<8xi1, #iree_encoding.packed_storage>>
      %2 = stream.binding.subspan %arg2[%c0] : !stream.binding -> !flow.dispatch.tensor<writeonly:tensor<8xi8>>
      %3 = flow.dispatch.tensor.load %0, offsets = [0], sizes = [8], strides = [1] : !flow.dispatch.tensor<readonly:tensor<8xi1, #iree_encoding.packed_storage>> -> tensor<8xi1, #iree_encoding.packed_storage>
      %4 = flow.dispatch.tensor.load %1, offsets = [0], sizes = [8], strides = [1] : !flow.dispatch.tensor<readonly:tensor<8xi1, #iree_encoding.packed_storage>> -> tensor<8xi1, #iree_encoding.packed_storage>
      %5 = tensor.empty() : tensor<8xi8>
      %6 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel"]} ins(%3, %4 : tensor<8xi1, #iree_encoding.packed_storage>, tensor<8xi1, #iree_encoding.packed_storage>) outs(%5 : tensor<8xi8>) {
      ^bb0(%in: i1, %in_0: i1, %out: i8):
        %7 = arith.xori %in, %in_0 : i1
        %8 = arith.extui %7 : i1 to i8
        linalg.yield %8 : i8
      } -> tensor<8xi8>
      flow.dispatch.tensor.store %6, %2, offsets = [0], sizes = [8], strides = [1] : tensor<8xi8> -> !flow.dispatch.tensor<writeonly:tensor<8xi8>>
      return
    }
  }
}

The output IR that we're looking for should be identical; we then resolve the encoding attribute in the TypePropagation pass.

@benvanik (Collaborator) commented:

I may need a meeting to follow all of that (I'm kind of lost). I'm not quite sure what issues there are, if any, but the check in the clone op is mandatory (clones cannot change contents), and if you're hitting issues there it likely indicates that the approach has an issue. Maybe this is what you mean by cutting in or something, but I'm not sure that's a good idea. "Dropping the marble" is often a better approach with this kind of work: you can of course cut in wherever you want, but you may never be able to make that connect - or, as it sounds here, you may end up introducing a whole new set of problems to solve that could have been avoided.


hanhanW commented Dec 10, 2024

We probably won't have the clone op issue if the frontend can pass the tensor type with encoding directly. I think the op is introduced because of the flow.tensor.bitcast op in the input program.

Today we do not have a way to make the end-to-end prototype work, so we "hack" the program to cast the i8 tensor to the i1 tensor with the encoding.

%input1 = util.unfoldable_constant dense<[85]> : tensor<1xi8>  // b01010101
%lhs = flow.tensor.bitcast %input1 : tensor<1xi8> -> tensor<8xi1, #packed>

That said, we could avoid the issue if the input program looked something like the snippet below. I don't think we need to "fix" it; it is more like building the prototype and asking the frontend to generate the i1 types with the encoding directly.

func.func @main(%lhs : tensor<8xi1, #packed>) {
  ...
}
