
Fixing executable linking when other targets are present. #19035

Merged (2 commits) Nov 6, 2024

Conversation

benvanik (Collaborator) commented Nov 5, 2024

The existing linking code was all kinds of wrong when multiple executables with disjoint entry points were present. Linking needs to be reworked in general but this incremental change ensures that targets only link executables that contain variants that use them. There are definitely still corner cases that don't work.
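The filtering rule the fix enforces can be sketched as follows. This is illustrative pseudocode only, not the actual IREE pass or its API; the names (`Variant`, `Executable`, `executables_to_link`) are invented for the sketch:

```python
from dataclasses import dataclass

# Illustrative sketch only: names and structure are invented, not IREE's API.
@dataclass
class Variant:
    target: str  # e.g. an executable target backend name

@dataclass
class Executable:
    name: str
    variants: list

def executables_to_link(executables, target):
    # A target links only the executables that contain a variant
    # compiled for that target, instead of linking everything.
    return [ex.name for ex in executables
            if any(v.target == target for v in ex.variants)]

exes = [
    Executable("ex0", [Variant("embedded-elf-x86_64")]),
    Executable("ex1", [Variant("vmvx-bytecode-fb")]),
    Executable("ex2", [Variant("embedded-elf-x86_64"), Variant("vmvx-bytecode-fb")]),
]
print(executables_to_link(exes, "vmvx-bytecode-fb"))  # ['ex1', 'ex2']
```

Under this rule, an executable with disjoint entry points for a different target is simply skipped rather than mislinked.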

This is a small step towards heterogeneous devices. The following example now compiles and runs:

```mlir
#executable_target_embedded_elf_x86_64_ = #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {target_triple = "x86_64-none-elf", native_vector_size = 4 : index}>
#executable_target_embedded_elf_x86_64_whatever = #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {target_triple = "x86_64-none-elf", native_vector_size = 16 : index}>
#executable_target_vmvx_bytecode_fb = #hal.executable.target<"vmvx", "vmvx-bytecode-fb">

util.global private @device_a = #hal.device.target<"local", {ordinal = 0 : index}, [
  #executable_target_embedded_elf_x86_64_,
  #executable_target_embedded_elf_x86_64_whatever
]> : !hal.device
util.global private @device_b = #hal.device.target<"local", {ordinal = 1 : index}, [
  #executable_target_vmvx_bytecode_fb
]> : !hal.device

func.func public @mutli_device_mul_add(
  // Input argument is resident on device_a (tooling defaults to the first device).
  %input_a: tensor<4xf32> {iree.abi.affinity = #hal.device.affinity<@device_a>}
) -> (
  // Output result is expected to be on device_a (though not required).
  tensor<4xf32> {iree.abi.affinity = #hal.device.affinity<@device_a>}
) {
  // Compute on device_a (input is there).
  %constant_a = arith.constant dense<[0.0, 1.0, 2.0, 3.0]> : tensor<4xf32>
  %transient_a = arith.mulf %input_a, %constant_a : tensor<4xf32>
  // Transfer the result from device_a -> device_b.
  %transient_b = flow.tensor.transfer %transient_a : tensor<4xf32> to #hal.device.affinity<@device_b>
  // Compute on device_b.
  %constant_b = arith.constant dense<[4.0, 5.0, 6.0, 7.0]> : tensor<4xf32>
  %result_b = arith.mulf %transient_b, %constant_b : tensor<4xf32>
  // Transfer the result from device_b -> device_a.
  %result_a = flow.tensor.transfer %result_b : tensor<4xf32> to #hal.device.affinity<@device_a>
  // More compute on device_a - should produce into the result buffer.
  %result_a2 = arith.addf %result_a, %constant_a : tensor<4xf32>
  // Return the result on device_a (as required by ABI attr).
  func.return %result_a2 : tensor<4xf32>
}
```

```sh
$ iree-compile --iree-execution-model=async-external iree-run-module-multi.mlir -o module.vmfb
$ iree-run-module \
    --module=module.vmfb --function=mutli_device_mul_add --input=4xf32=10,11,12,13 \
    --device=local-task --device=local-task --task_topology_group_count=1
```

(testing this in-tree is hard right now due to ergonomics issues - this is all experimental anyway)
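For reference, the example's dataflow is plain elementwise arithmetic, so its expected numeric result is easy to check by hand. A minimal Python sketch (not IREE code; the function name here just mirrors the example) of what the module computes for `--input=4xf32=10,11,12,13`:

```python
# Sketch of the example's math: mul on device_a, transfer, mul on
# device_b, transfer back, add on device_a. Pure Python, no IREE.
def multi_device_mul_add(input_a):
    constant_a = [0.0, 1.0, 2.0, 3.0]   # resident on device_a
    constant_b = [4.0, 5.0, 6.0, 7.0]   # resident on device_b
    transient = [x * a for x, a in zip(input_a, constant_a)]
    result_b = [t * b for t, b in zip(transient, constant_b)]
    return [r + a for r, a in zip(result_b, constant_a)]

print(multi_device_mul_add([10.0, 11.0, 12.0, 13.0]))  # [0.0, 56.0, 146.0, 276.0]
```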

@benvanik benvanik added bug 🐞 Something isn't working compiler/dialects Relating to the IREE compiler dialects (flow, hal, vm) labels Nov 5, 2024
@benvanik benvanik requested a review from hanhanW November 5, 2024 21:43
@benvanik benvanik changed the title Fixing VMVX linking when other targets are present. [WIP] Fixing VMVX linking when other targets are present. Nov 5, 2024
@benvanik benvanik force-pushed the users/benvanik/vmvx-linking-fix branch from f84c197 to 5843c96 Compare November 5, 2024 23:12
@benvanik benvanik changed the title [WIP] Fixing VMVX linking when other targets are present. [WIP] Fixing executable linking when other targets are present. Nov 5, 2024
@benvanik benvanik changed the title [WIP] Fixing executable linking when other targets are present. Fixing executable linking when other targets are present. Nov 5, 2024
@benvanik benvanik force-pushed the users/benvanik/vmvx-linking-fix branch from 5843c96 to 1b742fd Compare November 6, 2024 00:24
hanhanW (Contributor) commented Nov 6, 2024

There is a typo in the example: the comma is in the wrong place.

```mlir
util.global private @device_a = #hal.device.target<"local", {ordinal = 0 : index}, [
  #executable_target_embedded_elf_x86_64_
  #executable_target_embedded_elf_x86_64_whatever,
]> : !hal.device
```

It should be

```mlir
util.global private @device_a = #hal.device.target<"local", {ordinal = 0 : index}, [
  #executable_target_embedded_elf_x86_64_,
  #executable_target_embedded_elf_x86_64_whatever
]> : !hal.device
```

hanhanW (Contributor) left a comment:

yay!

@benvanik benvanik requested a review from kuhar November 6, 2024 05:27
@benvanik benvanik enabled auto-merge (squash) November 6, 2024 05:30
@benvanik benvanik merged commit 44adc3a into main Nov 6, 2024
38 of 39 checks passed
@benvanik benvanik deleted the users/benvanik/vmvx-linking-fix branch November 6, 2024 05:45
Groverkss pushed a commit to Groverkss/iree that referenced this pull request Dec 1, 2024
giacs-epic pushed a commit to giacs-epic/iree that referenced this pull request Dec 4, 2024