
Fixing executable linking when other targets are present. #19035

Merged (2 commits) Nov 6, 2024

Conversation

benvanik (Collaborator) commented Nov 5, 2024

The existing linking code was all kinds of wrong when multiple executables with disjoint entry points were present. Linking needs to be reworked in general but this incremental change ensures that targets only link executables that contain variants that use them. There are definitely still corner cases that don't work.
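The filtering rule the fix enforces can be sketched as follows. This is illustrative pseudocode only, not the actual IREE pass or its API; the names (`Variant`, `Executable`, `executables_to_link`) are invented for the sketch:

```python
from dataclasses import dataclass

# Illustrative sketch only: names and structure are invented, not IREE's API.
@dataclass
class Variant:
    target: str  # e.g. an executable target backend name

@dataclass
class Executable:
    name: str
    variants: list

def executables_to_link(executables, target):
    # A target links only the executables that contain a variant
    # compiled for that target, instead of linking everything.
    return [ex.name for ex in executables
            if any(v.target == target for v in ex.variants)]

exes = [
    Executable("ex0", [Variant("embedded-elf-x86_64")]),
    Executable("ex1", [Variant("vmvx-bytecode-fb")]),
    Executable("ex2", [Variant("embedded-elf-x86_64"), Variant("vmvx-bytecode-fb")]),
]
print(executables_to_link(exes, "vmvx-bytecode-fb"))  # ['ex1', 'ex2']
```

Under this rule, an executable with disjoint entry points for a different target is simply skipped rather than mislinked.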

This is a small step towards heterogeneous devices. The following example now compiles and runs:

```mlir
#executable_target_embedded_elf_x86_64_ = #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {target_triple = "x86_64-none-elf", native_vector_size = 4 : index}>
#executable_target_embedded_elf_x86_64_whatever = #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {target_triple = "x86_64-none-elf", native_vector_size = 16 : index}>
#executable_target_vmvx_bytecode_fb = #hal.executable.target<"vmvx", "vmvx-bytecode-fb">

util.global private @device_a = #hal.device.target<"local", {ordinal = 0 : index}, [
  #executable_target_embedded_elf_x86_64_,
  #executable_target_embedded_elf_x86_64_whatever
]> : !hal.device
util.global private @device_b = #hal.device.target<"local", {ordinal = 1 : index}, [
  #executable_target_vmvx_bytecode_fb
]> : !hal.device

func.func public @mutli_device_mul_add(
  // Input argument is resident on device_a (tooling defaults to the first device).
  %input_a: tensor<4xf32> {iree.abi.affinity = #hal.device.affinity<@device_a>}
) -> (
  // Output result is expected to be on device_a (though not required).
  tensor<4xf32> {iree.abi.affinity = #hal.device.affinity<@device_a>}
) {
  // Compute on device_a (input is there).
  %constant_a = arith.constant dense<[0.0, 1.0, 2.0, 3.0]> : tensor<4xf32>
  %transient_a = arith.mulf %input_a, %constant_a : tensor<4xf32>
  // Transfer the result from device_a -> device_b.
  %transient_b = flow.tensor.transfer %transient_a : tensor<4xf32> to #hal.device.affinity<@device_b>
  // Compute on device_b.
  %constant_b = arith.constant dense<[4.0, 5.0, 6.0, 7.0]> : tensor<4xf32>
  %result_b = arith.mulf %transient_b, %constant_b : tensor<4xf32>
  // Transfer the result from device_b -> device_a.
  %result_a = flow.tensor.transfer %result_b : tensor<4xf32> to #hal.device.affinity<@device_a>
  // More compute on device_a - should produce into the result buffer.
  %result_a2 = arith.addf %result_a, %constant_a : tensor<4xf32>
  // Return the result on device_a (as required by ABI attr).
  func.return %result_a2 : tensor<4xf32>
}
```

```sh
$ iree-compile --iree-execution-model=async-external iree-run-module-multi.mlir -o module.vmfb
$ iree-run-module \
    --module=module.vmfb --function=mutli_device_mul_add --input=4xf32=10,11,12,13 \
    --device=local-task --device=local-task --task_topology_group_count=1
```

(testing this in-tree is hard right now due to ergonomics issues - this is all experimental anyway)
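For reference, the example's dataflow is plain elementwise arithmetic, so its expected numeric result is easy to check by hand. A minimal Python sketch (not IREE code; the function name here just mirrors the example) of what the module computes for `--input=4xf32=10,11,12,13`:

```python
# Sketch of the example's math: mul on device_a, transfer, mul on
# device_b, transfer back, add on device_a. Pure Python, no IREE.
def multi_device_mul_add(input_a):
    constant_a = [0.0, 1.0, 2.0, 3.0]   # resident on device_a
    constant_b = [4.0, 5.0, 6.0, 7.0]   # resident on device_b
    transient = [x * a for x, a in zip(input_a, constant_a)]
    result_b = [t * b for t, b in zip(transient, constant_b)]
    return [r + a for r, a in zip(result_b, constant_a)]

print(multi_device_mul_add([10.0, 11.0, 12.0, 13.0]))  # [0.0, 56.0, 146.0, 276.0]
```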

@benvanik benvanik added bug 🐞 Something isn't working compiler/dialects Relating to the IREE compiler dialects (flow, hal, vm) labels Nov 5, 2024
@benvanik benvanik requested a review from hanhanW November 5, 2024 21:43
@benvanik benvanik changed the title Fixing VMVX linking when other targets are present. [WIP] Fixing VMVX linking when other targets are present. Nov 5, 2024
@benvanik benvanik force-pushed the users/benvanik/vmvx-linking-fix branch from f84c197 to 5843c96 Compare November 5, 2024 23:12
@benvanik benvanik changed the title [WIP] Fixing VMVX linking when other targets are present. [WIP] Fixing executable linking when other targets are present. Nov 5, 2024
@benvanik benvanik changed the title [WIP] Fixing executable linking when other targets are present. Fixing executable linking when other targets are present. Nov 5, 2024
@benvanik benvanik force-pushed the users/benvanik/vmvx-linking-fix branch from 5843c96 to 1b742fd Compare November 6, 2024 00:24
hanhanW (Contributor) commented Nov 6, 2024

There is a typo in the example: the comma is in the wrong place.

```mlir
util.global private @device_a = #hal.device.target<"local", {ordinal = 0 : index}, [
  #executable_target_embedded_elf_x86_64_
  #executable_target_embedded_elf_x86_64_whatever,
]> : !hal.device
```

It should be

```mlir
util.global private @device_a = #hal.device.target<"local", {ordinal = 0 : index}, [
  #executable_target_embedded_elf_x86_64_,
  #executable_target_embedded_elf_x86_64_whatever
]> : !hal.device
```

hanhanW (Contributor) left a comment:

yay!

@benvanik benvanik requested a review from kuhar November 6, 2024 05:27
@benvanik benvanik enabled auto-merge (squash) November 6, 2024 05:30
@benvanik benvanik merged commit 44adc3a into main Nov 6, 2024
38 of 39 checks passed
@benvanik benvanik deleted the users/benvanik/vmvx-linking-fix branch November 6, 2024 05:45
Groverkss pushed a commit to Groverkss/iree that referenced this pull request Dec 1, 2024
giacs-epic pushed a commit to giacs-epic/iree that referenced this pull request Dec 4, 2024