
micro changes #2

Closed
wants to merge 1 commit into from

Conversation

@lamarrr lamarrr commented Sep 23, 2019

Unnecessary branching

@nihui

nihui commented Sep 23, 2019

This branching is the coding style, so any bit testing follows the same branching style here.

@benvanik
Collaborator

benvanik commented Oct 5, 2019

Sorry, but purely stylistic changes like this are difficult to integrate.

@benvanik benvanik closed this Oct 5, 2019
benvanik added a commit that referenced this pull request Jun 1, 2020

Switching the VM calling convention and making the stack growable.
Reduces the default stack to be small enough to fit on the host stack,
growing dynamically if needed. This dramatically reduces our memory
footprint and speeds up almost everything VM-related (invocation from C
to bytecode, bytecode to C, and bytecode to bytecode). It also has the
effect of properly guarding register array accesses via valid bit masks
rather than expensive logic in the actual bytecode op implementations.

Latest VM bytecode benchmark numbers for desktop:
-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
BM_ModuleCreate                     176 ns          176 ns      3733333
BM_ModuleCreateState               72.4 ns         71.5 ns      8960000
BM_FullModuleInit                   236 ns          235 ns      2986667
BM_EmptyFuncReference              2.57 ns         2.55 ns    263529412
BM_EmptyFuncBytecode               76.4 ns         76.7 ns     11200000
BM_CallInternalFuncReference       2.27 ns         2.29 ns    320000000
BM_CallInternalFuncBytecode        22.4 ns         22.5 ns     29866680
BM_CallImportedFuncBytecode        20.7 ns         20.5 ns     32000000
BM_LoopSumReference/100000         1.17 ns         1.17 ns    640000000
BM_LoopSumBytecode/1000000         7.73 ns         7.64 ns     90000000

@GMNGeoffrey GMNGeoffrey mentioned this pull request Jul 7, 2020
This was referenced Aug 5, 2020
@ScottTodd ScottTodd mentioned this pull request Aug 17, 2020
dcaballe added a commit to dcaballe/iree that referenced this pull request Feb 22, 2023
qcolombet added a commit to qcolombet/iree that referenced this pull request Mar 15, 2023
Add a pass to extract address computation from memref.load and nvgpu.ldmatrix.
Plumb the affine.apply decomposition through a new pass: decompose-affine-ops.
Rework the lowering pipeline to connect all the pieces together:
1. extract-address-computation turns address computation into subviews
2. expand-strided-metadata turns subviews into affine.apply
3. licm hoists the code introduced by step 2 into the right scf.for loop
4. decompose-affine-ops breaks down the `affine.apply`s so that the resulting
   subexpressions can be hoisted into the right loops
5. licm hoists the code introduced by step 4
6. lower-affine materializes the decomposed `affine.apply`s. We do that early
   so that canonicalization does not undo this work.

Phases 3-5 need to run on `scf.for`, so the whole process has to run before
the SCF-to-CF conversion.

Missing bits:
- More comments
- Add tests
- Fix the subview sizes for non-unary loads (although it doesn't break
  anything, this is technically incorrect).
- LLVM's reassociate pass undoes some of the improvements made here. Need to
  file a bug for that, investigate, and fix.

Note: extract-address-computation could be moved into upstream LLVM, but we
need to figure out where it could live since it depends on both memref and
nvgpu. We probably want to come up with an interface like
`isAddressComputationExtractable` to push it upstream.
qcolombet added a commit to qcolombet/iree that referenced this pull request Mar 17, 2023
qcolombet added a commit to qcolombet/iree that referenced this pull request Mar 21, 2023
Add a pass to extract address computation from memref.load and nvgpu.ldmatrix.
Plumb the affine.apply decomposition through a new pass: decompose-affine-ops.
Rework the lowering pipeline to connect all the pieces together:
1. extract-address-computation turns address computation into subviews
2. expand-strided-metadata turns subviews into affine.apply
3. licm hoists the code introduced by step 2 into the right scf.for loop
4. decompose-affine-ops breaks down the `affine.apply`s so that the resulting
   subexpressions can be hoisted into the right loops
5. licm hoists the code introduced by step 4
6. lower-affine materializes the decomposed `affine.apply`s. We do that early
   so that canonicalization does not undo this work.

Phases 3-5 need to run on `scf.for`, so the whole process has to run before
the SCF-to-CF conversion.

TODO:
- Add support for memref.store, vector.transfer_xxx

Note: extract-address-computation could be moved into upstream LLVM, but we
need to figure out where it could live since it depends on both memref and
nvgpu. We probably want to come up with an interface like
`isAddressComputationExtractable` to push it upstream.
qcolombet added a commit to qcolombet/iree that referenced this pull request Mar 24, 2023
qcolombet added a commit to qcolombet/iree that referenced this pull request Mar 24, 2023
Add a pass to extract address computation from memref.load and nvgpu.ldmatrix.
Plumb the affine.apply decomposition through a new pass: decompose-affine-ops.
Rework the lowering pipeline to connect all the pieces together:
1. extract-address-computation turns address computation into subviews
2. expand-strided-metadata turns subviews into affine.apply
3. licm hoists the code introduced by step 2 into the right scf.for loop
4. decompose-affine-ops breaks down the `affine.apply`s so that the resulting
   subexpressions can be hoisted into the right loops
5. licm hoists the code introduced by step 4
6. lower-affine materializes the decomposed `affine.apply`s. We do that early
   so that canonicalization does not undo this work.

Phases 3-5 need to run on `scf.for`, so the whole process has to run before
the SCF-to-CF conversion.

TODO:
- Add support for vector.transfer_xxx

Note: extract-address-computation could be moved into upstream LLVM, but we
need to figure out where it could live since it depends on both memref and
nvgpu. We probably want to come up with an interface like
`isAddressComputationExtractable` to push it upstream.
ScottTodd added a commit that referenced this pull request Aug 22, 2023
Caught by ASan:

```
370: =================================================================
370: ==3911909==ERROR: LeakSanitizer: detected memory leaks
370: 
370: Direct leak of 376 byte(s) in 1 object(s) allocated from:
370:     #0 0x6a9b022 in calloc (iree-build/tools/iree-run-mlir+0x6a9b022)
370:     #1 0x6ad5d47 in iree_allocator_system_alloc iree/runtime/src/iree/base/allocator.c:104:17
370:     #2 0x6ad5d47 in iree_allocator_system_ctl iree/runtime/src/iree/base/allocator.c:144:14
370:     #3 0x6ad56ad in iree_allocator_issue_alloc iree/runtime/src/iree/base/allocator.c:27:10
370:     #4 0x6ad56ad in iree_allocator_malloc iree/runtime/src/iree/base/allocator.c:32:10
370:     #5 0x1acf2486 in iree_vm_bytecode_module_create iree/runtime/src/iree/vm/bytecode/module.c:836:3
370:     #6 0x6afdf31 in iree_tooling_create_run_context iree/runtime/src/iree/tooling/run_module.c:107:9
370:     #7 0x6afdf31 in iree_tooling_run_module_with_data iree/runtime/src/iree/tooling/run_module.c:340:3
370:     #8 0x6ad2a24 in iree::(anonymous namespace)::CompileAndRunFile(iree_compiler_session_t*, char const*) iree/tools/iree-run-mlir-main.cc:359:3
370:     #9 0x6ad2a24 in main iree/tools/iree-run-mlir-main.cc:520:20
370:     #10 0x7fce3bc456c9 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
```
qcolombet added a commit to qcolombet/iree that referenced this pull request Sep 28, 2023
This patch adds a transform dialect interpreter pass that can be used to
annotate specific operations with specific strategies. This patch relies on
iree-org#14788 to actually "link" the strategy
within the related module.

The intended use case, as demonstrated in the added test cases, is to:
1. specify the matcher in a dedicated file (in the transform dialect format)
   that is passed to the compiler through
   `--iree-llvmcpu-transform-dialect-select-strategy`.
2. provide the strategy as a named sequence through the library option
   `--iree-codegen-transform-library-file-name`.

If the matcher from step 1 applies, then the transform dialect pipeline will
pick up the proper strategy from step 2 and apply it to the annotated operations.
bjacob added a commit that referenced this pull request Nov 10, 2023
…e_thread_request_affinity` (#15499)

TSan report:

```
WARNING: ThreadSanitizer: data race (pid=45817)
  Read of size 4 at 0x0001084004e0 by thread T2:
    #0 iree_thread_request_affinity threading_darwin.c:230 (local-task_vmvx_semaphore_submission_test:arm64+0x100078f40)
    #1 iree_task_worker_main worker.c:385 (local-task_vmvx_semaphore_submission_test:arm64+0x100071594)
    #2 iree_thread_start_routine threading_darwin.c:72 (local-task_vmvx_semaphore_submission_test:arm64+0x100078e3c)

  Previous write of size 4 at 0x0001084004e0 by main thread:
    #0 iree_thread_create threading_darwin.c:140 (local-task_vmvx_semaphore_submission_test:arm64+0x100078ca4)
    #1 iree_task_worker_initialize worker.c:66 (local-task_vmvx_semaphore_submission_test:arm64+0x1000714f8)
    #2 iree_task_executor_create executor.c:161 (local-task_vmvx_semaphore_submission_test:arm64+0x10006b2b0)
```

The read of `thread->mach_port` at
https://github.com/openxla/iree/blob/ccc4c3719cea467477a783f1c9e9f1fc06b4c508/runtime/src/iree/base/internal/threading_darwin.c#L230

is not ordered relative to the write of that variable by the parent
thread after `pthread_mach_thread_np` returns:
https://github.com/openxla/iree/blob/ccc4c3719cea467477a783f1c9e9f1fc06b4c508/runtime/src/iree/base/internal/threading_darwin.c#L140

The proposed fix is that the worker thread shouldn't need to access its
own `thread->mach_port`; it can equivalently call `mach_task_self()`.
ScottTodd added a commit that referenced this pull request Aug 20, 2024
Reverts #17751. A few of the new tests are failing on
various platforms:

* Timeouts (after 60 seconds) in
`iree/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_large_llvm-cpu_local-task`
on GitHub-hosted Windows and macOS runners:
  * https://github.com/iree-org/iree/actions/runs/10468974350/job/28990992473#step:8:2477
  * https://github.com/iree-org/iree/actions/runs/10468947894/job/28990909629#step:9:3076
    
    ```
1529/1568 Test #969:
iree/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_large_llvm-cpu_local-task
.............................***Timeout 60.07 sec
---
TEST[attention_2_2048_256_512_128_dtype_f16_f16_f16_f16_2_2048_256_512_128_256_1.0_0]
---
    Attention shape (BATCHxMxK1xK2xN): 2x2048x256x512x256x128
    ```

* Compilation error on arm64:
https://github.com/iree-org/iree/actions/runs/10468944505/job/28990909321#step:4:9815:

    ```
[415/1150] Generating
/work/build-arm64/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.vmfb
from
e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.mlir
FAILED:
tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.vmfb
/work/build-arm64/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.vmfb
cd /work/build-arm64/tests/e2e/attention &&
/work/build-arm64/tools/iree-compile --output-format=vm-bytecode
--mlir-print-op-on-diagnostic=false --iree-hal-target-backends=llvm-cpu
/work/build-arm64/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.mlir
-o
/work/build-arm64/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.vmfb
--iree-hal-executable-object-search-path=\"/work/build-arm64\"
--iree-llvmcpu-embedded-linker-path=\"/work/build-arm64/llvm-project/bin/lld\"
--iree-llvmcpu-wasm-linker-path=\"/work/build-arm64/llvm-project/bin/lld\"

/work/build-arm64/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.mlir:4:14:
error: Yield operand #2 is not equivalent to the corresponding iter
bbArg
      %result1 = iree_linalg_ext.attention {
                 ^

/work/build-arm64/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.mlir:1:1:
note: called from
func.func @attention_2_1024_128_256_64_dtype_f16_f16_f16_f16(%query:
tensor<2x1024x128xf16>, %key: tensor<2x256x128xf16>, %value:
tensor<2x256x64xf16>, %scale: f32) -> tensor<2x1024x64xf16> {
    ^

/work/build-arm64/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.mlir:4:14:
error: failed to run translation of source executable to target
executable for backend #hal.executable.target<"llvm-cpu",
"embedded-elf-arm_64", {cpu = "generic", cpu_features = "+reserve-x18",
data_layout =
"e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128-Fn32",
native_vector_size = 16 : i64, target_triple =
"aarch64-unknown-unknown-eabi-elf"}>
      %result1 = iree_linalg_ext.attention {
                 ^

/work/build-arm64/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.mlir:1:1:
note: called from
func.func @attention_2_1024_128_256_64_dtype_f16_f16_f16_f16(%query:
tensor<2x1024x128xf16>, %key: tensor<2x256x128xf16>, %value:
tensor<2x256x64xf16>, %scale: f32) -> tensor<2x1024x64xf16> {
    ^
    failed to translate executables
    ```