micro changes #2
Closed
Conversation
Unnecessary branching
this branching is the coding style

Sorry, but purely stylistic changes like this are difficult to integrate.
benvanik
added a commit
that referenced
this pull request
Jun 1, 2020
# This is the 1st commit message:

Switching VM calling convention and making the stack growable.

Reduces the default stack to be small enough to fit on the host stack and dynamically growable if needed. This dramatically reduces our memory footprint and speeds up almost everything VM-related (invocation from C to bytecode, bytecode to C, and bytecode to bytecode). It also has the effect of properly guarding register array accesses via valid bit masks rather than expensive logic in the actual bytecode op implementations.

Latest VM bytecode benchmark numbers for desktop:

```
-----------------------------------------------------------------------
Benchmark                          Time             CPU      Iterations
-----------------------------------------------------------------------
BM_ModuleCreate                  176 ns          176 ns         3733333
BM_ModuleCreateState            72.4 ns         71.5 ns         8960000
BM_FullModuleInit                236 ns          235 ns         2986667
BM_EmptyFuncReference           2.57 ns         2.55 ns       263529412
BM_EmptyFuncBytecode            76.4 ns         76.7 ns        11200000
BM_CallInternalFuncReference    2.27 ns         2.29 ns       320000000
BM_CallInternalFuncBytecode     22.4 ns         22.5 ns        29866680
BM_CallImportedFuncBytecode     20.7 ns         20.5 ns        32000000
BM_LoopSumReference/100000      1.17 ns         1.17 ns       640000000
BM_LoopSumBytecode/1000000      7.73 ns         7.64 ns        90000000
```

# The commit message #2 will be skipped:
# x
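The growable-stack idea in the commit message above can be illustrated with a small sketch. This is Python purely for illustration; the real implementation lives in IREE's C runtime, and the class and method names below are hypothetical:

```python
class GrowableStack:
    """Sketch of a stack that starts small and doubles its storage on demand.

    Illustrates the idea only: start with a footprint small enough to live
    inline (on the host stack, in the real VM) and grow dynamically when a
    deep call chain needs more frames.
    """

    def __init__(self, initial_capacity=16):
        self._storage = [None] * initial_capacity
        self._top = 0

    def push_frame(self, frame):
        if self._top == len(self._storage):
            # Grow geometrically so the amortized push cost stays O(1).
            self._storage.extend([None] * len(self._storage))
        self._storage[self._top] = frame
        self._top += 1

    def pop_frame(self):
        self._top -= 1
        frame = self._storage[self._top]
        self._storage[self._top] = None
        return frame

    @property
    def capacity(self):
        return len(self._storage)


stack = GrowableStack(initial_capacity=4)
for i in range(9):
    stack.push_frame({"pc": i})
print(stack.capacity)        # grew from 4 to 16 to hold 9 frames
print(stack.pop_frame()["pc"])
```

The geometric growth keeps reallocation rare, which is what makes a small default footprint cheap in the common case.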
Merged
dcaballe
added a commit
to dcaballe/iree
that referenced
this pull request
Feb 22, 2023
qcolombet
added a commit
to qcolombet/iree
that referenced
this pull request
Mar 15, 2023
Add a pass to extract address computation from memref.load and nvgpu.ldmatrix.

Plumb the affine.apply decomposition through a new pass: decompose-affine-ops.

Rework the lowering pipeline to connect all the pieces together:
1. extract-address-computation turns address computation into subviews
2. expand-strided-metadata turns subviews into affine.apply
3. licm hoists the code introduced by step 2 into the right scf.for loop
4. decompose-affine-ops breaks down the `affine.apply`s so that the resulting subexpressions can be hoisted into the right loops
5. licm hoists the code introduced by step 4
6. lower-affine materializes the decomposed `affine.apply`s; we do that early to keep canonicalization from undoing this work

Phases 3-5 need to run on `scf.for`, so the whole process has to run before scf to cf.

Missing bits:
- More comments
- Add tests
- Fix the subview sizes for non-unary loads (although it doesn't break anything, this is technically incorrect)
- LLVM reassociate undoes some of the things we improve here; need to file a bug for that, investigate, and fix

Note: extract-address-computation could be moved to LLVM open source, but we need to figure out where it could live since it has dependencies on both memref and nvgpu. We probably want to come up with an interface like `isAddressComputationExtractable` to push it upstream.
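The payoff of decomposing an address computation and letting LICM hoist the loop-invariant pieces (steps 3-5 in the commit message above) can be sketched outside MLIR. This Python toy is only an illustration of the transformation's effect, not IREE's pass:

```python
def compute_addresses_naive(base, stride_i, stride_j, n_i, n_j):
    # Before decomposition: the full affine expression
    # base + i * stride_i + j * stride_j is re-evaluated in the inner loop.
    out = []
    for i in range(n_i):
        for j in range(n_j):
            out.append(base + i * stride_i + j * stride_j)
    return out


def compute_addresses_hoisted(base, stride_i, stride_j, n_i, n_j):
    # After decomposition, the subexpression base + i * stride_i depends
    # only on the outer induction variable, so (as LICM would) it is
    # hoisted out of the inner loop.
    out = []
    for i in range(n_i):
        row_base = base + i * stride_i  # hoisted: invariant in j
        for j in range(n_j):
            out.append(row_base + j * stride_j)
    return out


# Both variants produce the same addresses; the second does less work
# per inner-loop iteration.
assert compute_addresses_naive(0x1000, 64, 4, 3, 5) == \
       compute_addresses_hoisted(0x1000, 64, 4, 3, 5)
```

Breaking the single `affine.apply` into per-loop subexpressions is what makes each piece hoistable to the loop level where it is actually invariant.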
qcolombet
added a commit
to qcolombet/iree
that referenced
this pull request
Mar 17, 2023
qcolombet
added a commit
to qcolombet/iree
that referenced
this pull request
Mar 21, 2023
Add a pass to extract address computation from memref.load and nvgpu.ldmatrix.

Plumb the affine.apply decomposition through a new pass: decompose-affine-ops.

Rework the lowering pipeline to connect all the pieces together:
1. extract-address-computation turns address computation into subviews
2. expand-strided-metadata turns subviews into affine.apply
3. licm hoists the code introduced by step 2 into the right scf.for loop
4. decompose-affine-ops breaks down the `affine.apply`s so that the resulting subexpressions can be hoisted into the right loops
5. licm hoists the code introduced by step 4
6. lower-affine materializes the decomposed `affine.apply`s; we do that early to keep canonicalization from undoing this work

Phases 3-5 need to run on `scf.for`, so the whole process has to run before scf to cf.

TODO:
- Add support for memref.store, vector.transfer_xxx

Note: extract-address-computation could be moved to LLVM open source, but we need to figure out where it could live since it has dependencies on both memref and nvgpu. We probably want to come up with an interface like `isAddressComputationExtractable` to push it upstream.
qcolombet
added a commit
to qcolombet/iree
that referenced
this pull request
Mar 24, 2023
qcolombet
added a commit
to qcolombet/iree
that referenced
this pull request
Mar 24, 2023
Add a pass to extract address computation from memref.load and nvgpu.ldmatrix.

Plumb the affine.apply decomposition through a new pass: decompose-affine-ops.

Rework the lowering pipeline to connect all the pieces together:
1. extract-address-computation turns address computation into subviews
2. expand-strided-metadata turns subviews into affine.apply
3. licm hoists the code introduced by step 2 into the right scf.for loop
4. decompose-affine-ops breaks down the `affine.apply`s so that the resulting subexpressions can be hoisted into the right loops
5. licm hoists the code introduced by step 4
6. lower-affine materializes the decomposed `affine.apply`s; we do that early to keep canonicalization from undoing this work

Phases 3-5 need to run on `scf.for`, so the whole process has to run before scf to cf.

TODO:
- Add support for vector.transfer_xxx

Note: extract-address-computation could be moved to LLVM open source, but we need to figure out where it could live since it has dependencies on both memref and nvgpu. We probably want to come up with an interface like `isAddressComputationExtractable` to push it upstream.
ScottTodd
added a commit
that referenced
this pull request
Aug 22, 2023
Caught by ASan:

```
370: =================================================================
370: ==3911909==ERROR: LeakSanitizer: detected memory leaks
370:
370: Direct leak of 376 byte(s) in 1 object(s) allocated from:
370:     #0 0x6a9b022 in calloc (iree-build/tools/iree-run-mlir+0x6a9b022)
370:     #1 0x6ad5d47 in iree_allocator_system_alloc iree/runtime/src/iree/base/allocator.c:104:17
370:     #2 0x6ad5d47 in iree_allocator_system_ctl iree/runtime/src/iree/base/allocator.c:144:14
370:     #3 0x6ad56ad in iree_allocator_issue_alloc iree/runtime/src/iree/base/allocator.c:27:10
370:     #4 0x6ad56ad in iree_allocator_malloc iree/runtime/src/iree/base/allocator.c:32:10
370:     #5 0x1acf2486 in iree_vm_bytecode_module_create iree/runtime/src/iree/vm/bytecode/module.c:836:3
370:     #6 0x6afdf31 in iree_tooling_create_run_context iree/runtime/src/iree/tooling/run_module.c:107:9
370:     #7 0x6afdf31 in iree_tooling_run_module_with_data iree/runtime/src/iree/tooling/run_module.c:340:3
370:     #8 0x6ad2a24 in iree::(anonymous namespace)::CompileAndRunFile(iree_compiler_session_t*, char const*) iree/tools/iree-run-mlir-main.cc:359:3
370:     #9 0x6ad2a24 in main iree/tools/iree-run-mlir-main.cc:520:20
370:     #10 0x7fce3bc456c9 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
```
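The bookkeeping LeakSanitizer automates, pairing every allocation with a release and reporting whatever is still live at exit, can be mimicked with a tiny tracker. This Python sketch is only an analogy for the C allocator in the report above; the names are invented:

```python
class TrackingAllocator:
    """Counts live allocations so an unbalanced alloc/free shows up as a leak."""

    def __init__(self):
        self._live = {}
        self._next_id = 0

    def malloc(self, size):
        handle = self._next_id
        self._next_id += 1
        self._live[handle] = size
        return handle

    def free(self, handle):
        del self._live[handle]

    def report_leaks(self):
        # Like LeakSanitizer's exit-time report: anything still live leaked.
        return [(h, s) for h, s in sorted(self._live.items())]


alloc = TrackingAllocator()
module = alloc.malloc(376)   # e.g. the 376-byte module allocation in the log
scratch = alloc.malloc(64)
alloc.free(scratch)          # balanced alloc/free pair
print(alloc.report_leaks())  # the module allocation was never freed
```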
stellaraccident
pushed a commit
that referenced
this pull request
Sep 24, 2023
qcolombet
added a commit
to qcolombet/iree
that referenced
this pull request
Sep 28, 2023
This patch adds a transform dialect interpreter pass that can be used to annotate specific operations with specific strategies. This patch relies on iree-org#14788 to actually "link" the strategy within the related module.

The intended use case, as demonstrated in the added test cases, is to:
1. specify the matcher in a dedicated file (in the transform dialect format) that is passed to the compiler through `--iree-llvmcpu-transform-dialect-select-strategy`.
2. provide the strategy as a named sequence through the library option `--iree-codegen-transform-library-file-name`.

If the matcher applies in step 1, then the transform dialect pipeline will pick up the proper strategy from step 2 and apply it to the annotated operations.
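The two-stage flow described above (a matcher tags operations, a named-sequence library supplies the strategies) amounts to a dispatch table. A minimal Python sketch, with invented names, purely to illustrate the mechanism:

```python
def annotate_ops(ops, matcher, strategy_name):
    """Stage 1: tag every op the matcher accepts with a strategy name."""
    for op in ops:
        if matcher(op):
            op["strategy"] = strategy_name
    return ops


def apply_strategies(ops, library):
    """Stage 2: look each annotation up in the named-sequence library."""
    for op in ops:
        name = op.get("strategy")
        if name is not None:
            library[name](op)
    return ops


ops = [{"kind": "matmul", "tile": None}, {"kind": "reduce", "tile": None}]
library = {"tile_matmul": lambda op: op.update(tile=(8, 8))}
annotate_ops(ops, lambda op: op["kind"] == "matmul", "tile_matmul")
apply_strategies(ops, library)
print(ops[0]["tile"])  # the matmul was tiled; the reduce op was left untouched
```

Separating the matcher file from the strategy library is what lets the same strategies be reused under different selection criteria.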
bjacob
added a commit
that referenced
this pull request
Nov 10, 2023
…e_thread_request_affinity` (#15499)

TSan report:

```
WARNING: ThreadSanitizer: data race (pid=45817)
  Read of size 4 at 0x0001084004e0 by thread T2:
    #0 iree_thread_request_affinity threading_darwin.c:230 (local-task_vmvx_semaphore_submission_test:arm64+0x100078f40)
    #1 iree_task_worker_main worker.c:385 (local-task_vmvx_semaphore_submission_test:arm64+0x100071594)
    #2 iree_thread_start_routine threading_darwin.c:72 (local-task_vmvx_semaphore_submission_test:arm64+0x100078e3c)

  Previous write of size 4 at 0x0001084004e0 by main thread:
    #0 iree_thread_create threading_darwin.c:140 (local-task_vmvx_semaphore_submission_test:arm64+0x100078ca4)
    #1 iree_task_worker_initialize worker.c:66 (local-task_vmvx_semaphore_submission_test:arm64+0x1000714f8)
    #2 iree_task_executor_create executor.c:161 (local-task_vmvx_semaphore_submission_test:arm64+0x10006b2b0)
```

The read of `thread->mach_port` at https://github.com/openxla/iree/blob/ccc4c3719cea467477a783f1c9e9f1fc06b4c508/runtime/src/iree/base/internal/threading_darwin.c#L230 is not ordered relative to the write of that variable in the parent thread after `pthread_mach_thread_np` returns: https://github.com/openxla/iree/blob/ccc4c3719cea467477a783f1c9e9f1fc06b4c508/runtime/src/iree/base/internal/threading_darwin.c#L140

The proposed fix is that the worker thread shouldn't need to access its own `thread->mach_port`; it can equivalently call `mach_task_self()`.
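The fix pattern, having a thread query its own identity rather than read a field the parent writes after starting it, has a direct analogue in Python's threading, where `threading.get_ident()` plays the role `mach_task_self()` plays in the fix above. A hedged sketch of the safe pattern, not IREE's code:

```python
import threading


class Worker:
    """Each worker records its own id from inside the thread, so it never
    reads a field the creating thread might still be writing (the shape of
    the race in the TSan report above)."""

    def __init__(self):
        self.self_reported_id = None
        self._thread = threading.Thread(target=self._main)

    def _main(self):
        # Analogous to calling mach_task_self()/pthread_self() instead of
        # reading thread->mach_port written by the creator thread.
        self.self_reported_id = threading.get_ident()

    def start_and_join(self):
        self._thread.start()
        self._thread.join()


w = Worker()
w.start_and_join()
print(w.self_reported_id == w._thread.ident)  # True
```

Asking the OS for "my own identity" is race-free by construction: no cross-thread handoff of the value is needed at all.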
ScottTodd
added a commit
that referenced
this pull request
Aug 20, 2024
Reverts #17751. A few of the new tests are failing on various platforms:

* Timeouts (after 60 seconds) in `iree/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_large_llvm-cpu_local-task` on GitHub-hosted Windows and macOS runners
  * https://github.com/iree-org/iree/actions/runs/10468974350/job/28990992473#step:8:2477
  * https://github.com/iree-org/iree/actions/runs/10468947894/job/28990909629#step:9:3076

  ```
  1529/1568 Test #969: iree/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_large_llvm-cpu_local-task .............................***Timeout 60.07 sec
  --- TEST[attention_2_2048_256_512_128_dtype_f16_f16_f16_f16_2_2048_256_512_128_256_1.0_0] ---
  Attention shape (BATCHxMxK1xK2xN): 2x2048x256x512x256x128
  ```

* Compilation error on arm64: https://github.com/iree-org/iree/actions/runs/10468944505/job/28990909321#step:4:9815:

  ```
  [415/1150] Generating /work/build-arm64/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.vmfb from e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.mlir
  FAILED: tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.vmfb /work/build-arm64/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.vmfb
  cd /work/build-arm64/tests/e2e/attention && /work/build-arm64/tools/iree-compile --output-format=vm-bytecode --mlir-print-op-on-diagnostic=false --iree-hal-target-backends=llvm-cpu /work/build-arm64/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.mlir -o /work/build-arm64/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.vmfb --iree-hal-executable-object-search-path=\"/work/build-arm64\" --iree-llvmcpu-embedded-linker-path=\"/work/build-arm64/llvm-project/bin/lld\" --iree-llvmcpu-wasm-linker-path=\"/work/build-arm64/llvm-project/bin/lld\"
  /work/build-arm64/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.mlir:4:14: error: Yield operand #2 is not equivalent to the corresponding iter bbArg
    %result1 = iree_linalg_ext.attention {
               ^
  /work/build-arm64/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.mlir:1:1: note: called from
  func.func @attention_2_1024_128_256_64_dtype_f16_f16_f16_f16(%query: tensor<2x1024x128xf16>, %key: tensor<2x256x128xf16>, %value: tensor<2x256x64xf16>, %scale: f32) -> tensor<2x1024x64xf16> {
  ^
  /work/build-arm64/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.mlir:4:14: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"llvm-cpu", "embedded-elf-arm_64", {cpu = "generic", cpu_features = "+reserve-x18", data_layout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128-Fn32", native_vector_size = 16 : i64, target_triple = "aarch64-unknown-unknown-eabi-elf"}>
    %result1 = iree_linalg_ext.attention {
               ^
  /work/build-arm64/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.mlir:1:1: note: called from
  func.func @attention_2_1024_128_256_64_dtype_f16_f16_f16_f16(%query: tensor<2x1024x128xf16>, %key: tensor<2x256x128xf16>, %value: tensor<2x256x64xf16>, %scale: f32) -> tensor<2x1024x64xf16> {
  ^
  failed to translate executables
  ```