
micro changes #2

Closed
wants to merge 1 commit into from

Conversation

@lamarrr lamarrr commented Sep 23, 2019

Unnecessary branching

@nihui

nihui commented Sep 23, 2019

This branching is the coding style, so any bit testing follows the same branching style here.

@benvanik
Collaborator

benvanik commented Oct 5, 2019

Sorry, but purely stylistic changes like this are difficult to integrate.

@benvanik benvanik closed this Oct 5, 2019
benvanik added a commit that referenced this pull request Jun 1, 2020

Switching the VM calling convention and making the stack growable.
Reduces the default stack to be small enough to fit on the host stack,
growing dynamically if needed. This dramatically reduces our memory
footprint and speeds up almost everything VM-related (invocation from C
to bytecode, bytecode to C, and bytecode to bytecode). It also has the
effect of properly guarding register array accesses via valid bit masks
rather than expensive logic in the actual bytecode op implementations.

Latest VM bytecode benchmark numbers for desktop:
-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
BM_ModuleCreate                     176 ns          176 ns      3733333
BM_ModuleCreateState               72.4 ns         71.5 ns      8960000
BM_FullModuleInit                   236 ns          235 ns      2986667
BM_EmptyFuncReference              2.57 ns         2.55 ns    263529412
BM_EmptyFuncBytecode               76.4 ns         76.7 ns     11200000
BM_CallInternalFuncReference       2.27 ns         2.29 ns    320000000
BM_CallInternalFuncBytecode        22.4 ns         22.5 ns     29866680
BM_CallImportedFuncBytecode        20.7 ns         20.5 ns     32000000
BM_LoopSumReference/100000         1.17 ns         1.17 ns    640000000
BM_LoopSumBytecode/1000000         7.73 ns         7.64 ns     90000000

@GMNGeoffrey GMNGeoffrey mentioned this pull request Jul 7, 2020
This was referenced Aug 5, 2020
@ScottTodd ScottTodd mentioned this pull request Aug 17, 2020
dcaballe added a commit to dcaballe/iree that referenced this pull request Feb 22, 2023
qcolombet added a commit to qcolombet/iree that referenced this pull request Mar 15, 2023
Add a pass to extract address computation from memref.load and nvgpu.ldmatrix.
Plumb the affine.apply decomposition through a new pass: decompose-affine-ops.
Rework the lowering pipeline to connect all the pieces together:
1. extract-address-computation turns address computation into subviews
2. expand-strided-metadata turns subviews into affine.apply
3. licm hoists the code introduced by step 2 into the right scf.for loop
4. decompose-affine-ops breaks down the `affine.apply`s so that the resulting
   subexpressions can be hoisted into the right loops
5. licm hoists the code introduced by step 4
6. lower-affine materializes the decomposed `affine.apply`s. We do that early
   so that canonicalization does not undo this work.

Phases 3-5 need to run on `scf.for`, so the whole process has to run before
the SCF-to-CF conversion.

Missing bits:
- More comments
- Add tests
- Fix the subview sizes for non-unary loads (although it doesn't break
  anything, this is technically incorrect).
- LLVM's reassociate pass undoes some of the improvements made here. Need to
  file a bug for that, investigate, and fix.

Note: extract-address-computation could be moved into upstream LLVM, but we
need to figure out where it could live since it depends on both memref and
nvgpu. We probably want to come up with an interface like
`isAddressComputationExtractable` to push it upstream.
qcolombet added a commit to qcolombet/iree that referenced this pull request Mar 17, 2023
qcolombet added a commit to qcolombet/iree that referenced this pull request Mar 21, 2023
Add a pass to extract address computation from memref.load and nvgpu.ldmatrix.
Plumb the affine.apply decomposition through a new pass: decompose-affine-ops.
Rework the lowering pipeline to connect all the pieces together:
1. extract-address-computation turns address computation into subviews
2. expand-strided-metadata turns subviews into affine.apply
3. licm hoists the code introduced by step 2 into the right scf.for loop
4. decompose-affine-ops breaks down the `affine.apply`s so that the resulting
   subexpressions can be hoisted into the right loops
5. licm hoists the code introduced by step 4
6. lower-affine materializes the decomposed `affine.apply`s. We do that early
   so that canonicalization does not undo this work.

Phases 3-5 need to run on `scf.for`, so the whole process has to run before
the SCF-to-CF conversion.

TODO:
- Add support for memref.store, vector.transfer_xxx

Note: extract-address-computation could be moved into upstream LLVM, but we
need to figure out where it could live since it depends on both memref and
nvgpu. We probably want to come up with an interface like
`isAddressComputationExtractable` to push it upstream.
qcolombet added a commit to qcolombet/iree that referenced this pull request Mar 24, 2023
qcolombet added a commit to qcolombet/iree that referenced this pull request Mar 24, 2023
Add a pass to extract address computation from memref.load and nvgpu.ldmatrix.
Plumb the affine.apply decomposition through a new pass: decompose-affine-ops.
Rework the lowering pipeline to connect all the pieces together:
1. extract-address-computation turns address computation into subviews
2. expand-strided-metadata turns subviews into affine.apply
3. licm hoists the code introduced by step 2 into the right scf.for loop
4. decompose-affine-ops breaks down the `affine.apply`s so that the resulting
   subexpressions can be hoisted into the right loops
5. licm hoists the code introduced by step 4
6. lower-affine materializes the decomposed `affine.apply`s. We do that early
   so that canonicalization does not undo this work.

Phases 3-5 need to run on `scf.for`, so the whole process has to run before
the SCF-to-CF conversion.

TODO:
- Add support for vector.transfer_xxx

Note: extract-address-computation could be moved into upstream LLVM, but we
need to figure out where it could live since it depends on both memref and
nvgpu. We probably want to come up with an interface like
`isAddressComputationExtractable` to push it upstream.
ScottTodd added a commit that referenced this pull request Aug 22, 2023
Caught by ASan:

```
370: =================================================================
370: ==3911909==ERROR: LeakSanitizer: detected memory leaks
370: 
370: Direct leak of 376 byte(s) in 1 object(s) allocated from:
370:     #0 0x6a9b022 in calloc (iree-build/tools/iree-run-mlir+0x6a9b022)
370:     #1 0x6ad5d47 in iree_allocator_system_alloc iree/runtime/src/iree/base/allocator.c:104:17
370:     #2 0x6ad5d47 in iree_allocator_system_ctl iree/runtime/src/iree/base/allocator.c:144:14
370:     #3 0x6ad56ad in iree_allocator_issue_alloc iree/runtime/src/iree/base/allocator.c:27:10
370:     #4 0x6ad56ad in iree_allocator_malloc iree/runtime/src/iree/base/allocator.c:32:10
370:     #5 0x1acf2486 in iree_vm_bytecode_module_create iree/runtime/src/iree/vm/bytecode/module.c:836:3
370:     #6 0x6afdf31 in iree_tooling_create_run_context iree/runtime/src/iree/tooling/run_module.c:107:9
370:     #7 0x6afdf31 in iree_tooling_run_module_with_data iree/runtime/src/iree/tooling/run_module.c:340:3
370:     #8 0x6ad2a24 in iree::(anonymous namespace)::CompileAndRunFile(iree_compiler_session_t*, char const*) iree/tools/iree-run-mlir-main.cc:359:3
370:     #9 0x6ad2a24 in main iree/tools/iree-run-mlir-main.cc:520:20
370:     #10 0x7fce3bc456c9 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
```
qcolombet added a commit to qcolombet/iree that referenced this pull request Sep 28, 2023
This patch adds a transform dialect interpreter pass that can be used to
annotate specific operations with specific strategies. This patch relies on
iree-org#14788 to actually "link" the strategy
within the related module.

The intended use case, as demonstrated in the added test cases, is to:
1. specify the matcher in a dedicated file (in the transform dialect format)
   that is passed to the compiler through
   `--iree-llvmcpu-transform-dialect-select-strategy`.
2. provide the strategy as a named sequence through the library option
   `--iree-codegen-transform-library-file-name`.

If the matcher from step 1 applies, then the transform dialect pipeline will
pick up the proper strategy from step 2 and apply it to the annotated operations.
bjacob added a commit that referenced this pull request Nov 10, 2023
…e_thread_request_affinity` (#15499)

TSan report:

```
WARNING: ThreadSanitizer: data race (pid=45817)
  Read of size 4 at 0x0001084004e0 by thread T2:
    #0 iree_thread_request_affinity threading_darwin.c:230 (local-task_vmvx_semaphore_submission_test:arm64+0x100078f40)
    #1 iree_task_worker_main worker.c:385 (local-task_vmvx_semaphore_submission_test:arm64+0x100071594)
    #2 iree_thread_start_routine threading_darwin.c:72 (local-task_vmvx_semaphore_submission_test:arm64+0x100078e3c)

  Previous write of size 4 at 0x0001084004e0 by main thread:
    #0 iree_thread_create threading_darwin.c:140 (local-task_vmvx_semaphore_submission_test:arm64+0x100078ca4)
    #1 iree_task_worker_initialize worker.c:66 (local-task_vmvx_semaphore_submission_test:arm64+0x1000714f8)
    #2 iree_task_executor_create executor.c:161 (local-task_vmvx_semaphore_submission_test:arm64+0x10006b2b0)
```

The read of `thread->mach_port` at
https://github.com/openxla/iree/blob/ccc4c3719cea467477a783f1c9e9f1fc06b4c508/runtime/src/iree/base/internal/threading_darwin.c#L230

is not ordered relative to the write of that variable by the parent
thread after `pthread_mach_thread_np` returns:
https://github.com/openxla/iree/blob/ccc4c3719cea467477a783f1c9e9f1fc06b4c508/runtime/src/iree/base/internal/threading_darwin.c#L140

The proposed fix is that the worker thread shouldn't need to access its
own `thread->mach_port`; it can equivalently call `mach_task_self()`.
ScottTodd added a commit that referenced this pull request Aug 20, 2024
Reverts #17751. A few of the new tests are failing on
various platforms:

* Timeouts (after 60 seconds) in
`iree/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_large_llvm-cpu_local-task`
on GitHub-hosted Windows and macOS runners:
  * https://github.com/iree-org/iree/actions/runs/10468974350/job/28990992473#step:8:2477
  * https://github.com/iree-org/iree/actions/runs/10468947894/job/28990909629#step:9:3076
    
    ```
1529/1568 Test #969:
iree/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_large_llvm-cpu_local-task
.............................***Timeout 60.07 sec
---
TEST[attention_2_2048_256_512_128_dtype_f16_f16_f16_f16_2_2048_256_512_128_256_1.0_0]
---
    Attention shape (BATCHxMxK1xK2xN): 2x2048x256x512x256x128
    ```

* Compilation error on arm64:
https://github.com/iree-org/iree/actions/runs/10468944505/job/28990909321#step:4:9815:

    ```
[415/1150] Generating
/work/build-arm64/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.vmfb
from
e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.mlir
FAILED:
tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.vmfb
/work/build-arm64/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.vmfb
cd /work/build-arm64/tests/e2e/attention &&
/work/build-arm64/tools/iree-compile --output-format=vm-bytecode
--mlir-print-op-on-diagnostic=false --iree-hal-target-backends=llvm-cpu
/work/build-arm64/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.mlir
-o
/work/build-arm64/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.vmfb
--iree-hal-executable-object-search-path=\"/work/build-arm64\"
--iree-llvmcpu-embedded-linker-path=\"/work/build-arm64/llvm-project/bin/lld\"
--iree-llvmcpu-wasm-linker-path=\"/work/build-arm64/llvm-project/bin/lld\"

/work/build-arm64/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.mlir:4:14:
error: Yield operand #2 is not equivalent to the corresponding iter
bbArg
      %result1 = iree_linalg_ext.attention {
                 ^

/work/build-arm64/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.mlir:1:1:
note: called from
func.func @attention_2_1024_128_256_64_dtype_f16_f16_f16_f16(%query:
tensor<2x1024x128xf16>, %key: tensor<2x256x128xf16>, %value:
tensor<2x256x64xf16>, %scale: f32) -> tensor<2x1024x64xf16> {
    ^

/work/build-arm64/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.mlir:4:14:
error: failed to run translation of source executable to target
executable for backend #hal.executable.target<"llvm-cpu",
"embedded-elf-arm_64", {cpu = "generic", cpu_features = "+reserve-x18",
data_layout =
"e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128-Fn32",
native_vector_size = 16 : i64, target_triple =
"aarch64-unknown-unknown-eabi-elf"}>
      %result1 = iree_linalg_ext.attention {
                 ^

/work/build-arm64/tests/e2e/attention/e2e_attention_cpu_f16_f16_f16_medium_llvm-cpu_local-task_attention.mlir:1:1:
note: called from
func.func @attention_2_1024_128_256_64_dtype_f16_f16_f16_f16(%query:
tensor<2x1024x128xf16>, %key: tensor<2x256x128xf16>, %value:
tensor<2x256x64xf16>, %scale: f32) -> tensor<2x1024x64xf16> {
    ^
    failed to translate executables
    ```