Updating Taichi results in slower compilation speed #6933

strongoier · 2022-12-20T08:51:59Z

Original user post: https://forum.taichi-lang.cn/t/topic/3710
Script to reproduce: https://github.com/JinliBot7/taichi-version-update-problem

The user reports that the script needs <10s to compile with Taichi v1.1.3 but needs ~30s to compile with Taichi v1.3.0.

On my MacBook Pro (13-inch, M1, 2020), the script needs ~5s to compile with Taichi v1.1.3 and ~20s to compile at commit 6f4ce42, which is a similar slowdown.


Issue: #6933

### Brief Summary

There are many redundant copies of local vars in the initial IR:

```
<[Tensor (3, 3) f32]> $128 = [$103, $106, $109, $112, $115, $118, $121, $124, $127]
$129 : local store [$100 <- $128]
<[Tensor (3, 3) f32]> $130 = alloca
$131 = local load [$100]
$132 : local store [$130 <- $131]
<[Tensor (3, 3) f32]> $133 = alloca
$134 = local load [$130]
$135 : local store [$133 <- $134]
<[Tensor (3, 3) f32]> $136 = alloca
$137 = local load [$133]
$138 : local store [$136 <- $137]
// In fact, `$128` can be used wherever `$136` is loaded.
```

These can come from many places; one of the main sources is the pass-by-value convention of `ti.func`. The consequence is that the number of instructions is unnecessarily large, which significantly slows down compilation.

My solution here is to identify and eliminate such redundant instructions in the first place so all later passes can take a much smaller number of instructions as input. These redundant local vars are essentially immutable ones - they are assigned only once and only loaded after the assignment. In this PR, I add an optimization pass `eliminate_immutable_local_vars` as the first pass. (P.S. The type check processes of `MatrixExpression` and `LocalLoadStmt` are fixed by the way to make the pass work properly.)

Let's study the effects in two cases: #6933 and [voxel-rt2](https://github.com/taichi-dev/voxel-rt2/blob/main/example7.py).

First, let's compare the number of instructions after `scalarization` pass (which happens immediately after the first pass).

| Kernel | Before this PR | After this PR | Rate of decrease |
| ------ | ------ | ------ | ------ |
| `test` (#6933) | 45859 | 26452 | 42% |
| `spatial_GRIS` (voxel-rt2) | 48519 | 17713 | 63% |

Then, let's compare the total time of `compile()`.

| Case | Before this PR | After this PR | Rate of decrease |
| ------ | ------ | ------ | ------ |
| #6933 | 20.622s | 8.550s | 59% |
| voxel-rt2 | 27.676s | 9.495s | 66% |

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
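To make the "assigned only once and only loaded after the assignment" criterion concrete, here is a minimal Python sketch on a toy straight-line IR. It is not the Taichi pass itself: the `Stmt` class, the op names (`alloca`, `store`, `load`, `matrix`, `use`), and the function body are assumptions made for illustration, and control flow is ignored entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stmt:
    """Toy IR statement (hypothetical, not Taichi's Stmt class)."""
    id: str
    op: str            # "alloca", "store", "load", or any other op name
    args: tuple = ()   # store: (dest, value); load: (src,); others: operands


def eliminate_immutable_local_vars(stmts):
    """Drop allocas that are stored once and only loaded afterwards.

    Assumes straight-line code, so "afterwards" means later in the list.
    """
    allocas = {s.id for s in stmts if s.op == "alloca"}
    store_count = {a: 0 for a in allocas}
    disqualified = set()
    for s in stmts:
        if s.op == "store":
            dest, value = s.args
            if dest in allocas:
                store_count[dest] += 1
            if value in allocas:
                disqualified.add(value)      # the pointer itself escapes
        elif s.op == "load":
            (src,) = s.args
            if src in allocas and store_count[src] == 0:
                disqualified.add(src)        # loaded before any store
        else:
            disqualified.update(a for a in s.args if a in allocas)
    immutable = {a for a in allocas
                 if store_count[a] == 1 and a not in disqualified}

    value_of = {}   # immutable alloca -> the single value stored into it
    replaced = {}   # removed load -> the value that replaces it
    out = []
    for s in stmts:
        if s.op == "alloca" and s.id in immutable:
            continue
        if s.op == "store" and s.args[0] in immutable:
            value_of[s.args[0]] = replaced.get(s.args[1], s.args[1])
            continue
        if s.op == "load" and s.args[0] in immutable:
            replaced[s.id] = value_of[s.args[0]]
            continue
        out.append(Stmt(s.id, s.op,
                        tuple(replaced.get(a, a) for a in s.args)))
    return out


# The chain from the IR dump above, plus a hypothetical final load and use.
chain = [
    Stmt("$100", "alloca"),
    Stmt("$128", "matrix", ("$103", "$106", "$109", "$112", "$115",
                            "$118", "$121", "$124", "$127")),
    Stmt("$129", "store", ("$100", "$128")),
    Stmt("$130", "alloca"),
    Stmt("$131", "load", ("$100",)),
    Stmt("$132", "store", ("$130", "$131")),
    Stmt("$133", "alloca"),
    Stmt("$134", "load", ("$130",)),
    Stmt("$135", "store", ("$133", "$134")),
    Stmt("$136", "alloca"),
    Stmt("$137", "load", ("$133",)),
    Stmt("$138", "store", ("$136", "$137")),
    Stmt("$139", "load", ("$136",)),
    Stmt("$140", "use", ("$139",)),
]
optimized = eliminate_immutable_local_vars(chain)
print([s.id for s in optimized])   # remaining ids: ['$128', '$140']
```

In this toy run the fourteen statements collapse to two, and `$140` reads `$128` directly, which matches the comment in the IR dump that `$128` can be used wherever `$136` is loaded.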

…ace_usages_with() (#7001)

Issue: #6933

### Brief Summary

`ImmediateIRModifier` aims at replacing `Stmt::replace_usages_with`, which visits the whole tree for a single replacement. `ImmediateIRModifier` is currently associated with a pass, visits the whole tree once at the beginning of that pass, and performs a single replacement in amortized constant time. It is now used in the two most recent passes, `eliminate_immutable_local_vars` and `scalarize`. More passes can be modified to leverage it in the future.

After this PR, the profiling result of the script in #6933 shows that:

- the time of `eliminate_immutable_local_vars` drops from `0.956 s` to `0.162 s`;
- the time of `scalarize` drops from `3.510 s` to `0.478 s`;
- the total time of `compile` drops from `8.550 s` to `4.696 s`.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
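The difference between scanning the whole tree per replacement and indexing usages once can be sketched in Python as follows. This illustrates the general idea only: `Stmt`, `replace_usages_naive`, and `UseMapModifier` are hypothetical names, and the real `ImmediateIRModifier` may be organized differently.

```python
from collections import defaultdict


class Stmt:
    """Toy statement with a mutable operand list (hypothetical)."""

    def __init__(self, name, operands=()):
        self.name = name
        self.operands = list(operands)


def replace_usages_naive(all_stmts, old, new):
    """Visit every statement for one replacement: cost grows with program size."""
    for s in all_stmts:
        s.operands = [new if op is old else op for op in s.operands]


class UseMapModifier:
    """Index all usages once, then touch only the users of the replaced stmt."""

    def __init__(self, all_stmts):
        # stmt -> list of (user statement, operand position)
        self.usages = defaultdict(list)
        for s in all_stmts:
            for i, op in enumerate(s.operands):
                self.usages[op].append((s, i))

    def replace_usages_with(self, old, new):
        for user, i in self.usages.pop(old, []):
            user.operands[i] = new
            # Keep the index valid in case `new` is replaced later.
            self.usages[new].append((user, i))


a, b = Stmt("a"), Stmt("b")
c = Stmt("c", [a])
d = Stmt("d", [a, c])
modifier = UseMapModifier([a, b, c, d])   # one full traversal, done up front
modifier.replace_usages_with(a, b)        # only c and d are visited
print([op.name for op in d.operands])     # ['b', 'c']
```

With the index in place, each replacement costs time proportional to the number of usages of the replaced statement rather than to the size of the whole IR, which is consistent with the drop in pass times reported above.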

Issue: #2590

### Brief Summary

Under the pure `dynamic_index` setting, `MatrixPtrStmt`s are not scalarized. This actually produces `2n` more instructions (`n` `ConstStmt`s and `n` `MatrixPtrStmt`s) than the scalarized setting, where `n` is the number of usages of `MatrixPtrStmt`s. This PR adds an `ExtractPointers` pass to eliminate all the redundant instructions. See comments in the code for details.

After this PR, the number of instructions after the `scalarize()` pass of the script in #6933 under dynamic index drops from 49589 to 26581, and the compilation time drops from 20.02s to 7.82s.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
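The summary above does not describe how `ExtractPointers` works internally. One plausible reading is that element pointers with the same base and the same constant index are computed once and shared. The Python sketch below shows that deduplication on a toy straight-line IR; the `Stmt` class, the op names, and `dedupe_matrix_ptrs` are all assumptions for illustration and should not be read as the actual pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stmt:
    """Toy IR statement (hypothetical, not Taichi's Stmt class)."""
    id: str
    op: str            # "const", "matrix_ptr", or any other op name
    args: tuple = ()   # const: (value,); matrix_ptr: (base_id, index_id)


def dedupe_matrix_ptrs(stmts):
    """Share one matrix_ptr per (base, constant index); drop unused consts."""
    const_value = {s.id: s.args[0] for s in stmts if s.op == "const"}
    canonical = {}   # (base, constant index) -> id of the first matrix_ptr
    replaced = {}    # duplicate matrix_ptr id -> canonical matrix_ptr id
    kept = []
    for s in stmts:
        if s.op == "const":
            kept.append(s)
        elif s.op == "matrix_ptr" and s.args[1] in const_value:
            key = (s.args[0], const_value[s.args[1]])
            if key in canonical:
                replaced[s.id] = canonical[key]   # duplicate: reuse the first
            else:
                canonical[key] = s.id
                kept.append(s)
        else:
            kept.append(Stmt(s.id, s.op,
                             tuple(replaced.get(a, a) for a in s.args)))
    # A const that only fed a removed matrix_ptr is now dead.
    used = {a for s in kept if s.op != "const" for a in s.args}
    return [s for s in kept if s.op != "const" or s.id in used]


program = [
    Stmt("$1", "alloca"),
    Stmt("$2", "const", (0,)),
    Stmt("$3", "matrix_ptr", ("$1", "$2")),
    Stmt("$4", "load", ("$3",)),
    Stmt("$5", "const", (0,)),                # same index as $2
    Stmt("$6", "matrix_ptr", ("$1", "$5")),   # same element as $3
    Stmt("$7", "load", ("$6",)),
    Stmt("$8", "const", (1,)),
    Stmt("$9", "matrix_ptr", ("$1", "$8")),
    Stmt("$10", "load", ("$9",)),
]
slim = dedupe_matrix_ptrs(program)
print(len(program), "->", len(slim))          # 10 -> 8
```

Each repeated access to the same element removes one constant and one pointer statement in this toy model, which mirrors the `2n` overhead described above.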


strongoier added the advanced optimization label Dec 20, 2022

strongoier self-assigned this Dec 20, 2022

strongoier added this to Taichi Lang Dec 20, 2022

strongoier moved this to Untriaged in Taichi Lang Dec 20, 2022

strongoier moved this from Untriaged to In Progress in Taichi Lang Dec 20, 2022

strongoier mentioned this issue Dec 20, 2022

[opt] Add pass eliminate_immutable_local_vars #6926

Merged

strongoier mentioned this issue Dec 28, 2022

[opt] Add ImmediateIRModifier to provide amortized constant-time replace_usages_with() #7001

Merged

strongoier closed this as completed Jan 4, 2023

github-project-automation bot moved this from In Progress to Done in Taichi Lang Jan 4, 2023

strongoier mentioned this issue Jan 4, 2023

[opt] Add ExtractPointers pass for dynamic index #7051

Merged
