Updating Taichi results in slower compilation speed #6933
Labels
advanced optimization
The issue or bug is related to advanced optimization
Comments
strongoier added the advanced optimization label on Dec 20, 2022
strongoier added a commit that referenced this issue on Dec 21, 2022
Issue: #6933

### Brief Summary

There are many redundant copies of local vars in the initial IR:

```
<[Tensor (3, 3) f32]> $128 = [$103, $106, $109, $112, $115, $118, $121, $124, $127]
$129 : local store [$100 <- $128]
<[Tensor (3, 3) f32]> $130 = alloca
$131 = local load [$100]
$132 : local store [$130 <- $131]
<[Tensor (3, 3) f32]> $133 = alloca
$134 = local load [$130]
$135 : local store [$133 <- $134]
<[Tensor (3, 3) f32]> $136 = alloca
$137 = local load [$133]
$138 : local store [$136 <- $137]
// In fact, `$128` can be used wherever `$136` is loaded.
```

These can come from many places; one of the main sources is the pass-by-value convention of `ti.func`. The consequence is that the number of instructions is unnecessarily large, which significantly slows down compilation. My solution here is to identify and eliminate such redundant instructions at the very beginning, so that all later passes take a much smaller number of instructions as input.

These redundant local vars are essentially immutable ones: they are assigned only once and are only loaded after the assignment. In this PR, I add an optimization pass `eliminate_immutable_local_vars` as the first pass. (P.S. The type check processes of `MatrixExpression` and `LocalLoadStmt` are also fixed along the way to make the pass work properly.)

Let's study the effects in two cases: #6933 and [voxel-rt2](https://github.com/taichi-dev/voxel-rt2/blob/main/example7.py).

First, let's compare the number of instructions after the `scalarization` pass (which happens immediately after the first pass).

| Kernel | Before this PR | After this PR | Rate of decrease |
| ------ | ------ | ------ | ------ |
| `test` (#6933) | 45859 | 26452 | 42% |
| `spatial_GRIS` (voxel-rt2) | 48519 | 17713 | 63% |

Then, let's compare the total time of `compile()`.

| Case | Before this PR | After this PR | Rate of decrease |
| ------ | ------ | ------ | ------ |
| #6933 | 20.622s | 8.550s | 59% |
| voxel-rt2 | 27.676s | 9.495s | 66% |

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
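To make the idea concrete, here is a minimal sketch of the elimination on a toy, straight-line IR. It is not the actual `eliminate_immutable_local_vars` implementation: the `Stmt` class, the opcode strings, and the position-based "loaded only after the store" check are all assumptions made for illustration, and real IR with control flow needs a more careful analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stmt:
    """Toy statement (hypothetical): `$id = op(operands)`."""
    id: int
    op: str                 # "alloca", "store", "load", or any other opcode
    operands: tuple = ()    # store: (alloca, value); load: (alloca,)


def eliminate_immutable_local_vars(stmts):
    """Forward the stored value to every load of an immutable local var."""
    stores, loads, other_uses = {}, {}, set()
    for pos, s in enumerate(stmts):
        if s.op == "store":
            stores.setdefault(s.operands[0], []).append(pos)
            other_uses.add(s.operands[1])
        elif s.op == "load":
            loads.setdefault(s.operands[0], []).append(pos)
        else:
            other_uses.update(s.operands)

    # Immutable: exactly one store, every load comes after it, and the
    # alloca is never used in any other way.
    immutable = set()
    for s in stmts:
        if s.op != "alloca" or s.id in other_uses:
            continue
        st = stores.get(s.id, [])
        if len(st) == 1 and all(p > st[0] for p in loads.get(s.id, [])):
            immutable.add(s.id)

    replacement = {}    # removed load id -> id of the value it stands for
    stored_value = {}   # immutable alloca id -> id of the stored value
    out = []
    for s in stmts:
        # Rewrite operands first so chains of copies collapse in one sweep.
        ops = tuple(replacement.get(o, o) for o in s.operands)
        if s.op == "alloca" and s.id in immutable:
            continue
        if s.op == "store" and ops[0] in immutable:
            stored_value[ops[0]] = ops[1]
            continue
        if s.op == "load" and ops[0] in immutable:
            replacement[s.id] = stored_value[ops[0]]
            continue
        out.append(Stmt(s.id, s.op, ops))
    return out


# The chain from the IR dump above, followed by one hypothetical use of `$136`.
chain = [
    Stmt(100, "alloca"), Stmt(128, "matrix_init"), Stmt(129, "store", (100, 128)),
    Stmt(130, "alloca"), Stmt(131, "load", (100,)), Stmt(132, "store", (130, 131)),
    Stmt(133, "alloca"), Stmt(134, "load", (130,)), Stmt(135, "store", (133, 134)),
    Stmt(136, "alloca"), Stmt(137, "load", (133,)), Stmt(138, "store", (136, 137)),
    Stmt(139, "load", (136,)), Stmt(140, "print", (139,)),
]
optimized = eliminate_immutable_local_vars(chain)
```

In this toy case only `$128` and the final use survive, and the use refers to `$128` directly, which matches the comment at the end of the IR dump.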
strongoier added a commit that referenced this issue on Dec 30, 2022
…ace_usages_with() (#7001)

Issue: #6933

### Brief Summary

`ImmediateIRModifier` aims at replacing `Stmt::replace_usages_with`, which visits the whole tree for a single replacement. `ImmediateIRModifier` is currently associated with a pass, visits the whole tree once at the beginning of that pass, and performs a single replacement in amortized constant time. It is now used in the two most recent passes, `eliminate_immutable_local_vars` and `scalarize`. More passes can be modified to leverage it in the future.

After this PR, the profiling result of the script in #6933 shows that the time of `eliminate_immutable_local_vars` drops from `0.956 s` to `0.162 s`, the time of `scalarize` drops from `3.510 s` to `0.478 s`, and the total time of `compile` drops from `8.550 s` to `4.696 s`.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
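The difference between the two strategies can be sketched as follows. This is only an illustration of the general "index the uses once, then patch them directly" idea; the `Stmt` and `UseIndexModifier` classes below are hypothetical and do not mirror the real `ImmediateIRModifier` interface.

```python
from collections import defaultdict


class Stmt:
    """Toy statement (hypothetical); operands are references to other Stmts."""
    def __init__(self, name, operands=()):
        self.name = name
        self.operands = list(operands)


def naive_replace_usages(all_stmts, old, new):
    """Scan the whole IR for one replacement: cost grows with the IR size."""
    for s in all_stmts:
        for i, op in enumerate(s.operands):
            if op is old:
                s.operands[i] = new


class UseIndexModifier:
    """Visit the IR once up front, then patch only the recorded users."""
    def __init__(self, all_stmts):
        # stmt -> list of (user statement, operand index)
        self.users = defaultdict(list)
        for s in all_stmts:
            for i, op in enumerate(s.operands):
                self.users[op].append((s, i))

    def replace_usages_with(self, old, new):
        # Work is proportional to the number of uses of `old`, not to the IR size.
        for user, i in self.users.pop(old, []):
            user.operands[i] = new
            self.users[new].append((user, i))  # keep the index valid for later calls


a, b, e = Stmt("a"), Stmt("b"), Stmt("e")
c = Stmt("c", [a])
d = Stmt("d", [a, c])
modifier = UseIndexModifier([a, b, c, d])
modifier.replace_usages_with(a, b)   # c and d now refer to b
modifier.replace_usages_with(b, e)   # the updated index makes chained replacements work
```

A pass that performs many replacements pays for the initial traversal once, which is where the savings over `naive_replace_usages` would come from in this sketch.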
strongoier added a commit that referenced this issue on Jan 5, 2023
Issue: #2590

### Brief Summary

Under the pure `dynamic_index` setting, `MatrixPtrStmt`s are not scalarized. This actually produces `2n` more instructions (`n` `ConstStmt`s and `n` `MatrixPtrStmt`s) than the scalarized setting, where `n` is the number of usages of `MatrixPtrStmt`s. This PR adds an `ExtractPointers` pass to eliminate all the redundant instructions. See comments in the code for details.

After this PR, the number of instructions after the `scalarize()` pass of the script in #6933 under dynamic index drops from 49589 to 26581, and the compilation time drops from 20.02s to 7.82s.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
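The `2n` count can be illustrated with a toy lowering in which every element access emits its own constant and its own pointer. The second function shows one generic way to share those pairs by caching them per offset. Both functions and the tuple-based IR are assumptions for illustration only; this is not the actual `ExtractPointers` pass, whose details are in the code comments mentioned above.

```python
def emit_element_accesses(base, offsets):
    """Naive lowering: each access emits a fresh const and a fresh pointer."""
    ir = []
    for off in offsets:
        const_id = len(ir)
        ir.append(("const", off))
        ir.append(("matrix_ptr", base, const_id))
    return ir


def emit_shared_element_accesses(base, offsets):
    """Emit one (const, matrix_ptr) pair per distinct offset and reuse it."""
    ir = []
    ptr_for_offset = {}   # offset -> index of the matrix_ptr that addresses it
    for off in offsets:
        if off not in ptr_for_offset:
            const_id = len(ir)
            ir.append(("const", off))
            ptr_for_offset[off] = len(ir)
            ir.append(("matrix_ptr", base, const_id))
    return ir, ptr_for_offset


accesses = [0, 1, 0, 2, 1, 0]                      # n = 6 usages, 3 distinct offsets
naive_ir = emit_element_accesses("m", accesses)
shared_ir, ptrs = emit_shared_element_accesses("m", accesses)
print(len(naive_ir), len(shared_ir))
```

With `n = 6` usages the naive lowering emits `2n = 12` instructions, while the shared version emits two per distinct offset.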
feisuzhu pushed a commit to feisuzhu/taichi that referenced this issue on Jan 5, 2023
feisuzhu pushed a commit to feisuzhu/taichi that referenced this issue on Jan 5, 2023
quadpixels pushed a commit to quadpixels/taichi that referenced this issue on May 13, 2023
quadpixels pushed a commit to quadpixels/taichi that referenced this issue on May 13, 2023
quadpixels pushed a commit to quadpixels/taichi that referenced this issue on May 13, 2023
Original user post: https://forum.taichi-lang.cn/t/topic/3710
Script to reproduce: https://github.com/JinliBot7/taichi-version-update-problem
The user reports that the script takes <10s to compile with Taichi v1.1.3 but ~30s with Taichi v1.3.0.
On my MacBook Pro (13-inch, M1, 2020), the script takes ~5s to compile with Taichi v1.1.3 and ~20s at 6f4ce42, which is a similar slowdown.
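For anyone who wants to reproduce such numbers without a profiler, a rough approach is to compare the first call of a kernel with a later call, since Taichi compiles a kernel when it is first invoked. The helper below is a generic sketch under that assumption; `fake_kernel` is a stand-in that only simulates the compile delay, and a real measurement would pass the actual kernel from the reproduction script instead.

```python
import time


def time_first_and_later_call(fn, *args):
    """Return (first_call_seconds, second_call_seconds) for fn(*args)."""
    start = time.perf_counter()
    fn(*args)                      # includes JIT compilation, if any
    middle = time.perf_counter()
    fn(*args)                      # should reuse the compiled code
    end = time.perf_counter()
    return middle - start, end - middle


# Stand-in for a JIT-compiled kernel (hypothetical): slow on the first call only.
_compiled = False


def fake_kernel():
    global _compiled
    if not _compiled:
        time.sleep(0.05)           # pretend to compile
        _compiled = True


first, later = time_first_and_later_call(fake_kernel)
print(f"first call: {first:.3f}s, later call: {later:.6f}s")
```

The difference between the two timings approximates the compile time only when the kernel's own run time is small by comparison.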