Updating Taichi results in slower compilation speed #6933

Closed
strongoier opened this issue Dec 20, 2022 · 0 comments
strongoier commented Dec 20, 2022

Original user post: https://forum.taichi-lang.cn/t/topic/3710
Script to reproduce: https://github.com/JinliBot7/taichi-version-update-problem

The user reports that the script takes <10s to compile with Taichi v1.1.3 but ~30s with Taichi v1.3.0.

On my MacBook Pro (13-inch, M1, 2020), the script takes ~5s to compile with Taichi v1.1.3 and ~20s at 6f4ce42, which is a similar slowdown.

@strongoier strongoier added the advanced optimization The issue or bug is related to advanced optimization label Dec 20, 2022
@strongoier strongoier self-assigned this Dec 20, 2022
@strongoier strongoier moved this to Untriaged in Taichi Lang Dec 20, 2022
@strongoier strongoier moved this from Untriaged to In Progress in Taichi Lang Dec 20, 2022
strongoier added a commit that referenced this issue Dec 21, 2022
Issue: #6933

### Brief Summary

There are many redundant copies of local vars in the initial IR:
```
  <[Tensor (3, 3) f32]> $128 = [$103, $106, $109, $112, $115, $118, $121, $124, $127]
  $129 : local store [$100 <- $128]
  <[Tensor (3, 3) f32]> $130 = alloca
  $131 = local load [$100]
  $132 : local store [$130 <- $131]
  <[Tensor (3, 3) f32]> $133 = alloca
  $134 = local load [$130]
  $135 : local store [$133 <- $134]
  <[Tensor (3, 3) f32]> $136 = alloca
  $137 = local load [$133]
  $138 : local store [$136 <- $137]
// In fact, `$128` can be used wherever `$136` is loaded.
```

These can come from many places; one of the main sources is the
pass-by-value convention of `ti.func`. The consequence is that the
number of instructions is unnecessarily large, which significantly slows
down compilation.

My solution is to identify and eliminate these redundant instructions
up front, so that all later passes take a much smaller number of
instructions as input. These redundant local vars are essentially
immutable: each is assigned exactly once and is only loaded after that
assignment. This PR adds an optimization pass,
`eliminate_immutable_local_vars`, as the first pass.

(P.S. This PR also fixes the type checking of `MatrixExpression` and
`LocalLoadStmt` so that the pass works properly.)
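
The idea can be illustrated with a minimal Python sketch on a toy straight-line IR. This is not the actual C++ pass: the `Stmt` class, the `"use"` op, and the `$139`/`$140` statements are hypothetical, and the sketch assumes a single block with no control flow. A local var is treated as immutable when it is stored exactly once, never loaded before that store, and never referenced other than as a store destination or load source.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Stmt:
    op: str                      # "alloca", "store", "load", or "use" (toy ops)
    name: str = ""               # result id; empty for stores
    args: Tuple[str, ...] = ()   # store: (dest, value); load: (src,)


def eliminate_immutable_local_vars(block: List[Stmt]) -> List[Stmt]:
    store_count: Dict[str, int] = {}
    stored_value: Dict[str, str] = {}
    escaped: Set[str] = set()    # allocas that must be left alone

    # Pass 1: classify allocas.
    for s in block:
        if s.op == "alloca":
            store_count[s.name] = 0
            continue
        if s.op == "store":
            dest, value = s.args
            store_count[dest] += 1
            stored_value[dest] = value
            other_args: Tuple[str, ...] = (value,)
        elif s.op == "load":
            (src,) = s.args
            if store_count[src] == 0:
                escaped.add(src)  # loaded before any store
            other_args = ()
        else:
            other_args = s.args
        # An alloca referenced as a plain operand is not handled here.
        escaped.update(a for a in other_args if a in store_count)

    immutable = {v for v, n in store_count.items()
                 if n == 1 and v not in escaped}

    # Pass 2: drop alloca/store/load of immutable vars and forward the value.
    replacement: Dict[str, str] = {}

    def resolve(value: str) -> str:
        while value in replacement:
            value = replacement[value]
        return value

    out: List[Stmt] = []
    for s in block:
        if s.op == "alloca" and s.name in immutable:
            continue
        if s.op == "store" and s.args[0] in immutable:
            continue
        if s.op == "load" and s.args[0] in immutable:
            replacement[s.name] = resolve(stored_value[s.args[0]])
            continue
        out.append(Stmt(s.op, s.name, tuple(resolve(a) for a in s.args)))
    return out


# Mirrors the IR excerpt above, plus a hypothetical final load and use.
block = [
    Stmt("alloca", "$100"),
    Stmt("use", "$128", ("$103", "$106")),
    Stmt("store", "", ("$100", "$128")),
    Stmt("alloca", "$130"),
    Stmt("load", "$131", ("$100",)),
    Stmt("store", "", ("$130", "$131")),
    Stmt("alloca", "$133"),
    Stmt("load", "$134", ("$130",)),
    Stmt("store", "", ("$133", "$134")),
    Stmt("alloca", "$136"),
    Stmt("load", "$137", ("$133",)),
    Stmt("store", "", ("$136", "$137")),
    Stmt("load", "$139", ("$136",)),
    Stmt("use", "$140", ("$139",)),
]
optimized = eliminate_immutable_local_vars(block)
for stmt in optimized:
    print(stmt)
```

In this toy run only the definition of `$128` and the final use survive, and the use reads `$128` directly.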

Let's study the effects in two cases: #6933 and
[voxel-rt2](https://github.com/taichi-dev/voxel-rt2/blob/main/example7.py).

First, let's compare the number of instructions after the `scalarize`
pass (which runs immediately after the first pass).

| Kernel | Before this PR | After this PR | Rate of decrease |
| ------ | ------ | ------ | ------ |
| `test` (#6933) | 45859  | 26452 | 42% |
| `spatial_GRIS` (voxel-rt2) | 48519 | 17713 | 63% |

Then, let's compare the total time of `compile()`.

| Case | Before this PR | After this PR | Rate of decrease |
| ------ | ------ | ------ | ------ |
| #6933 | 20.622s | 8.550s | 59% |
| voxel-rt2  | 27.676s  | 9.495s | 66% |
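
For reference, the "Rate of decrease" columns follow from `(before - after) / before`. A small Python check using only the numbers in the two tables above (the helper name `rate_of_decrease` is just for illustration):

```python
def rate_of_decrease(before: float, after: float) -> int:
    """Relative decrease from `before` to `after`, as a rounded percentage."""
    return round(100 * (before - after) / before)


# (before, after) pairs copied from the tables above.
measurements = {
    "test instructions (#6933)": (45859, 26452),
    "spatial_GRIS instructions (voxel-rt2)": (48519, 17713),
    "compile() time in s (#6933)": (20.622, 8.550),
    "compile() time in s (voxel-rt2)": (27.676, 9.495),
}

for label, (before, after) in measurements.items():
    print(f"{label}: {rate_of_decrease(before, after)}%")
```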

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
strongoier added a commit that referenced this issue Dec 30, 2022
…ace_usages_with() (#7001)

Issue: #6933

### Brief Summary

`ImmediateIRModifier` aims to replace `Stmt::replace_usages_with`,
which visits the whole tree for a single replacement.
`ImmediateIRModifier` is currently associated with a pass: it visits the
whole tree once at the beginning of that pass and then performs each
replacement in amortized constant time. It is now used in the two most
recent passes, `eliminate_immutable_local_vars` and `scalarize`. More
passes can be modified to leverage it in the future.
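
The difference between the two strategies can be sketched in Python. This is only an illustration of the general technique (precomputing a usage index), not the C++ implementation: `Stmt`, `replace_usages_naive`, and `UsageIndex` are hypothetical names, and the real class may track usages differently.

```python
from collections import defaultdict
from typing import DefaultDict, List, Optional, Tuple


class Stmt:
    """Toy IR statement: a name plus a list of operand statements."""

    def __init__(self, name: str, operands: Optional[List["Stmt"]] = None):
        self.name = name
        self.operands: List[Stmt] = list(operands or [])


def replace_usages_naive(all_stmts: List[Stmt], old: Stmt, new: Stmt) -> None:
    """Visits every statement for a single replacement."""
    for stmt in all_stmts:
        stmt.operands = [new if op is old else op for op in stmt.operands]


class UsageIndex:
    """Visits all statements once, then replaces via a usage map."""

    def __init__(self, all_stmts: List[Stmt]):
        # operand -> list of (user statement, operand position)
        self._users: DefaultDict[Stmt, List[Tuple[Stmt, int]]] = defaultdict(list)
        for stmt in all_stmts:
            for i, op in enumerate(stmt.operands):
                self._users[op].append((stmt, i))

    def replace_usages_with(self, old: Stmt, new: Stmt) -> None:
        # Cost is proportional to the number of usages of `old`,
        # not to the size of the whole IR.
        for user, i in self._users.pop(old, []):
            user.operands[i] = new
            self._users[new].append((user, i))  # keep the index up to date


a, b, e = Stmt("a"), Stmt("b"), Stmt("e")
c = Stmt("c", [a, a])
d = Stmt("d", [c, a])
index = UsageIndex([a, b, c, d, e])
index.replace_usages_with(a, b)   # c and d now read b instead of a
index.replace_usages_with(b, e)   # chained replacement still works
print([op.name for op in c.operands], [op.name for op in d.operands])
```

Because the index is updated on every replacement, later replacements of the new statement still find all of its users without another traversal.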

After this PR, profiling the script in #6933 shows that the time of
`eliminate_immutable_local_vars` drops from `0.956 s` to `0.162 s`,
the time of `scalarize` drops from `3.510 s` to `0.478 s`, and the
total time of `compile` drops from `8.550 s` to `4.696 s`.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@github-project-automation github-project-automation bot moved this from In Progress to Done in Taichi Lang Jan 4, 2023
strongoier added a commit that referenced this issue Jan 5, 2023
Issue: #2590

### Brief Summary

Under the pure `dynamic_index` setting, `MatrixPtrStmt`s are not
scalarized. This produces `2n` more instructions (`n` `ConstStmt`s and
`n` `MatrixPtrStmt`s) than the scalarized setting, where `n` is the
number of usages of `MatrixPtrStmt`s. This PR adds an `ExtractPointers`
pass to eliminate all the redundant instructions. See comments in the
code for details.
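
As a rough illustration of where the `2n` instructions come from, here is a Python sketch on a toy IR. It shows one plausible way to remove this kind of redundancy, by keeping a single constant per literal and a single pointer per (base, constant index) pair. That is an assumption for illustration and not necessarily what `ExtractPointers` does; the tuple encoding, the function name `dedupe_const_pointers`, and all statement names are hypothetical.

```python
from typing import Dict, List, Tuple

Stmt = Tuple[str, ...]  # (op, result, *args) in a toy straight-line IR


def dedupe_const_pointers(block: List[Stmt]) -> List[Stmt]:
    const_value: Dict[str, str] = {}            # canonical const name -> literal
    canon_const: Dict[str, str] = {}            # literal -> first const name
    canon_ptr: Dict[Tuple[str, str], str] = {}  # (base, literal) -> first pointer
    rename: Dict[str, str] = {}                 # removed name -> surviving name
    out: List[Stmt] = []

    for op, result, *args in block:
        if op == "const":
            (literal,) = args
            if literal in canon_const:
                rename[result] = canon_const[literal]  # duplicate constant
                continue
            canon_const[literal] = result
            const_value[result] = literal
            out.append((op, result, literal))
            continue

        args = [rename.get(a, a) for a in args]
        if op == "matrix_ptr" and args[1] in const_value:
            key = (args[0], const_value[args[1]])
            if key in canon_ptr:
                rename[result] = canon_ptr[key]        # duplicate pointer
                continue
            canon_ptr[key] = result
        out.append((op, result, *args))
    return out


block = [
    ("alloca", "$m"),
    ("const", "$c0", "0"),
    ("matrix_ptr", "$p0", "$m", "$c0"),
    ("load", "$v0", "$p0"),
    ("const", "$c1", "0"),                # redundant ConstStmt
    ("matrix_ptr", "$p1", "$m", "$c1"),   # redundant MatrixPtrStmt
    ("load", "$v1", "$p1"),
    ("matrix_ptr", "$p2", "$m", "$i"),    # dynamic index: left untouched
]
deduped = dedupe_const_pointers(block)
for stmt in deduped:
    print(stmt)
```

Each repeated constant-index access in the toy input contributes one constant and one pointer, and both disappear after deduplication, while the dynamically indexed pointer is kept.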

After this PR, the number of instructions after the `scalarize()` pass
for the script in #6933 under dynamic index drops from 49589 to 26581,
and the compilation time drops from 20.02s to 7.82s.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>