Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Unify packed mode and non-packed mode #6660

Closed
6 tasks done
strongoier opened this issue Nov 18, 2022 · 4 comments
Closed
6 tasks done

[RFC] Unify packed mode and non-packed mode #6660

strongoier opened this issue Nov 18, 2022 · 4 comments
Assignees
Labels
Milestone

Comments

@strongoier
Copy link
Contributor

strongoier commented Nov 18, 2022

Motivation

Packed mode was introduced to allow users to trade running time for memory usage:

### Packed mode
By default, Taichi implicitly fits a field in a larger buffer with power-of-two dimensions. We take the power-of-two padding convention because it is widely adopted in computer graphics. The design enables fast indexing with bitwise arithmetic and better memory address alignment, while trading off memory occupations.
For example, a `(18, 65)` field is materialized with a `(32, 128)` buffer, which is acceptable. As field size grows, the padding strategy can be exaggeratedly unbearable: `(129, 6553600)` will be expanded to `(256, 6335600)`, which allocates considerable unused blank memory. Therefore, Taichi provides the optional packed mode to allocate buffer that tightly fits the requested field shape. It is especially useful when memory usage is a major concern.
To leverage the packed mode, specify `packed` in `ti.init()` argument:
```python
ti.init() # default: packed=False
a = ti.field(ti.i32, shape=(18, 65)) # padded to (32, 128)
```
```python
ti.init(packed=True)
a = ti.field(ti.i32, shape=(18, 65)) # no padding
```
You might observe mild performance regression with the packed mode due to more complex addressing and memory alignment. Therefore, the packed mode should be specified only when memory capacity is a major concern.

A more detailed introduction can be found at https://github.com/taichi-dev/taichicon/releases/download/taichicon02/Packed.Mode-Effectively.Reducing.Memory.Costs.of.Non-power-of-two.Fields-Yi.Xu.pdf.

However, simply providing users a switch is far from perfect. The default automatic padding behavior is non-intuitive to newcomers, so it is hard for them to diagnose related problems (e.g. 2-10x larger memory usage than expected, or sparsity manipulation in a wrong way) and then find out this feature.

To improve user experience, I propose to unify packed mode and the default non-packed mode - making the memory allocation and coordinate mapping behavior of Taichi SNodes as straightforward to understand as under packed mode but the runtime performance in most cases as optimized as under non-packed mode.

Proposal

The main idea is to try our best to optimize the runtime overhead of packed mode and then make it the default mode.

#6444 has taken the first step. It has made overhead of SNode access under normal cases (no second-time division on one axis) the same for both packed mode and non-packed mode.

After a discussion with @yuanming-hu, we agree that when users make a second-time division on some axis, it is very likely that they are prioritizing runtime performance and in the process of fine-tuning memory access patterns. In this case, it makes no sense to introduce slow operations like mod or div in each access. Therefore, we can add a limitation on the definition of SNodes - the non-first-time division on one axis must be a power of two. With this limitation, we can make the runtime overhead of SNode access always the same for both modes.

The only remaining parts involving coordinate calculation are struct fors. As long as the struct for is operating on a SNode whose path to root contains no non-power-of-two division, we can make the runtime overhead of the struct for the same for both modes. Unfortunately, for the remaining case, where the struct for is operating on a SNode whose path to root does contain some non-power-of-two division, the runtime overhead of the struct for for packed mode will be higher. Luckily, the runtime overhead of a struct for only appears once in each thread and is unlikely to become the performance bottleneck. In the extreme case where it does become the bottleneck, the user can change that non-power-of-two division to power-of-two as an optimization strategy.

Plan

  • Add the limitation above and fix tests violating the limitation.
    • Update: We will make the limitation a warning and leverage the non-optimized version of packed mode if the limitation is violated. In this way, users will not find their code broken and can learn the best practice from the warning gradually.
  • Make the runtime overhead of SNode access always the same for both modes if the limitation is obeyed.
  • Make the runtime overhead of struct for on a SNode whose path to root contains no non-power-of-two division always the same for both modes.
  • Check and add missing packed mode support for non-LLVM backends.
  • Make packed mode as default.
    • Update: In v1.3.0 we will still preserve packed=False as a fallback option in case the new packed mode causes trouble. If things go well, in v1.4.0 we will remove the packed switch and the non-packed mode.
  • Update doc.
@strongoier strongoier added the feature request Suggest an idea on this project label Nov 18, 2022
@strongoier strongoier added RFC and removed feature request Suggest an idea on this project labels Nov 18, 2022
@taichi-gardener taichi-gardener moved this to Untriaged in Taichi Lang Nov 18, 2022
@strongoier strongoier self-assigned this Nov 18, 2022
@strongoier strongoier added this to the v1.3.0 milestone Nov 18, 2022
@PENGUINLIONG PENGUINLIONG moved this from Untriaged to In Progress in Taichi Lang Nov 18, 2022
@jim19930609
Copy link
Contributor

jim19930609 commented Nov 22, 2022

In this case, it makes no sense to introduce slow operations like mod or div in each access

Was wondering what's the approximate diff in performance though. If the performance isn't too bad, would it be better if we turn the power of 2 limitation into a warning for now?

In this way, new bees will get notified if they're concerned with the performance, while advanced users should know what they're doing if they intend to break the limit.

@strongoier
Copy link
Contributor Author

Was wondering what's the approximate diff in performance though.

#6219 gives a ~2x result.

If the performance isn't too bad, would it be better if we turn the power of 2 limitation into a warning for now?
In this way, new bees will get notified if they're concerned with the performance, while advanced users should know what they're doing if they intend to break the limit.

I think the core problem here is whether we want to force or simply advise users to follow the best practice. I agree that giving warnings instead of raising errors is a more gentle solution, which will not break potential existing code and can still achieve the purpose of guiding users towards best practice. @yuanming-hu WDYT?

@ailzhang
Copy link
Contributor

IMHO we could give a deprecation notice in v1.3 saying that we consider enforce power of 2 for second-time division in v1.4, if they have a valid use case they should open an issue to let us know. This way we can know better how heavily this feature is used atm and not limiting ourselves in unknown use cases. wdyt?

@yuanming-hu
Copy link
Member

I agree that giving warnings instead of raising errors is a more gentle solution, which will not break potential existing code and can still achieve the purpose of guiding users towards best practice. @yuanming-hu WDYT?

Sounds good!

strongoier added a commit that referenced this issue Nov 23, 2022
…ower of two (#6690)

Issue: #6660

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
strongoier added a commit that referenced this issue Nov 23, 2022
…d mode (#6709)

Issue: #6660

### Brief Summary

This PR applies the same optimization in #6444 to the
`demote_dense_struct_fors` pass.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
strongoier added a commit that referenced this issue Nov 24, 2022
…6718)

Issue: #6660

### Brief Summary

After this PR, the two main goals in #6660,

> - Make the runtime overhead of SNode access always the same for both
modes if the limitation is obeyed.
> - Make the runtime overhead of struct for on a SNode whose path to
root contains no non-power-of-two division always the same for both
modes.

are both achieved.

See comments in the code for implementation details.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
strongoier added a commit that referenced this issue Nov 25, 2022
Issue: #6660

### Brief Summary

This PR does the following:
- Provides missing part of packed mode support for Metal sparse;
- Removes `ti.extension.packed`;
- Sets `packed=True` by default;
- Fixes illegal tests.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
neozhaoliang pushed a commit that referenced this issue Nov 28, 2022
Issue: #6660

### Brief Summary

We no longer need the concept of packed mode. Now we only have to
suggest users using power-of-two block size in hierarchical fields.
strongoier added a commit that referenced this issue Nov 29, 2022
…6753)

Issue: #6660

### Brief Summary

Apart from the deprecated warning, the previously added limitation that
"non-first division on an axis must be a power of two" is now made
purely a warning because it can already let users learn the best
practice.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
strongoier added a commit to strongoier/taichi that referenced this issue Nov 29, 2022
Issue: taichi-dev#6660

### Brief Summary

We no longer need the concept of packed mode. Now we only have to
suggest users using power-of-two block size in hierarchical fields.
strongoier added a commit to strongoier/taichi that referenced this issue Nov 29, 2022
…aichi-dev#6753)

Issue: taichi-dev#6660

### Brief Summary

Apart from the deprecated warning, the previously added limitation that
"non-first division on an axis must be a power of two" is now made
purely a warning because it can already let users learn the best
practice.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Repository owner moved this from In Progress to Done in Taichi Lang Nov 30, 2022
quadpixels pushed a commit to quadpixels/taichi that referenced this issue May 13, 2023
…ower of two (taichi-dev#6690)

Issue: taichi-dev#6660

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
quadpixels pushed a commit to quadpixels/taichi that referenced this issue May 13, 2023
…d mode (taichi-dev#6709)

Issue: taichi-dev#6660

### Brief Summary

This PR applies the same optimization in taichi-dev#6444 to the
`demote_dense_struct_fors` pass.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
quadpixels pushed a commit to quadpixels/taichi that referenced this issue May 13, 2023
…aichi-dev#6718)

Issue: taichi-dev#6660

### Brief Summary

After this PR, the two main goals in taichi-dev#6660,

> - Make the runtime overhead of SNode access always the same for both
modes if the limitation is obeyed.
> - Make the runtime overhead of struct for on a SNode whose path to
root contains no non-power-of-two division always the same for both
modes.

are both achieved.

See comments in the code for implementation details.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
quadpixels pushed a commit to quadpixels/taichi that referenced this issue May 13, 2023
Issue: taichi-dev#6660

### Brief Summary

This PR does the following:
- Provides missing part of packed mode support for Metal sparse;
- Removes `ti.extension.packed`;
- Sets `packed=True` by default;
- Fixes illegal tests.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
quadpixels pushed a commit to quadpixels/taichi that referenced this issue May 13, 2023
Issue: taichi-dev#6660

### Brief Summary

We no longer need the concept of packed mode. Now we only have to
suggest users using power-of-two block size in hierarchical fields.
quadpixels pushed a commit to quadpixels/taichi that referenced this issue May 13, 2023
…aichi-dev#6753)

Issue: taichi-dev#6660

### Brief Summary

Apart from the deprecated warning, the previously added limitation that
"non-first division on an axis must be a power of two" is now made
purely a warning because it can already let users learn the best
practice.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Done
Development

No branches or pull requests

4 participants