[RFC] Unify packed mode and non-packed mode #6660

strongoier · 2022-11-18T06:59:25Z

Motivation

Packed mode was introduced to allow users to trade running time for memory usage:

taichi/docs/lang/articles/basic/layout.md

Lines 362 to 380 in db6ab0d

    
           ### Packed mode 
        
           By default, Taichi implicitly fits a field in a larger buffer with power-of-two dimensions. We take the power-of-two padding convention because it is widely adopted in computer graphics. The design enables fast indexing with bitwise arithmetic and better memory address alignment, while trading off memory occupations. 
        
           For example, a `(18, 65)` field is materialized with a `(32, 128)` buffer, which is acceptable. As field size grows, the padding strategy can be exaggeratedly unbearable: `(129, 6553600)` will be expanded to `(256, 6335600)`, which allocates considerable unused blank memory. Therefore, Taichi provides the optional packed mode to allocate buffer that tightly fits the requested field shape. It is especially useful when memory usage is a major concern. 
        
           To leverage the packed mode, specify `packed` in `ti.init()` argument: 
        
           ```python 
        
           ti.init()  # default: packed=False 
        
           a = ti.field(ti.i32, shape=(18, 65))  # padded to (32, 128) 
        
           ``` 
        
           ```python 
        
           ti.init(packed=True) 
        
           a = ti.field(ti.i32, shape=(18, 65))  # no padding 
        
           ``` 
        
           You might observe mild performance regression with the packed mode due to more complex addressing and memory alignment. Therefore, the packed mode should be specified only when memory capacity is a major concern.

A more detailed introduction can be found at https://github.com/taichi-dev/taichicon/releases/download/taichicon02/Packed.Mode-Effectively.Reducing.Memory.Costs.of.Non-power-of-two.Fields-Yi.Xu.pdf.

However, simply providing users a switch is far from perfect. The default automatic padding behavior is non-intuitive to newcomers, so it is hard for them to diagnose related problems (e.g. 2-10x larger memory usage than expected, or sparsity manipulation in a wrong way) and then find out this feature.

To improve user experience, I propose to unify packed mode and the default non-packed mode - making the memory allocation and coordinate mapping behavior of Taichi SNodes as straightforward to understand as under packed mode but the runtime performance in most cases as optimized as under non-packed mode.

Proposal

The main idea is to try our best to optimize the runtime overhead of packed mode and then make it the default mode.

#6444 has taken the first step. It has made overhead of SNode access under normal cases (no second-time division on one axis) the same for both packed mode and non-packed mode.

After a discussion with @yuanming-hu, we agree that when users make a second-time division on some axis, it is very likely that they are prioritizing runtime performance and in the process of fine-tuning memory access patterns. In this case, it makes no sense to introduce slow operations like mod or div in each access. Therefore, we can add a limitation on the definition of SNodes - the non-first-time division on one axis must be a power of two. With this limitation, we can make the runtime overhead of SNode access always the same for both modes.

The only remaining parts involving coordinate calculation are struct fors. As long as the struct for is operating on a SNode whose path to root contains no non-power-of-two division, we can make the runtime overhead of the struct for the same for both modes. Unfortunately, for the remaining case, where the struct for is operating on a SNode whose path to root does contain some non-power-of-two division, the runtime overhead of the struct for for packed mode will be higher. Luckily, the runtime overhead of a struct for only appears once in each thread and is unlikely to become the performance bottleneck. In the extreme case where it does become the bottleneck, the user can change that non-power-of-two division to power-of-two as an optimization strategy.

Plan

Add the limitation above and fix tests violating the limitation.
- Update: We will make the limitation a warning and leverage the non-optimized version of packed mode if the limitation is violated. In this way, users will not find their code broken and can learn the best practice from the warning gradually.
Make the runtime overhead of SNode access always the same for both modes if the limitation is obeyed.
Make the runtime overhead of struct for on a SNode whose path to root contains no non-power-of-two division always the same for both modes.
Check and add missing packed mode support for non-LLVM backends.
Make packed mode as default.
- Update: In v1.3.0 we will still preserve packed=False as a fallback option in case the new packed mode causes trouble. If things go well, in v1.4.0 we will remove the packed switch and the non-packed mode.
Update doc.

The text was updated successfully, but these errors were encountered:

jim19930609 · 2022-11-22T02:45:10Z

In this case, it makes no sense to introduce slow operations like mod or div in each access

Was wondering what's the approximate diff in performance though. If the performance isn't too bad, would it be better if we turn the power of 2 limitation into a warning for now?

In this way, new bees will get notified if they're concerned with the performance, while advanced users should know what they're doing if they intend to break the limit.

strongoier · 2022-11-22T03:42:43Z

Was wondering what's the approximate diff in performance though.

#6219 gives a ~2x result.

If the performance isn't too bad, would it be better if we turn the power of 2 limitation into a warning for now?
In this way, new bees will get notified if they're concerned with the performance, while advanced users should know what they're doing if they intend to break the limit.

I think the core problem here is whether we want to force or simply advise users to follow the best practice. I agree that giving warnings instead of raising errors is a more gentle solution, which will not break potential existing code and can still achieve the purpose of guiding users towards best practice. @yuanming-hu WDYT?

ailzhang · 2022-11-22T03:49:38Z

IMHO we could give a deprecation notice in v1.3 saying that we consider enforce power of 2 for second-time division in v1.4, if they have a valid use case they should open an issue to let us know. This way we can know better how heavily this feature is used atm and not limiting ourselves in unknown use cases. wdyt?

yuanming-hu · 2022-11-22T06:36:42Z

I agree that giving warnings instead of raising errors is a more gentle solution, which will not break potential existing code and can still achieve the purpose of guiding users towards best practice. @yuanming-hu WDYT?

Sounds good!

…ower of two (#6690) Issue: #6660 Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

…d mode (#6709) Issue: #6660 ### Brief Summary This PR applies the same optimization in #6444 to the `demote_dense_struct_fors` pass. Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

…6718) Issue: #6660 ### Brief Summary After this PR, the two main goals in #6660, > - Make the runtime overhead of SNode access always the same for both modes if the limitation is obeyed. > - Make the runtime overhead of struct for on a SNode whose path to root contains no non-power-of-two division always the same for both modes. are both achieved. See comments in the code for implementation details. Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Issue: #6660 ### Brief Summary This PR does the following: - Provides missing part of packed mode support for Metal sparse; - Removes `ti.extension.packed`; - Sets `packed=True` by default; - Fixes illegal tests. Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Issue: #6660 ### Brief Summary We no longer need the concept of packed mode. Now we only have to suggest users using power-of-two block size in hierarchical fields.

…6753) Issue: #6660 ### Brief Summary Apart from the deprecated warning, the previously added limitation that "non-first division on an axis must be a power of two" is now made purely a warning because it can already let users learn the best practice. Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Issue: taichi-dev#6660 ### Brief Summary We no longer need the concept of packed mode. Now we only have to suggest users using power-of-two block size in hierarchical fields.

…aichi-dev#6753) Issue: taichi-dev#6660 ### Brief Summary Apart from the deprecated warning, the previously added limitation that "non-first division on an axis must be a power of two" is now made purely a warning because it can already let users learn the best practice. Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

…ower of two (taichi-dev#6690) Issue: taichi-dev#6660 Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

…d mode (taichi-dev#6709) Issue: taichi-dev#6660 ### Brief Summary This PR applies the same optimization in taichi-dev#6444 to the `demote_dense_struct_fors` pass. Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

…aichi-dev#6718) Issue: taichi-dev#6660 ### Brief Summary After this PR, the two main goals in taichi-dev#6660, > - Make the runtime overhead of SNode access always the same for both modes if the limitation is obeyed. > - Make the runtime overhead of struct for on a SNode whose path to root contains no non-power-of-two division always the same for both modes. are both achieved. See comments in the code for implementation details. Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Issue: taichi-dev#6660 ### Brief Summary This PR does the following: - Provides missing part of packed mode support for Metal sparse; - Removes `ti.extension.packed`; - Sets `packed=True` by default; - Fixes illegal tests. Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Issue: taichi-dev#6660 ### Brief Summary We no longer need the concept of packed mode. Now we only have to suggest users using power-of-two block size in hierarchical fields.

…aichi-dev#6753) Issue: taichi-dev#6660 ### Brief Summary Apart from the deprecated warning, the previously added limitation that "non-first division on an axis must be a power of two" is now made purely a warning because it can already let users learn the best practice. Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

strongoier added the feature request Suggest an idea on this project label Nov 18, 2022

taichi-gardener added this to Taichi Lang Nov 18, 2022

strongoier added RFC and removed feature request Suggest an idea on this project labels Nov 18, 2022

taichi-gardener moved this to Untriaged in Taichi Lang Nov 18, 2022

strongoier self-assigned this Nov 18, 2022

strongoier added this to the v1.3.0 milestone Nov 18, 2022

PENGUINLIONG moved this from Untriaged to In Progress in Taichi Lang Nov 18, 2022

strongoier mentioned this issue Nov 21, 2022

[Lang] Limit non-first division of an axis on a SNodeTree path to a power of two #6690

Merged

strongoier mentioned this issue Nov 22, 2022

[opt] Eliminate redundant mod in demote_dense_struct_fors under packed mode #6709

Merged

strongoier added a commit that referenced this issue Nov 23, 2022

[Lang] Limit non-first division of an axis on a SNodeTree path to a p…

91b6735

…ower of two (#6690) Issue: #6660 Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

strongoier mentioned this issue Nov 23, 2022

[opt] Turn mod/div into bit_and/bit_shr if possible in packed mode #6718

Merged

strongoier mentioned this issue Nov 24, 2022

[Lang] Enable packed mode by default #6721

Merged

This was referenced Nov 28, 2022

[Lang] Add deprecation warning for the removal of the packed switch #6753

Merged

[Doc] Stop mentioning packed mode #6755

Merged

neozhaoliang pushed a commit that referenced this issue Nov 28, 2022

[Doc] Stop mentioning packed mode (#6755)

ee27518

Issue: #6660 ### Brief Summary We no longer need the concept of packed mode. Now we only have to suggest users using power-of-two block size in hierarchical fields.

strongoier closed this as completed Nov 30, 2022

Repository owner moved this from In Progress to Done in Taichi Lang Nov 30, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Unify packed mode and non-packed mode #6660

[RFC] Unify packed mode and non-packed mode #6660

strongoier commented Nov 18, 2022 •

edited

Loading

jim19930609 commented Nov 22, 2022 •

edited

Loading

strongoier commented Nov 22, 2022

ailzhang commented Nov 22, 2022

yuanming-hu commented Nov 22, 2022

[RFC] Unify packed mode and non-packed mode #6660

[RFC] Unify packed mode and non-packed mode #6660

Comments

strongoier commented Nov 18, 2022 • edited Loading

Motivation

Proposal

Plan

jim19930609 commented Nov 22, 2022 • edited Loading

strongoier commented Nov 22, 2022

ailzhang commented Nov 22, 2022

yuanming-hu commented Nov 22, 2022

strongoier commented Nov 18, 2022 •

edited

Loading

jim19930609 commented Nov 22, 2022 •

edited

Loading