
Support basic sparsity SNode on Metal #593

Closed
k-ye opened this issue Mar 14, 2020 · 8 comments


k-ye commented Mar 14, 2020

Concisely describe the proposed feature

I'd like to support bitmasked() on the Metal backend. This decorator is the easiest sparsity feature to support, because it does not require dynamic memory allocation on the device side.

Describe the solution you'd like (if any)

I have a seemingly working solution in https://github.com/k-ye/taichi/tree/mtlbit. It seems to work on a toy example, but I need more tests (it may be good to see whether it can handle a modified taichi_sparse.py).

I tried to follow LLVM's runtime system as much as possible, e.g. the runtime's listgen and the SNode's activate, is_active, etc. One thing that trips me up is that it is still not entirely clear to me how coordinate refinement works.
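For concreteness, here is a minimal sketch of what the bitmasked activate / is_active primitives boil down to, assuming one activation bit per cell packed into 32-bit words (the names and signatures are illustrative, not the actual runtime API):

    #include <atomic>
    #include <cstdint>

    // Illustrative only: one activation bit per cell, packed into u32 words.
    inline bool is_active(const uint32_t *bitmask, int i) {
      return (bitmask[i >> 5] >> (i & 31)) & 1u;
    }

    inline void activate(std::atomic<uint32_t> *bitmask, int i) {
      // Atomic OR, so concurrent threads can activate cells in the same word.
      bitmask[i >> 5].fetch_or(1u << (i & 31), std::memory_order_relaxed);
    }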

That said, I think I found a simpler approach that fits the Metal backend's current implementation. Instead of storing an Element inside the list for each SNode:

    struct Element {
      Ptr element;
      int loop_bounds[2];
      PhysicalCoordinates pcoord;
    };

The Element for the Metal backend looks like this:

    struct ListgenElement {
      int32_t loop_index = 0;
      int32_t root_mem_offset = 0;  // used by is_active()
    };

Inside a struct_for kernel, we iterate through the ListManager of that SNode, and loop_index here can be used as the "thread_id". The rest of the stmt IRs already know how to map this loop_index to the correct leaf node through a series of OffsetAndExtractBitsStmt and SNodeLookupStmt. Do you see any problem with this approach?
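To make that concrete, a rough sketch of the dispatch (the ListManager interface below is hypothetical; the real one in my branch may differ):

    #include <cstdint>
    #include <vector>

    struct ListgenElement {
      int32_t loop_index = 0;
      int32_t root_mem_offset = 0;  // used by is_active()
    };

    // Hypothetical list interface: one element per (potentially) active cell.
    struct ListManager {
      std::vector<ListgenElement> elements;
    };

    // One GPU thread per list element. kernel_body stands in for the generated
    // struct_for body, which lowers loop_index to the leaf cell through the
    // existing OffsetAndExtractBitsStmt / SNodeLookupStmt chain.
    template <typename KernelBody>
    void struct_for_task(const ListManager &list, int thread_id,
                         KernelBody kernel_body) {
      if (thread_id >= static_cast<int>(list.elements.size())) return;
      const ListgenElement &e = list.elements[thread_id];
      kernel_body(e.loop_index, e.root_mem_offset);
    }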

Additional comments

  • The entire runtime's structs and methods are emitted as strings, which makes them very painful to maintain... I'll probably switch to a source-file-based solution eventually.
  • Be super careful about the coronavirus & please stay at home!!
@k-ye added the feature request label Mar 14, 2020
@k-ye self-assigned this Mar 14, 2020
@k-ye added the mac label Mar 14, 2020
yuanming-hu (Member) commented:

Inside a struct_for kernel, we iterate through the ListManager of that SNode, and loop_index here can be used as the "thread_id". The rest of the stmt IRs already know how to map this loop_index to the correct leaf node through a series of OffsetAndExtractBitsStmt and SNodeLookupStmt. Do you see any problem with this approach?

I think it's a space-time-scalability tradeoff issue:

  • The LLVM runtime approach stores an i32 per coordinate. Currently, Taichi supports up to 8-D tensors, so in the extreme case it can address 2^(32*8) = 2^256 (background) voxels. The downside is more storage/bandwidth consumption per list element.
  • Your Metal implementation saves space, but limits the total number of background voxels to 2^32. The struct_for loops will also have to recompute the physical coordinates from the compressed i32 loop_index (see the sketch after this list).
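For example, recovering one level's coordinate from the compressed i32 costs a shift and a mask per level (an illustrative sketch, assuming all extents are padded to powers of two; the Extractor layout here is hypothetical):

    #include <cstdint>

    // Hypothetical per-SNode metadata: where this level's bits live inside
    // the compressed loop_index (assumes POT-padded extents).
    struct Extractor {
      int start;     // bit offset of this level's coordinate
      int num_bits;  // log2 of this level's padded extent
    };

    inline int extract_coord(int32_t loop_index, Extractor e) {
      return (loop_index >> e.start) & ((1 << e.num_bits) - 1);
    }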

My suggestion is to stick to your Metal implementation for now, and gradually switch to the LLVM style for consistent behavior on different backends.

Meanwhile, I'll document the confusing refine_coordinates function (#597).


k-ye commented Mar 16, 2020

My suggestion is to stick to your Metal implementation for now, and gradually switch to the LLVM style for consistent behavior on different backends.

I've implemented refine_coordinates now that I understand what it's doing 😄

Here's the ported taichi_sparse.py using bitmasked().

[GIF: ported taichi_sparse demo running on Metal]

It looks a bit different from the original example; note the screen-burning effect (ghosting/afterimage). I'm not sure whether this is due to the difference between bitmasked and pointer, or a bug in my implementation. However, when I switched to x64 and ran it again, it just crashed...

I'll do some cleanup and break the implementation down for review.


BTW, I have a question about non-power-of-two sizes. Say we have the following hierarchy:

root.dense(ti.i, 3).dense(ti.i, 5)

Inside Taichi, it will be padded to powers of two (POT), i.e. S1.n=4, S2.n=8.

Then, for index 11, logically it should belong to (S1@2, S2@1), because 2 * 5 + 1 = 11. However, due to the padding, it is actually located at (S1@1, S2@3) (1 * 8 + 3 = 11). Is this expected?

yuanming-hu (Member) commented:

Then, for index 11, logically it should belong to (S1@2, S2@1), because 2 * 5 + 1 = 11. However, due to the padding, it is actually located at (S1@1, S2@3) (1 * 8 + 3 = 11). Is this expected?

Yes, it is expected. Note that refine_coordinates requires all of its operands to be powers of two. This is for performance reasons; otherwise, we'd end up with very expensive integer division and modulo...
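Concretely, with S2 padded to 8, the padded split is a shift and a mask, while the logical split would need division (a small illustrative snippet):

    #include <cassert>

    int main() {
      const int i = 11;
      // POT-padded decomposition: a shift and a mask.
      assert((i >> 3) == 1 && (i & 7) == 3);  // (S1@1, S2@3)
      // Logical (unpadded) decomposition: division and modulo.
      assert((i / 5) == 2 && (i % 5) == 1);   // (S1@2, S2@1)
      return 0;
    }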


yuanming-hu commented Mar 16, 2020

Maybe the screen-burning is because bitmasked blocks are not zero-filled when re-activated?

During garbage collection, we zero-fill deactivated blocks to make sure they read as zero when reactivated. GC is invoked after offloaded tasks that contain deactivation.

(Note that we are assuming no activation/deactivation on the same SNode can happen within the same offloaded task...Currently there's no compile-time check for this though. #607)
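The invariant is roughly this (a sketch; recycle_block is a hypothetical name, not the actual runtime function):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Zero-fill a deactivated block before it can be reused, so that a later
    // re-activation observes all-zero contents.
    void recycle_block(uint8_t *block_base, std::size_t block_bytes) {
      std::memset(block_base, 0, block_bytes);
      // ...then return the block to the free list (omitted).
    }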


k-ye commented Mar 16, 2020

Maybe the screen-burning is because bitmasked blocks are not zero-filled when re-activated?

You are right :)! I skipped GC tasks completely on Metal, but didn't realize GC also zero-fills the elements. After doing a .fill(0) it looks correct:

[GIF: corrected taichi_sparse demo on Metal]


k-ye commented Mar 18, 2020

A few optimizations to consider:

  • Generate refine_coordinates for each SNode. This way, all the bit-manipulation operands can be baked in at compile time. It also avoids accessing the extractors in global memory (see the sketch after this list).
  • Find a better way to run the hierarchical listgen. Right now each thread covers one SNode, so if that SNode has a big branch-out factor, we are likely under-utilizing the GPU. Figure out how to balance the workload in a more fine-grained way.
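A sketch of the first idea, with illustrative constants (a child SNode whose extent is padded to 8, i.e. 3 bits):

    // Generic version: num_bits/mask must be fetched from the extractor
    // table in global memory at run time.
    inline int refine_generic(int parent_coord, int child_index,
                              int num_bits, int mask) {
      return (parent_coord << num_bits) | (child_index & mask);
    }

    // Per-SNode specialization: the constants are baked in at compile time,
    // so there is no global-memory access for the extractors.
    inline int refine_snode_s2(int parent_coord, int child_index) {
      return (parent_coord << 3) | (child_index & 7);
    }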


k-ye commented Mar 27, 2020

Some cleanups:


k-ye commented Mar 29, 2020

Superseded by #678
