-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Any parallel sorting on the GPU with Taichi or thirdparty ? #3764
Comments
Hi Quentin, Regarding to the atomic assign operation, would this work (doc)?
|
Hello back, It worked for the atomics, thanks! Now, do you know how to get the max number of global threads? |
Hi, I need to take a deeper look into the paper, but I think Taichi hasn't exposed the max num of global threads. A Taichi kernel may be compiled to e.g., multiple CUDA kernels and they could have different number of threads. But this is roughly how we calculated (code link):
This may not help much, we do have a |
We do think a sort intrinsic can be useful for many Taichi users. We are considering add some parallel intrinsics to the language. Let us know what else you find useful ;) |
Is that possible to add some hooks to for external libraries? Many such math functions / permutation utilities are already implemented in vendor libraries, CUB for example. |
Very good point! CUB exposes very efficient permutation utilities, and it's hard to be better ;). Although in Taichi we want to avoid Vendor lock-in, having some external libs could still be a good start... |
Integration of any platform-specific library would definitely make Taichi less consistent, especially when switching among hardware backends. I totally agree that vendor lock-in should be avoided for Taichi, especially in the current stage. The alternative approach is to mimic vendor library with some consistent primitives. However, it has already been proven to be a hard task in TVM. The TVM community recently adopts the BYOC mechanism for different vendor libraries. They are now actively working to bring in CUTLASS, oneDNN and ArmComputeLibrary. If we widen the concept of vendor libraries, they are actually equivalent with user-written codes. The second problem raises here: how to reasonably hybridize compiler generated code with user-written / vendor code? Those codes are probably to be efficient but they are just black boxes for the compiler. I don't think there's already an efficient and elegant solution, hope to see that happen in Taichi :) |
@GrapixLeGrand following your work here with interest! If you get a working implementation, please post details back here if you are able to share :-) |
This sounds like an interesting direction! FYI we have some basic experiments regarding how to integrate with vendor/external libs. One example is the Sparse Matrix feature on top of Eigen (#2792). Another one being @squarefk 's external LLVM bitcode call support (#2873). Clearly, these are prototypes, rather than a systematic solution. Realistically speaking, I don't think we should strive for a completely vendor-free implementation here, or the dev/maintainence cost is gonna be too high :-( I believe either the BYOC or provide a flexible extension system to wrap these vendor libs are worth a shot.. |
Implemented a version of odd-even parallel sort at #3790. |
Hi @GrapixLeGrand. We decide to make more effort on GPU sorting recently. Though we know users prefer to use counting sort/radix sort, we want to know if our odd-even sorting meets your need? If you can share your sorting implementation here, we are willing to help and provide some support in Taichi. |
Hello,
I could not find a thread (nor in the docs) that would cover this subject, therefore I created one. Is there any parallel sorting algorithm in taichi? I'm looking for a radix/counting sort. Is there a known taichi implementation of a radix/counting sort out there? I was considering writing my own following this paper. However, according to the doc (under the atomics section), the atomic assign operation is missing and it would be needed for step 4 of the algorithm
b[c[r[a[i]]]++] = a[i];
.Best,
Quentin
The text was updated successfully, but these errors were encountered: