Concisely describe the proposed feature
Prefix sum is used extensively in particle simulation applications. Currently, the Taichi implementation in taichi.algorithms is not as well optimized as the CuPy implementation, which uses the CUB backend. I did a side-by-side comparison of cupy.cumsum and PrefixSumExecutor and found that, for an array of 10 million elements, CuPy takes 300 µs while Taichi takes 550 µs. It would be beneficial to add a more optimized prefix sum algorithm (potentially with allocation of the output).
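For readers unfamiliar with the operation, here is a minimal pure-Python sketch of an inclusive prefix sum computed block by block, which is the general shape of many parallel scans. It is only an illustration: the function name `blocked_inclusive_scan` and the `block_size` parameter are hypothetical, the code runs sequentially, and it is not the algorithm used by Taichi's PrefixSumExecutor or by CUB.

```python
from itertools import accumulate


def blocked_inclusive_scan(values, block_size=4):
    """Inclusive prefix sum computed block by block (sequential stand-in for a GPU scan)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    # Pass 1: scan each block independently (on a GPU, blocks would run in parallel).
    blocks = [list(accumulate(values[i:i + block_size]))
              for i in range(0, len(values), block_size)]
    # Pass 2: an exclusive scan of the block totals gives each block's starting offset.
    offsets = [0]
    for block in blocks[:-1]:
        offsets.append(offsets[-1] + block[-1])
    # Pass 3: add each block's offset to every element of that block.
    return [x + off for block, off in zip(blocks, offsets) for x in block]


data = [1] * 10
result = blocked_inclusive_scan(data, block_size=4)
# The blocked result matches a plain sequential inclusive scan.
assert result == list(accumulate(data))
```

With an input of all ones, as in the benchmark, element i of the result is simply i + 1, which makes the output easy to check.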
Additional comments
Here is the code I used for benchmarking.
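import taichi as ti
from taichi.algorithms._algorithms import PrefixSumExecutor

# Enable the kernel profiler so per-kernel timings are recorded.
ti.init(arch=ti.gpu, kernel_profiler=True)

array_size = 10000000
array = ti.field(dtype=ti.i32, shape=array_size)
array.fill(1)

# Create the executor once for this array length and reuse it.
PrefixSum = PrefixSumExecutor(array_size)
for i in range(10000):
    PrefixSum.run(array)

# Print the timings collected by the kernel profiler.
ti.profiler.print_kernel_profiler_info()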