[MOE] try to optimize cu kernel single block execution - distribute cumsum workload from thread 0 to other threads #2970
Open
yiakwy-xpu-ml-framework-team wants to merge 2 commits into sgl-project:main from yiakwy-xpu-ml-framework-team:try_to_optimize_moe_align_block_size_cuda_kernel
+111 -15