[MOE] try to optimize cu kernel single block execution - distribute cumsum workload from thread 0 to other threads #2970

Open

yiakwy-xpu-ml-framework-team wants to merge 2 commits into sgl-project:main from yiakwy-xpu-ml-framework-team:try_to_optimize_moe_align_block_size_cuda_kernel

+111 -15

Commits on Jan 20, 2025