Have any plans to optimize the decode kernel for NV-Hopper #576

Open
JamesLim-sy opened this issue Oct 31, 2024 · 3 comments
Comments

@JamesLim-sy

I noticed that the Hopper cluster feature may offer a chance to optimize the performance of batch_decode by merging VariableLengthMergeStates into BatchDecodeWithPagedKVCacheKernel. Is there any plan to use SM90 features for this?

@zhyncs
Member

zhyncs commented Oct 31, 2024

Is there any plan to use SM90 features for it?

ref #507 (comment)

@yzh119
Collaborator

yzh119 commented Nov 2, 2024

Hi @JamesLim-sy, if I understand correctly, you mean using some SMs for decode and other SMs within the same cluster for merging states, so they can communicate through distributed shared memory, is that correct?
I think it's doable, but after the new scheduler lands, the number of states to be merged will be further reduced, so I'm not sure how much advantage this optimization could bring.
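As a rough illustration of the idea under discussion (one block in an SM90 thread block cluster producing a partial result, another block in the same cluster reading it through distributed shared memory), here is a minimal, hypothetical CUDA sketch. It is not FlashInfer's decode or merge kernel; the kernel name, buffer sizes, and the "decode"/"merge" stand-ins are illustrative only.

```cuda
// Minimal SM90 cluster sketch (hypothetical, for illustration only).
// Block 0 of the cluster produces a partial result in its shared memory;
// block 1 reads it through distributed shared memory (DSMEM) and "merges" it,
// avoiding a round trip through global memory between the two stages.
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void __cluster_dims__(2, 1, 1)
decode_then_merge(const float* in, float* out, int n) {
  __shared__ float smem[256];
  cg::cluster_group cluster = cg::this_cluster();
  unsigned rank = cluster.block_rank();  // 0 = "decode" block, 1 = "merge" block
  int tid = threadIdx.x;

  if (rank == 0 && tid < n) {
    smem[tid] = in[tid] * 2.0f;          // stand-in for a partial decode result
  }
  cluster.sync();                        // make rank-0's shared memory visible cluster-wide

  if (rank == 1 && tid < n) {
    // Map rank-0's shared memory into this block's address space (DSMEM).
    const float* peer = cluster.map_shared_rank(smem, 0);
    out[tid] = peer[tid] + 1.0f;         // stand-in for the merge step
  }
  cluster.sync();                        // keep rank-0's smem alive until all reads finish
}
```

Launching this requires building for sm_90 and a grid whose size is a multiple of the cluster size; whether fusing the real decode and merge stages this way would beat the current two-kernel approach is exactly the open question in this thread.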

@JamesLim-sy
Author

JamesLim-sy commented Nov 8, 2024


@yzh119 Yes. In my profiling, the merge_states kernel accounted for around 10% of the total time of the decode_attention operation (batch_decode plus merge_states). Also, I think introducing clusters may shrink the memory allocation for lse in some cases.
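For context on what the merge step computes, a plain reference version of the usual log-sum-exp merge of two partial attention states (output plus per-row lse) might look like the following. This is a generic flash-decoding-style formulation, not the actual VariableLengthMergeStates implementation, and the names are illustrative.

```cuda
#include <cmath>

// Merge two partial attention states (o1, lse1) and (o2, lse2) for one
// query/head into a single state, using the standard log-sum-exp trick.
// Illustrative reference only; not FlashInfer's VariableLengthMergeStates.
void merge_two_states(const float* o1, float lse1,
                      const float* o2, float lse2,
                      float* o_out, float* lse_out, int head_dim) {
  float m = fmaxf(lse1, lse2);            // stabilize the exponentials
  float w1 = expf(lse1 - m);
  float w2 = expf(lse2 - m);
  float denom = w1 + w2;
  for (int d = 0; d < head_dim; ++d) {
    o_out[d] = (w1 * o1[d] + w2 * o2[d]) / denom;  // weighted combination
  }
  *lse_out = m + logf(denom);             // merged log-sum-exp
}
```

The lse buffer exists precisely to carry these per-row normalizers between the split decode passes and the merge; fusing the stages inside a cluster is what could shrink or eliminate that allocation in some cases.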
