Have any plans to optimize the decode kernel for NV-Hopper #576

Open
JamesLim-sy opened this issue Oct 31, 2024 · 3 comments
Comments

@JamesLim-sy

I noticed that the Hopper cluster feature may offer a chance to optimize the performance of batch_decode by merging VariableLengthMergeStates into BatchDecodeWithPagedKVCacheKernel. Is there any plan to use SM90 features for this?

@zhyncs
Member

zhyncs commented Oct 31, 2024

Is there any plan to use SM90 features for it?

ref #507 (comment)

@yzh119
Collaborator

yzh119 commented Nov 2, 2024

Hi @JamesLim-sy, if I understand correctly, you mean using some SMs for decode and other SMs within the same cluster for merging states, so they can communicate through distributed shared memory, is that correct?
I think it's doable, but after the new scheduler lands, the number of states to be merged will be further reduced, so I'm not sure how much advantage this optimization could bring.
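As a rough illustration of the idea under discussion (one block in an SM90 thread block cluster producing a partial result, another block in the same cluster reading it through distributed shared memory), here is a minimal, hypothetical CUDA sketch. It is not FlashInfer's decode or merge kernel; the kernel name, buffer sizes, and the "decode"/"merge" stand-ins are illustrative only.

```cuda
// Minimal SM90 cluster sketch (hypothetical, for illustration only).
// Block 0 of the cluster produces a partial result in its shared memory;
// block 1 reads it through distributed shared memory (DSMEM) and "merges" it,
// avoiding a round trip through global memory between the two stages.
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void __cluster_dims__(2, 1, 1)
decode_then_merge(const float* in, float* out, int n) {
  __shared__ float smem[256];
  cg::cluster_group cluster = cg::this_cluster();
  unsigned rank = cluster.block_rank();  // 0 = "decode" block, 1 = "merge" block
  int tid = threadIdx.x;

  if (rank == 0 && tid < n) {
    smem[tid] = in[tid] * 2.0f;          // stand-in for a partial decode result
  }
  cluster.sync();                        // make rank-0's shared memory visible cluster-wide

  if (rank == 1 && tid < n) {
    // Map rank-0's shared memory into this block's address space (DSMEM).
    const float* peer = cluster.map_shared_rank(smem, 0);
    out[tid] = peer[tid] + 1.0f;         // stand-in for the merge step
  }
  cluster.sync();                        // keep rank-0's smem alive until all reads finish
}
```

Launching this requires building for sm_90 and a grid whose size is a multiple of the cluster size; whether fusing the real decode and merge stages this way would beat the current two-kernel approach is exactly the open question in this thread.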

@JamesLim-sy
Author

JamesLim-sy commented Nov 8, 2024


@yzh119 Yes. In my profiling, the merge_states kernel accounted for around 10% of the total time of the decode_attention operation (batch_decode plus merge_states). Also, I think introducing clusters may shrink the memory allocation for lse in some cases.
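For context on what the merge step computes, a plain reference version of the usual log-sum-exp merge of two partial attention states (output plus per-row lse) might look like the following. This is a generic flash-decoding-style formulation, not the actual VariableLengthMergeStates implementation, and the names are illustrative.

```cuda
#include <cmath>

// Merge two partial attention states (o1, lse1) and (o2, lse2) for one
// query/head into a single state, using the standard log-sum-exp trick.
// Illustrative reference only; not FlashInfer's VariableLengthMergeStates.
void merge_two_states(const float* o1, float lse1,
                      const float* o2, float lse2,
                      float* o_out, float* lse_out, int head_dim) {
  float m = fmaxf(lse1, lse2);            // stabilize the exponentials
  float w1 = expf(lse1 - m);
  float w2 = expf(lse2 - m);
  float denom = w1 + w2;
  for (int d = 0; d < head_dim; ++d) {
    o_out[d] = (w1 * o1[d] + w2 * o2[d]) / denom;  // weighted combination
  }
  *lse_out = m + logf(denom);             // merged log-sum-exp
}
```

The lse buffer exists precisely to carry these per-row normalizers between the split decode passes and the merge; fusing the stages inside a cluster is what could shrink or eliminate that allocation in some cases.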
