[feature suggestion] self speculative decoding #61
Comments
Yes, self-speculative decoding is something I've been interested in for a while. I think it's entirely possible, though I'm a little fuzzy on how the layers are actually chosen for the draft pass. If the skipped layers are fixed, as in they don't change between draft passes, then I think it's very straightforward. You can essentially just have a forward() and forward_draft() on the module to do each pass. I would suggest using two graphs, one for the normal forward pass and one for the draft pass. During inference you can quickly move weights between graphs with the transfer_data() function. I'm super excited to see where this goes. Lmk if you have any questions! I'd be happy to help.
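A minimal, framework-agnostic sketch of the forward()/forward_draft() split described above. The `Layer`/`Llama` types and their contents are hypothetical placeholders rather than Luminal APIs; in the two-graph variant the draft graph would reuse the weights moved over with transfer_data(), but only the control flow is shown here.

```rust
/// One transformer block; the real module would hold Luminal graph tensors.
struct Layer;

impl Layer {
    fn forward(&self, x: Vec<f32>) -> Vec<f32> {
        x // placeholder for attention + MLP
    }
}

struct Llama {
    layers: Vec<Layer>,
    /// Indices of layers the draft pass skips (fixed between draft passes).
    skip: Vec<usize>,
}

impl Llama {
    /// Normal forward pass: run every layer.
    fn forward(&self, mut x: Vec<f32>) -> Vec<f32> {
        for layer in &self.layers {
            x = layer.forward(x);
        }
        x
    }

    /// Draft forward pass: identical, except layers listed in `skip` act as identity.
    fn forward_draft(&self, mut x: Vec<f32>) -> Vec<f32> {
        for (i, layer) in self.layers.iter().enumerate() {
            if self.skip.contains(&i) {
                continue; // skipped layer is bypassed in the draft pass
            }
            x = layer.forward(x);
        }
        x
    }
}
```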
Thanks @jafioti! I'm going to start with two graphs as you suggested. Btw, where do you think would be the right place for this feature to be merged?
I would suggest for now just making it an example (copy llama and rename it llama_speculative) and working off that until it takes shape. Once it works we can see how it fits into the whole ecosystem.
Good morning @jafioti! According to the original implementation of the paper (Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding), the skipped layers are chosen in advance (via a Gaussian process; I think this is not suitable for runtime execution). It seems the Bayesian optimization discussed in the paper is implemented using the Python library "bayes_opt". Once the skipped layers are determined, they don't seem to change during the draft pass. However, in my opinion, the skipped layers could potentially change depending on the type of LLM model and the prompt. So for now, I'm thinking of implementing a function that takes the skipped layers as input (maybe a const generic).
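A rough sketch of the "function that takes skipped layers as input" idea, with both a const-generic and a runtime variant. All names here are hypothetical illustrations, not Luminal APIs, and the skipped indices are assumed to come from the paper's offline Bayesian-optimization search.

```rust
/// Const-generic variant: the number of skipped layers is fixed at compile time,
/// so the draft graph could be specialized when it is built.
fn draft_layer_indices<const N: usize>(num_layers: usize, skipped: [usize; N]) -> Vec<usize> {
    (0..num_layers).filter(|i| !skipped.contains(i)).collect()
}

/// Runtime variant: the skipped set comes from the offline search and is passed as a slice.
fn draft_layer_indices_dyn(num_layers: usize, skipped: &[usize]) -> Vec<usize> {
    (0..num_layers).filter(|i| !skipped.contains(i)).collect()
}

fn main() {
    // e.g. skip layers 3, 7 and 11 of a 16-layer model in the draft pass
    let active = draft_layer_indices::<3>(16, [3, 7, 11]);
    assert_eq!(active.len(), 13);
    println!("{:?}", draft_layer_indices_dyn(16, &[3, 7, 11]));
}
```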
Sure, sounds good 👍
Hey @NewBornRustacean , how's it been going with speculative decoding? If you need any help, feel free to reach out. Happy to talk if you're stuck on anything. |
Thanks for your comment! Actually, I've been so busy with work for the past few weeks that I haven't been able to get much done. This week is a holiday in Korea, so I think I'll finally have some time! If I run into any difficulties, I'll ask for help right away. Btw, the slides you shared (on Discord) were very helpful for understanding the concepts. I've been reading the Mirage paper you shared during my commute, and although it's difficult, I find it interesting. I want to learn more about this topic so I can contribute more to Luminal!
Good morning (or afternoon/evening)!
Among the techniques for speeding up LLM inference, there is a method called self-speculative decoding. Would it be possible to implement this feature in Luminal? If it aligns with Luminal's philosophy, I believe this kind of work could contribute a lot to inference speed! Even though it's not included in the v0.3 roadmap, I'd like to start on this task slowly if that's alright.
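For context, a minimal sketch of the decode loop behind self-speculative decoding, assuming greedy verification (a drafted token is kept only if the full model would have produced the same token at that position). The `Model` trait and its methods are hypothetical placeholders for whatever the eventual Luminal module exposes, not existing APIs.

```rust
trait Model {
    /// Full forward pass: next-token prediction for every position in `tokens`.
    fn forward(&self, tokens: &[u32]) -> Vec<u32>;
    /// Draft forward pass with some layers skipped; predicts one next token.
    fn forward_draft(&self, tokens: &[u32]) -> u32;
}

/// Generate up to `max_new` tokens, drafting `k` tokens at a time with the cheap
/// pass and verifying them in a single full pass. Assumes a non-empty prompt.
fn generate(model: &impl Model, mut tokens: Vec<u32>, k: usize, max_new: usize) -> Vec<u32> {
    let mut produced = 0;
    while produced < max_new {
        // 1. Draft: propose k tokens autoregressively with the skip-layer pass.
        let mut draft = Vec::with_capacity(k);
        let mut ctx = tokens.clone();
        for _ in 0..k {
            let t = model.forward_draft(&ctx);
            draft.push(t);
            ctx.push(t);
        }

        // 2. Verify: one full pass over context + draft yields, at each draft
        //    position, the token the full model would have emitted there.
        let full = model.forward(&ctx);
        let preds = &full[tokens.len() - 1..];

        // 3. Accept the longest matching prefix of the draft, then append the
        //    full model's correction for the first mismatch.
        let mut accepted = 0;
        while accepted < k && preds[accepted] == draft[accepted] {
            accepted += 1;
        }
        tokens.extend_from_slice(&draft[..accepted]);
        if accepted < k {
            tokens.push(preds[accepted]);
            produced += 1;
        }
        produced += accepted;
    }
    tokens
}
```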