
[feature suggestion] self speculative decoding #61

Open
NewBornRustacean opened this issue May 8, 2024 · 7 comments


@NewBornRustacean
Contributor

NewBornRustacean commented May 8, 2024

Good morning (or afternoon/evening)!

There is a methodology called self-speculative decoding among the techniques for speeding up LLM inference. Would it be possible to implement this feature in Luminal? If it aligns with Luminal's philosophy, I believe this kind of work could contribute greatly to speed improvements! Even though it's not included in the v0.3 roadmap, I'd like to start this task slowly if that's alright.

Summary of abstract
This paper introduces self-speculative decoding, a novel inference scheme that accelerates Large Language Models (LLMs) without relying on auxiliary models. It operates in two stages: drafting, which quickly generates draft tokens by selectively skipping intermediate layers, and verification, which validates the draft tokens with the original LLM in a single forward pass. The approach keeps the output identical to that of the unaltered LLM, requires no additional neural network training or extra memory, and offers a plug-and-play, cost-effective solution for inference acceleration, with benchmarks showing speedups of up to 1.73× on LLaMA-2 and its fine-tuned models.
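For reference, here is a minimal sketch of the draft-and-verify loop under greedy decoding. It is not Luminal-specific: `draft_next` and `full_next_all` are hypothetical closures standing in for a layer-skipping draft pass and the full model pass, and a non-empty prompt is assumed.

```rust
// Minimal sketch of self-speculative decoding with greedy verification.
// `draft_next` returns the next token from the layer-skipping pass;
// `full_next_all` returns, for each position i of the input, the full
// model's prediction for the token that follows position i.
fn generate_speculative(
    mut tokens: Vec<u32>,               // prompt (assumed non-empty)
    draft_len: usize,                   // tokens proposed per drafting round
    max_new_tokens: usize,
    draft_next: impl Fn(&[u32]) -> u32,
    full_next_all: impl Fn(&[u32]) -> Vec<u32>,
) -> Vec<u32> {
    let mut generated = 0;
    while generated < max_new_tokens {
        // 1. Drafting: greedily propose `draft_len` tokens with the fast pass.
        let mut ctx = tokens.clone();
        let mut draft = Vec::with_capacity(draft_len);
        for _ in 0..draft_len {
            let t = draft_next(&ctx);
            ctx.push(t);
            draft.push(t);
        }
        // 2. Verification: a single full forward pass scores every draft
        // position at once.
        let full_preds = full_next_all(&ctx);
        // Accept the longest prefix on which the full model agrees.
        let mut accepted = 0;
        for (i, &t) in draft.iter().enumerate() {
            if full_preds[tokens.len() + i - 1] == t {
                accepted += 1;
            } else {
                break;
            }
        }
        // Keep the accepted draft tokens plus one correction token that the
        // full pass produced for free.
        tokens.extend_from_slice(&draft[..accepted]);
        tokens.push(full_preds[tokens.len() - 1]);
        generated += accepted + 1;
    }
    tokens
}
```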

@jafioti
Owner

jafioti commented May 8, 2024

Yes self-speculative decoding is something I've been interested in for a while. I think it's entirely possible, though I'm a little fuzzy on how the layers are actually chosen for the draft pass. If the skipped layers are fixed, as in they don't change between draft passes, then I think it's very straightforward. You can essentially just have a forward() and forward_draft() on the module to do each pass.

I would suggest using two graphs, one for the normal forward pass and one for the draft pass. During inference you can quickly move weights between graphs with the transfer_data() function.
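For concreteness, the forward() / forward_draft() split could look roughly like the sketch below. `SpecModel`, `Block`, and `Act` are placeholder names rather than real Luminal types, and the comments about the two-graph setup only restate the suggestion above.

```rust
// Placeholder activation and block types so the sketch is self-contained;
// these stand in for a real transformer block and tensor, not Luminal types.
#[derive(Clone, Copy)]
struct Act(f32);

struct Block(f32);
impl Block {
    fn run(&self, x: Act) -> Act {
        Act(x.0 + self.0) // stand-in for attention + MLP
    }
}

struct SpecModel {
    blocks: Vec<Block>,
    skipped: Vec<usize>, // layer indices bypassed during drafting
}

impl SpecModel {
    // Normal pass: every block runs. In the two-graph setup this pass would
    // be built on its own graph and used for verification.
    fn forward(&self, mut x: Act) -> Act {
        for b in &self.blocks {
            x = b.run(x);
        }
        x
    }

    // Draft pass: skipped blocks are bypassed, so the graph built for this
    // pass is shallower and cheaper. Weights would be moved between the two
    // graphs with transfer_data() at inference time, per the suggestion above.
    fn forward_draft(&self, mut x: Act) -> Act {
        for (i, b) in self.blocks.iter().enumerate() {
            if !self.skipped.contains(&i) {
                x = b.run(x);
            }
        }
        x
    }
}
```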

I'm super excited to see where this goes. Lmk if you have any questions! I'd be happy to help

@NewBornRustacean
Contributor Author

NewBornRustacean commented May 8, 2024

Thanks! @jafioti

I'm gonna start with two graphs as you suggested.

Btw, where do you think the right place is for this feature to be merged?
Creating generation.rs in either luminal/crates/luminal_nn/src/ or luminal/crates/luminal_nn/src/transformer seems possible, I guess.

@jafioti
Owner

jafioti commented May 8, 2024

I would suggest for now just making it an example (copy llama and rename it llama_speculative) and working off that until it takes shape. Once it works we can see how it fits into the whole ecosystem.

@NewBornRustacean
Contributor Author

NewBornRustacean commented May 10, 2024

Good morning! @jafioti

According to the original implementation of the paper (Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding), the skipped layers are chosen in advance (via Bayesian optimization with a Gaussian process; I think this is not suitable for runtime execution).

It seems like the Bayesian optimization discussed in the paper is implemented using the Python library "bayes_opt". Once the skipped layers are determined, they don't seem to change during the draft pass. However, in my opinion, depending on the LLM model and the prompt, the skipped layers could potentially change. So for now I'm thinking of implementing a function that takes the skipped layers as input (maybe a const generic), as in the sketch below.
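One possible shape for "skipped layers as input" is sketched below, showing both a runtime slice and a const-generic variant. `Layer` and `apply` are dummy placeholders so the example compiles on its own; none of these names are existing Luminal API.

```rust
// Dummy layer so the sketch is self-contained; not a Luminal type.
struct Layer;
impl Layer {
    fn apply(&self, x: f32) -> f32 {
        x + 1.0
    }
}

// Runtime skip list: flexible, the skip set can differ per model or prompt.
fn draft_pass(layers: &[Layer], skip: &[usize], mut x: f32) -> f32 {
    for (i, layer) in layers.iter().enumerate() {
        if !skip.contains(&i) {
            x = layer.apply(x);
        }
    }
    x
}

// Const-generic variant: the number of skipped layers is fixed at compile time.
fn draft_pass_const<const N: usize>(layers: &[Layer], skip: &[usize; N], mut x: f32) -> f32 {
    for (i, layer) in layers.iter().enumerate() {
        if !skip.contains(&i) {
            x = layer.apply(x);
        }
    }
    x
}

fn main() {
    let layers = vec![Layer, Layer, Layer, Layer];
    // Skip layers 1 and 3 in the draft pass: only layers 0 and 2 run.
    assert_eq!(draft_pass(&layers, &[1, 3], 0.0), 2.0);
    assert_eq!(draft_pass_const(&layers, &[1, 3], 0.0), 2.0);
}
```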

@jafioti
Owner

jafioti commented May 10, 2024

Sure, sounds good 👍

@jafioti
Owner

jafioti commented Jun 5, 2024

Hey @NewBornRustacean , how's it been going with speculative decoding? If you need any help, feel free to reach out. Happy to talk if you're stuck on anything.

@NewBornRustacean
Contributor Author

Thanks for your comment! Actually, I've been so busy with work for the past few weeks that I haven't been able to get much done. This week is a holiday in Korea, so I think I'll finally have some time! If I run into any difficulties, I'll ask for help right away.

Btw, the slides you shared (on Discord) were very helpful for understanding the concepts.

I've been reading the Mirage paper you shared during my commute, and although it's difficult, I find it interesting. I want to learn more about this topic so I can contribute more to Luminal!
