
[feature suggestion] self speculative decoding #61

Open
NewBornRustacean opened this issue May 8, 2024 · 7 comments


@NewBornRustacean
Contributor

NewBornRustacean commented May 8, 2024

Good morning (or afternoon/evening)!

There is a methodology called self-speculative decoding among the techniques for speeding up LLM inference. Would it be possible to implement this feature in Luminal? If it aligns with Luminal's philosophy, I believe this kind of work could contribute greatly to speed improvements! Even though it's not included in the v0.3 roadmap, I'd like to start this task slowly if that's alright.

Summary of abstract
This paper introduces self-speculative decoding, a novel inference scheme that accelerates Large Language Models (LLMs) without relying on auxiliary models. It operates in two stages: drafting, which quickly generates draft tokens by selectively skipping intermediate layers, and verification, which validates the draft tokens with the original LLM in a single forward pass. The approach keeps the output identical to that of the unaltered LLM, requires no additional neural network training or extra memory, and offers a plug-and-play, cost-effective solution for inference acceleration, with benchmarks showing speedups of up to 1.73× on LLaMA-2 and its fine-tuned models.
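For reference, here is a minimal sketch of the draft-and-verify loop under greedy decoding. It is not Luminal-specific: `draft_next` and `full_next_all` are hypothetical closures standing in for a layer-skipping draft pass and the full model pass, and a non-empty prompt is assumed.

```rust
// Minimal sketch of self-speculative decoding with greedy verification.
// `draft_next` returns the next token from the layer-skipping pass;
// `full_next_all` returns, for each position i of the input, the full
// model's prediction for the token that follows position i.
fn generate_speculative(
    mut tokens: Vec<u32>,               // prompt (assumed non-empty)
    draft_len: usize,                   // tokens proposed per drafting round
    max_new_tokens: usize,
    draft_next: impl Fn(&[u32]) -> u32,
    full_next_all: impl Fn(&[u32]) -> Vec<u32>,
) -> Vec<u32> {
    let mut generated = 0;
    while generated < max_new_tokens {
        // 1. Drafting: greedily propose `draft_len` tokens with the fast pass.
        let mut ctx = tokens.clone();
        let mut draft = Vec::with_capacity(draft_len);
        for _ in 0..draft_len {
            let t = draft_next(&ctx);
            ctx.push(t);
            draft.push(t);
        }
        // 2. Verification: a single full forward pass scores every draft
        // position at once.
        let full_preds = full_next_all(&ctx);
        // Accept the longest prefix on which the full model agrees.
        let mut accepted = 0;
        for (i, &t) in draft.iter().enumerate() {
            if full_preds[tokens.len() + i - 1] == t {
                accepted += 1;
            } else {
                break;
            }
        }
        // Keep the accepted draft tokens plus one correction token that the
        // full pass produced for free.
        tokens.extend_from_slice(&draft[..accepted]);
        tokens.push(full_preds[tokens.len() - 1]);
        generated += accepted + 1;
    }
    tokens
}
```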

@jafioti
Owner

jafioti commented May 8, 2024

Yes self-speculative decoding is something I've been interested in for a while. I think it's entirely possible, though I'm a little fuzzy on how the layers are actually chosen for the draft pass. If the skipped layers are fixed, as in they don't change between draft passes, then I think it's very straightforward. You can essentially just have a forward() and forward_draft() on the module to do each pass.

I would suggest using two graphs, one for the normal forward pass and one for the draft pass. During inference you can quickly move weights between graphs with the transfer_data() function.
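For concreteness, the forward() / forward_draft() split could look roughly like the sketch below. `SpecModel`, `Block`, and `Act` are placeholder names rather than real Luminal types, and the comments about the two-graph setup only restate the suggestion above.

```rust
// Placeholder activation and block types so the sketch is self-contained;
// these stand in for a real transformer block and tensor, not Luminal types.
#[derive(Clone, Copy)]
struct Act(f32);

struct Block(f32);
impl Block {
    fn run(&self, x: Act) -> Act {
        Act(x.0 + self.0) // stand-in for attention + MLP
    }
}

struct SpecModel {
    blocks: Vec<Block>,
    skipped: Vec<usize>, // layer indices bypassed during drafting
}

impl SpecModel {
    // Normal pass: every block runs. In the two-graph setup this pass would
    // be built on its own graph and used for verification.
    fn forward(&self, mut x: Act) -> Act {
        for b in &self.blocks {
            x = b.run(x);
        }
        x
    }

    // Draft pass: skipped blocks are bypassed, so the graph built for this
    // pass is shallower and cheaper. Weights would be moved between the two
    // graphs with transfer_data() at inference time, per the suggestion above.
    fn forward_draft(&self, mut x: Act) -> Act {
        for (i, b) in self.blocks.iter().enumerate() {
            if !self.skipped.contains(&i) {
                x = b.run(x);
            }
        }
        x
    }
}
```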

I'm super excited to see where this goes. Lmk if you have any questions! I'd be happy to help

@NewBornRustacean
Contributor Author

NewBornRustacean commented May 8, 2024

Thanks! @jafioti

I'm gonna start with two graphs as you suggested.

Btw, where do you think the right place is for this feature to be merged?
Creating generation.rs in either luminal/crates/luminal_nn/src/ or luminal/crates/luminal_nn/src/transformer seems possible, I guess.

@jafioti
Owner

jafioti commented May 8, 2024

I would suggest for now just making it an example (copy llama and rename it llama_speculative) and working off that until it takes shape. Once it works we can see how it fits into the whole ecosystem.

@NewBornRustacean
Contributor Author

NewBornRustacean commented May 10, 2024

Good morning! @jafioti

According to the original implementation of the paper (Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding), the skipped layers are chosen in advance (via Bayesian optimization with a Gaussian process; I think this is not suitable for runtime execution).

It seems like the Bayesian optimization discussed in the paper is implemented using the Python library "bayes_opt". Once the skipped layers are determined, they don't seem to change during the draft pass. However, in my opinion, depending on the LLM model and the prompt, the skipped layers could potentially change. So for now I'm thinking of implementing a function that takes the skipped layers as input (maybe a const generic), as in the sketch below.
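One possible shape for "skipped layers as input" is sketched below, showing both a runtime slice and a const-generic variant. `Layer` and `apply` are dummy placeholders so the example compiles on its own; none of these names are existing Luminal API.

```rust
// Dummy layer so the sketch is self-contained; not a Luminal type.
struct Layer;
impl Layer {
    fn apply(&self, x: f32) -> f32 {
        x + 1.0
    }
}

// Runtime skip list: flexible, the skip set can differ per model or prompt.
fn draft_pass(layers: &[Layer], skip: &[usize], mut x: f32) -> f32 {
    for (i, layer) in layers.iter().enumerate() {
        if !skip.contains(&i) {
            x = layer.apply(x);
        }
    }
    x
}

// Const-generic variant: the number of skipped layers is fixed at compile time.
fn draft_pass_const<const N: usize>(layers: &[Layer], skip: &[usize; N], mut x: f32) -> f32 {
    for (i, layer) in layers.iter().enumerate() {
        if !skip.contains(&i) {
            x = layer.apply(x);
        }
    }
    x
}

fn main() {
    let layers = vec![Layer, Layer, Layer, Layer];
    // Skip layers 1 and 3 in the draft pass: only layers 0 and 2 run.
    assert_eq!(draft_pass(&layers, &[1, 3], 0.0), 2.0);
    assert_eq!(draft_pass_const(&layers, &[1, 3], 0.0), 2.0);
}
```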

@jafioti
Owner

jafioti commented May 10, 2024

Sure, sounds good 👍

@jafioti
Owner

jafioti commented Jun 5, 2024

Hey @NewBornRustacean , how's it been going with speculative decoding? If you need any help, feel free to reach out. Happy to talk if you're stuck on anything.

@NewBornRustacean
Contributor Author

Thanks for your comment! Actually, I've been so busy with work for the past few weeks that I haven't been able to get much done. This week is a holiday in Korea, so I think I'll finally have some time! If I run into any difficulties, I'll ask for help right away.

Btw, the slides you shared (on Discord) were very helpful for understanding the concepts.

I've been reading the Mirage paper you shared during my commute, and although it's difficult, I find it interesting. I want to learn more about this topic so I can contribute more to Luminal!
