
Dynamic speculative decoding is significantly slower than auto-regressive and than speculative decoding generation #2621

Status: Open
shira-g opened this issue Jan 1, 2025 · 4 comments
Labels: category: GPU (OpenVINO GPU plugin), PSE (Escalate to PSE for further investigation)

Comments

@shira-g (Contributor)

shira-g commented Jan 1, 2025

Below are my results from running the speculative-sampling notebook.

Device: GPU
Models and drafts:
Phi-3 pair:
draft_model_id = "OpenVINO/Phi-3-mini-FastDraft-50M-int8-ov"
target_model_id = "OpenVINO/Phi-3-mini-4k-instruct-int4-ov"

Llama3.1 pair:
draft_model_id = "OpenVINO/Llama-3.1-8B-Instruct-FastDraft-150M-int8-ov"
target_model_id = "fakezeta/Meta-Llama-3.1-8B-Instruct-ov-int4"

Results:
AR - autoregressive time
SD - speculative decoding time
dynamic - dynamic speculative decoding time

[Screenshot (2025-01-01): table of AR, SD, and dynamic SD generation times for both model pairs]
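For context, dynamic speculative decoding varies how many tokens the draft model proposes per step instead of using a fixed window. A minimal sketch of one such adaptation heuristic (hypothetical; this is not OpenVINO's actual scheduler):

```python
def adjust_draft_len(draft_len, accepted, proposed, lo=1, hi=10):
    """Hypothetical dynamic-draft-length heuristic: grow the draft
    window when every proposed token was accepted by the target model,
    shrink it after any rejection. lo/hi are assumed bounds."""
    if accepted == proposed:
        return min(draft_len + 2, hi)
    return max(draft_len - 1, lo)
```

The extra bookkeeping per step is what can make this mode slower than a fixed window when the draft model is already cheap.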

@YuChern-Intel

I noticed a disclaimer stating that for small, fast draft models like FastDraft, you may not see a benefit from dynamic speculative decoding.
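A common back-of-the-envelope model from the speculative decoding literature makes that disclaimer concrete: with per-token acceptance rate alpha, draft window k, and draft/target cost ratio c, the expected speedup over autoregressive decoding can be estimated as below. When the draft is very cheap (small c, as with FastDraft), a generous fixed k is already near-optimal, leaving little headroom for a dynamic scheduler to recover. (Illustrative sketch; alpha, k, and c are assumed parameters, not measured values from this issue.)

```python
def expected_speedup(alpha, k, c):
    """Estimated speculative-decoding speedup: expected tokens produced
    per target-model call, divided by the relative cost of that call
    plus k draft calls at cost ratio c each."""
    tokens_per_iter = (1 - alpha ** (k + 1)) / (1 - alpha)
    cost_per_iter = k * c + 1
    return tokens_per_iter / cost_per_iter

# With a high acceptance rate and a cheap draft, a larger fixed window
# already wins, e.g. expected_speedup(0.8, 5, 0.05) > expected_speedup(0.8, 1, 0.05).
```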

Does the same behaviour happen when using CPU?

In the meantime, could you provide more details on how you obtained the results? That would help me reproduce them on my end.

@shira-g (Contributor, Author)

shira-g commented Jan 7, 2025

I'm not sure there should be such slowness with FastDraft either.
To reproduce the results, I ran the speculative-sampling notebook with device = GPU and set the prompt to the one shown in the table above.

You should run the first two sections, "Run target model without speculative decoding" and "Run Speculative decoding pipeline", and check the reported times.
For the last part, which runs dynamic speculative decoding, set the prompt inside the pipe.generate call to match the prompt used in the previous parts:
result = pipe.generate(prompt, config, streamer)
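To compare the three modes on equal footing, each generation call can be wrapped in the same wall-clock timer. A generic sketch (generate_fn stands in for any callable, e.g. a hypothetical wrapper around pipe.generate):

```python
import time

def timed_generate(generate_fn, prompt):
    """Run a generation callable and return (result, elapsed_seconds).
    Uses a monotonic high-resolution clock so runs are comparable."""
    start = time.perf_counter()
    result = generate_fn(prompt)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

Timing all three pipelines through the same helper avoids accidentally including warm-up or streaming overhead in only one of the measurements.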

@YuChern-Intel

On my side, I am getting a ProgramBuilder build failed error when running the speculative decoding pipeline on GPU.

Could you share your GPU specifications and which branch (master or a specific branch) you are using?

@shira-g
Copy link
Contributor Author

shira-g commented Jan 9, 2025

I use the master branch.
My system is LNL (Lunar Lake):
Windows 11
Intel(R) Core(TM) Ultra 5 238V 2.10GHz
GPU: Intel(R) Arc(TM) 130V GPU gfx-driver-ci-master-17368 DCH RI (16GB)

It might reproduce on CPU as well; I haven't checked.

@YuChern-Intel YuChern-Intel added the PSE Escalate to PSE for further investigate label Jan 13, 2025
@avitial avitial added the category: GPU OpenVINO GPU plugin label Jan 14, 2025