
Dynamic speculative decoding is significantly slower than auto-regressive and than speculative decoding generation #2621

Status: Open
shira-g opened this issue Jan 1, 2025 · 4 comments
Labels: category: GPU (OpenVINO GPU plugin), PSE (Escalate to PSE for further investigation)

Comments

@shira-g (Contributor)

shira-g commented Jan 1, 2025

Below are my results from running the speculative-sampling notebook.

Device: GPU
Models and drafts:
Phi-3 pair:
draft_model_id = "OpenVINO/Phi-3-mini-FastDraft-50M-int8-ov"
target_model_id = "OpenVINO/Phi-3-mini-4k-instruct-int4-ov"

Llama3.1 pair:
draft_model_id = "OpenVINO/Llama-3.1-8B-Instruct-FastDraft-150M-int8-ov"
target_model_id = "fakezeta/Meta-Llama-3.1-8B-Instruct-ov-int4"

Results:
AR - autoregressive time
SD - speculative decoding time
dynamic - dynamic speculative decoding time

[Screenshot (2025-01-01): table of AR, SD, and dynamic SD generation times for both model pairs]
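For context, dynamic speculative decoding varies how many tokens the draft model proposes per step instead of using a fixed window. A minimal sketch of one such adaptation heuristic (hypothetical; this is not OpenVINO's actual scheduler):

```python
def adjust_draft_len(draft_len, accepted, proposed, lo=1, hi=10):
    """Hypothetical dynamic-draft-length heuristic: grow the draft
    window when every proposed token was accepted by the target model,
    shrink it after any rejection. lo/hi are assumed bounds."""
    if accepted == proposed:
        return min(draft_len + 2, hi)
    return max(draft_len - 1, lo)
```

The extra bookkeeping per step is what can make this mode slower than a fixed window when the draft model is already cheap.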

@YuChern-Intel

I noticed a disclaimer stating that for small, fast draft models like FastDraft, you may not see a benefit from dynamic speculative decoding.
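A common back-of-the-envelope model from the speculative decoding literature makes that disclaimer concrete: with per-token acceptance rate alpha, draft window k, and draft/target cost ratio c, the expected speedup over autoregressive decoding can be estimated as below. When the draft is very cheap (small c, as with FastDraft), a generous fixed k is already near-optimal, leaving little headroom for a dynamic scheduler to recover. (Illustrative sketch; alpha, k, and c are assumed parameters, not measured values from this issue.)

```python
def expected_speedup(alpha, k, c):
    """Estimated speculative-decoding speedup: expected tokens produced
    per target-model call, divided by the relative cost of that call
    plus k draft calls at cost ratio c each."""
    tokens_per_iter = (1 - alpha ** (k + 1)) / (1 - alpha)
    cost_per_iter = k * c + 1
    return tokens_per_iter / cost_per_iter

# With a high acceptance rate and a cheap draft, a larger fixed window
# already wins, e.g. expected_speedup(0.8, 5, 0.05) > expected_speedup(0.8, 1, 0.05).
```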

Does the same behaviour happen when using CPU?

In the meantime, could you provide more details on how you obtained the results? That would help me reproduce them on my end.

@shira-g (Contributor, Author)

shira-g commented Jan 7, 2025

I'm not sure there should be such slowness with FastDraft either.
To reproduce the results, I ran the speculative-sampling notebook with device = GPU and set the prompt to the one shown in the table above.

You should run the first two sections, "Run target model without speculative decoding" and "Run Speculative decoding pipeline", and check the reported times.
For the last part, which runs dynamic speculative decoding, set the prompt inside the pipe.generate call to match the prompt used in the previous parts:
result = pipe.generate(prompt, config, streamer)
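To compare the three modes on equal footing, each generation call can be wrapped in the same wall-clock timer. A generic sketch (generate_fn stands in for any callable, e.g. a hypothetical wrapper around pipe.generate):

```python
import time

def timed_generate(generate_fn, prompt):
    """Run a generation callable and return (result, elapsed_seconds).
    Uses a monotonic high-resolution clock so runs are comparable."""
    start = time.perf_counter()
    result = generate_fn(prompt)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

Timing all three pipelines through the same helper avoids accidentally including warm-up or streaming overhead in only one of the measurements.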

@YuChern-Intel

On my side, I am getting a ProgramBuilder build failed error when running the speculative decoding pipeline on GPU.

Could you share your GPU specifications and which branch (master or a specific branch) you are using?

@shira-g
Copy link
Contributor Author

shira-g commented Jan 9, 2025

I use the master branch.
My system is LNL (Lunar Lake):
Windows 11
Intel(R) Core(TM) Ultra 5 238V 2.10GHz
GPU: Intel(R) Arc(TM) 130V GPU gfx-driver-ci-master-17368 DCH RI (16GB)

It might reproduce on CPU as well; I haven't checked.

@YuChern-Intel YuChern-Intel added the PSE Escalate to PSE for further investigate label Jan 13, 2025
@avitial avitial added the category: GPU OpenVINO GPU plugin label Jan 14, 2025