I noticed a disclaimer saying that for small, fast draft models like FastDraft you may not see a benefit from dynamic speculative decoding.
Does the same behaviour happen when using CPU?
In the meantime, could you provide more details on how you obtained the results? That would help me reproduce them on my end.
I don't think there should be such slowness with FastDraft either.
To reproduce the results, I run the speculative-sampling notebook with device = GPU and the prompt set to the one I mention in the table above.
You should run the first two sections: "Run target model without speculative decoding" and "Run Speculative decoding pipeline" and check the output times.
For the last part, which runs dynamic speculative decoding, set the prompt inside the pipe.generate call to match the prompt you used in the previous parts: result = pipe.generate(prompt, config, streamer)
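To compare the three runs on an equal footing, I just time each generate call with the same prompt. A minimal sketch of that measurement (timed_generate and speedup are helper names I'm introducing here, not part of the notebook; any generate callable can be passed in, e.g. a lambda wrapping pipe.generate):

```python
import time

def timed_generate(generate_fn, prompt):
    """Run one generate call and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = generate_fn(prompt)
    elapsed = time.perf_counter() - start
    return result, elapsed

def speedup(ar_seconds, sd_seconds):
    """Speedup of speculative decoding over plain autoregressive decoding."""
    return ar_seconds / sd_seconds
```

In the notebook this would look like, for example, _, ar_t = timed_generate(lambda p: pipe.generate(p, config, streamer), prompt) for the autoregressive baseline, and the same call against the speculative and dynamic pipelines.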
I use the master branch.
I use a LNL system:
Windows 11
Intel(R) Core(TM) Ultra 5 238V 2.10GHz
GPU: Intel(R) Arc(TM) 130V GPU gfx-driver-ci-master-17368 DCH RI (16GB)
Below are my results from running the speculative-sampling notebook.
Device: GPU
Models and drafts:
Phi-3 pair:
draft_model_id = "OpenVINO/Phi-3-mini-FastDraft-50M-int8-ov"
target_model_id = "OpenVINO/Phi-3-mini-4k-instruct-int4-ov"
Llama3.1 pair:
draft_model_id = "OpenVINO/Llama-3.1-8B-Instruct-FastDraft-150M-int8-ov"
target_model_id = "fakezeta/Meta-Llama-3.1-8B-Instruct-ov-int4"
Results:
AR - autoregressive time
SD - speculative decoding time
dynamic - dynamic speculative decoding time