I noticed that chat completion through neuralchat_server is very slow compared to loading the model directly with AutoModelForCausalLM and calling generate (after applying the chat template).
In both cases I'm using the same model and the same quantization config:
# with intel_extension_for_transformer RtnConfig(compute_dtype="fp32",weight_dtype="int4")
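For reference, this is roughly how I'm doing the direct load-and-generate path (the model name below is just a placeholder for the model I'm actually testing):

```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, RtnConfig

model_name = "Intel/neural-chat-7b-v3-1"  # placeholder; substitute the model being benchmarked

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=RtnConfig(compute_dtype="fp32", weight_dtype="int4"),
)

messages = [{"role": "user", "content": "Hello, how are you?"}]

# Apply the chat template before generating, as described above.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```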
Is this slowdown with neuralchat_server expected? Or is there another way to start an OpenAI-compatible API server?
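On the server side, I'm sending requests roughly like this (the host, port, and `/v1/chat/completions` route are assumptions based on the example configs, so they may differ from your setup):

```python
import requests

# Assumes neuralchat_server is already running locally and exposes an
# OpenAI-compatible chat-completion route; adjust host/port/route as needed.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # assumed endpoint
    json={
        "model": "Intel/neural-chat-7b-v3-1",  # placeholder model name
        "messages": [{"role": "user", "content": "Hello, how are you?"}],
    },
    timeout=300,
)
print(resp.json())
```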