Can inference be run on consumer hardware? #8

AMD? CPU? Single GPU?
Is this all possible via FastChat?

Comments
@GrahamboJangles It is already in FastChat: https://github.com/lm-sys/FastChat#longchat. We currently test it on a single A100 GPU and it works pretty well. We are adding more support to let it run more efficiently. Let me know whether it works on your hardware, and we can improve the system support!
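(Editor's note: the documented path is the FastChat CLI linked above. For illustration only, here is a minimal sketch using the plain Hugging Face transformers API. This is an assumption, not the project's documented method: LongChat's 16k context relies on a condensed rotary-embedding patch that FastChat applies at load time, so a vanilla transformers load may not honor the full context length.)

```python
# Minimal sketch: loading longchat-7b-16k with plain transformers.
# Assumption: the standard transformers loading path works for this checkpoint;
# the 16k context extension may require FastChat's rotary-embedding patch,
# so treat this as illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lmsys/longchat-7b-16k"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # fp16 weights: roughly 13 GiB for a 7B model
    device_map="auto",          # place layers on available GPU(s); needs accelerate
)

prompt = "Summarize the following text: ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```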
@DachengLi1 I have 2 RX 6800s; I'm guessing they are not yet supported?
Regarding the RX series, please see the discussion here. The inference is backed by FastChat, and it seems people can get AMD cards working. Can you run (there is no --load-8bit support yet): python3 -m fastchat.serve.cli --model-path lmsys/longchat-7b-16k and let me know if it works for you? Also feel free to submit an issue in FastChat regarding this.
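(Editor's note: before filing an issue, it can help to confirm that PyTorch actually sees the AMD GPUs. ROCm builds of PyTorch expose AMD devices through the torch.cuda API, so a quick check looks like the sketch below; it assumes a ROCm-enabled PyTorch install.)

```python
# Quick sanity check that a ROCm build of PyTorch detects the AMD GPUs.
# ROCm builds report AMD devices through the torch.cuda namespace.
import torch

print(torch.cuda.is_available())   # True if the ROCm runtime found a GPU
print(torch.cuda.device_count())   # e.g. 2 for two RX 6800s
if torch.cuda.is_available():
    print(torch.version.hip)               # set on ROCm builds, None on CUDA builds
    print(torch.cuda.get_device_name(0))   # device name of the first GPU
```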
@DachengLi1 thank you for your help and quick responses. I ran that command and this was the output: [error output attachment not preserved]
@GrahamboJangles Thanks for trying it out! Can you file this in the FastChat repo? I will also ask the FastChat team to look into it there.
@DachengLi1 Absolutely! Thanks again for your help.
@DachengLi1 I was trying to run inference with longchat-7b-16k on an A100 machine with a 40GB GPU. I get a CUDA out-of-memory error because the memory is not sufficient. The texts I was using as input, read from a parquet file, were around 9k tokens each. Can you tell me about the upcoming roadmap for efficiency gains, and any ETA, so that I can run inference with fewer resources?
@sejalchopra97 For now you can run 9k tokens with flash attention support (but that does not support the KV cache, so it will be slow). We just got a team member working on it on the vLLM side; once she is done, we will update here. @LiuXiaoxuanPKU, let me know if you have any suggestions!
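(Editor's note: for intuition on why a 9k-token input can exhaust a 40GB A100 with a 7B model, here is a rough back-of-the-envelope estimate. The shapes below, 32 layers, 32 heads, head dim 128, fp16, are the usual LLaMA-7B configuration; the numbers are approximations, not measurements.)

```python
# Back-of-the-envelope memory estimate for a 9k-token prompt on LLaMA-7B.
# Assumed shapes: 32 layers, 32 heads, head_dim 128, fp16. Approximate only.
layers, heads, head_dim, seq = 32, 32, 128, 9000
bytes_fp16 = 2
gib = 1024 ** 3

weights = 7e9 * bytes_fp16                                    # ~13 GiB
kv_cache = layers * 2 * heads * head_dim * seq * bytes_fp16   # K and V: ~4.4 GiB

# Without flash attention, prefill materializes a full (seq x seq) score
# matrix per layer; with 32 heads in fp16 that peaks at roughly:
attn_scores = heads * seq * seq * bytes_fp16                  # ~4.8 GiB per layer

print(f"weights    ~ {weights / gib:.1f} GiB")
print(f"kv cache   ~ {kv_cache / gib:.1f} GiB")
print(f"attn peak  ~ {attn_scores / gib:.1f} GiB per layer (no flash attention)")
```

The per-layer attention matrix is the term flash attention removes, which is why it helps at long sequence lengths even though the KV cache still grows linearly with context.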