Implementing Code Llama 7B #370
Comments
I have an under-development version hosted at https://huggingface.co/TabbyML/CodeLlama-7B. However, we are still working on implementing the tokenization of stop words for line breaks. I will keep you updated on our progress regarding this issue.
Once #371 is merged and released in the daily docker build, the model will be available. Please note that this model is significantly larger (7B) than our current recommendation for a T4 GPU, such as SantaCoder-1B.
@wsxiaoys Would you have an idea of how much VRAM this model consumes? Thank you.
By default, Tabby operates in int8 mode with CUDA, requiring approximately 8 GB of VRAM for CodeLlama-7B.
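For reference, here is a minimal sketch of how one might launch Tabby against this model once it lands in the daily build. The image tag and flags are assumptions based on the usual `tabby serve` invocation and may differ from the released version:

```bash
# Sketch only: assumes the standard tabbyml/tabby image and the usual `serve` flags.
# int8 is the default precision, so no extra quantization flag should be needed.
docker run -it --gpus all -p 8080:8080 -v $HOME/.tabby:/data \
  tabbyml/tabby serve --model TabbyML/CodeLlama-7B --device cuda
```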
Will there be further support for quantisation, like GPTQ, to make even bigger models more usable?
I am surprised, because I've tested on my Nvidia P100 (16 GB VRAM) and the container returns:
With the command:
Would you have any idea?
For the code completion use case, a rough threshold is around 10 billion parameters, above which tensor parallelism (model parallelism) becomes necessary for reasonable latency. Therefore, it's unlikely that we will invest significant effort in that direction. As for FAQ use cases, since the latency requirements there are considerably more relaxed, we are very interested in exploring inference with GPTQ.
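As a back-of-the-envelope sanity check of those numbers (my own arithmetic, not an official benchmark): weight memory is roughly parameter count times bytes per parameter, and autoregressive decoding is largely memory-bandwidth bound, so per-token latency is bounded below by weight bytes divided by GPU memory bandwidth:

```bash
# Rough estimates; bandwidth figures are approximate spec-sheet numbers (assumption),
# and real latency is higher (activations, KV cache and kernel overheads are ignored).
awk 'BEGIN {
  # weight memory: params (billions) * bytes per param
  printf "7B weights:  int8 ~%d GB, fp16 ~%d GB, fp32 ~%d GB\n", 7*1, 7*2, 7*4;
  # lower bound on per-token latency: weight GB / bandwidth (GB/s) * 1000
  printf "7B int8 on T4 (~300 GB/s):  >= %.0f ms/token\n", 7*1/300*1000;
  printf "13B int8 on T4 (~300 GB/s): >= %.0f ms/token\n", 13*1/300*1000;
}'
```

The ~7 GB of int8 weights is consistent with the ~8 GB VRAM figure quoted above once overhead is added.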
I've maybe forgotten the
Could you share your CUDA setup? Maybe attach the output of
Here it is:
Is the Pascal architecture too old?
Yes - int8 precision requires CUDA compute capability >= 7.0 or 6.1, while the P100 has compute capability 6.0 and can therefore only use float32 inference. I added an FAQ section on the website to elaborate on this: https://tabbyml.github.io/tabby/docs/faq
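For anyone hitting the same issue, a quick way to check the compute capability of the local GPU; the `compute_cap` query field is an assumption and requires a reasonably recent NVIDIA driver:

```bash
# Prints GPU name, compute capability and total memory.
# int8 inference needs compute capability >= 7.0 (or 6.1); 6.0 (P100) falls back to float32.
nvidia-smi --query-gpu=name,compute_cap,memory.total --format=csv
```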
Very clear, thank you.
Confirmed working on RTX3070 with |
As far as I know, there are still things that can be done to decrease latency on larger models, with some tricks to overcome the VRAM bandwidth bottleneck and increase GPU utilization.
https://x.com/ggerganov/status/1694775472658198604?s=46 might be worth prioritizing llama.cpp support/integration, since speculative decoding (assisted generation) gives such a high performance bump…
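For context, llama.cpp ships a `speculative` example in which a small draft model proposes a few tokens per step and the large target model verifies them in one pass. The sketch below follows that example; the model filenames are hypothetical and the exact flag names may differ between llama.cpp versions:

```bash
# Sketch only: the draft model (-md) speculates several tokens per step (--draft),
# the target model (-m) accepts or rejects them, which can cut wall-clock latency
# without changing the output distribution.
./speculative -m codellama-34b.Q8_0.gguf \
              -md codellama-7b.Q8_0.gguf \
              --draft 16 -p "// quick sort in C++" -n 256
```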
Please describe the feature you want
Code Llama, released yesterday by Meta, claims better performance than GPT-3.5 for code generation.
I saw the following project: https://huggingface.co/TabbyML/CodeLlama-7B
When is it scheduled to be released?
Thanks a lot to the TabbyML team.