Replies: 2 comments 2 replies
-
I would love something like this that could work with an OpenAI-compatible endpoint. I serve models locally using vLLM or Aphrodite; llama.cpp is too slow and doesn't support concurrency.
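For context, servers like vLLM expose an OpenAI-compatible endpoint that any OpenAI client can talk to. Below is a minimal sketch of what pointing a client at such a locally served endpoint looks like; the URL, port, API key, and model name are assumptions for illustration, not anything LARS currently ships with.

```python
# Minimal sketch: querying a locally served OpenAI-compatible endpoint
# (e.g. one started with `vllm serve`). Values below are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local vLLM endpoint
    api_key="not-needed-for-local",       # placeholder; local servers typically ignore it
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # hypothetical model name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```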
-
I'm presently working on adding support for a new, self-developed backend: HF-Waitress. This new backend adds support for HF-Transformers and AWQ-quantized models directly off the Hub, while providing on-the-fly quantization via BitsAndBytes, HQQ and Quanto. It also negates the need to manually download LLMs yourself, working off the model name alone to do the rest. It works out of the box with no setup necessary, and provides concurrency and streaming responses, all within a single platform-agnostic Python script that can be ported anywhere. It will soon be the default LLM loader in LARS!

As Ollama is another implementation of llama.cpp, explicit support for it is not planned at this time, though I recognize the benefits. llama.cpp will be retained in LARS as a user-electable alternative to HF-Waitress for GGUF models, primarily due to its advantage of hybrid inferencing, and you'll be able to bring in your own GGUFs same as today.

OpenAI support is not planned at this time, as LARS remains open-source and local-deployment centric. However, the code to make OpenAI work is already in the LARS codebase, so if an official engagement necessitates it, I will work on enabling it. In the meanwhile, community contributions are absolutely welcome as always for these features!
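To illustrate the kind of on-the-fly quantization described above, here is a minimal sketch using plain transformers with BitsAndBytes; this is not the HF-Waitress API, just the underlying mechanism of loading a model off the Hub by name and quantizing it at load time. The model name is an assumption chosen for the example.

```python
# Sketch of on-the-fly BitsAndBytes quantization, loading directly off the Hub
# by model name with plain transformers (not the HF-Waitress API itself).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # hypothetical example model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize weights to 4-bit at load time
    bnb_4bit_compute_dtype=torch.float16,  # run compute in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,  # triggers the on-the-fly quantization
    device_map="auto",               # place layers across available devices
)

inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```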
-
I already have a local instance of Ollama that I'm using for other AI applications. Can I point LARS at that as opposed to installing new models on the host machine?