An LLM inference server built with llama-cpp-python and LitServe.
This library aims to be an optimized, inference-only alternative to text-generation-webui. The frontend (Gremory UI) is currently in development and will be released as a separate library.
- Install `uv`.
- If `uv` is already installed, check that it is the latest version (>=0.4.18). Update it with `uv self update`.
- Open cmd/bash in the project folder and run `uv sync` to automatically set up the virtual environment.
- Activate the venv (`.venv\Scripts\activate` or `source .venv/bin/activate`).
- Rename `config.yaml.example` to `config.yaml` and add your local LLM's absolute path to `model_path`.
- Start the server with `uv run server.py`.
`uv run server.py` will run the server on http://localhost:9052. It has a single endpoint (`/v1/chat/completion`), which is compatible with the OpenAI spec. To test it, try `uv run client.py`.
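Because the endpoint follows the OpenAI chat-completions request format, you can also call it directly with any HTTP client. Below is a minimal sketch using `requests`; the payload fields other than `messages` (the model name, `max_tokens`) are illustrative placeholders, not values required by Gremory.

```python
import requests

# Minimal sketch: POST an OpenAI-style chat request to the local Gremory server.
# "model" and "max_tokens" are placeholder values for illustration only.
response = requests.post(
    "http://localhost:9052/v1/chat/completion",
    json={
        "model": "local-model",
        "messages": [
            {"role": "user", "content": "Hello! Who are you?"}
        ],
        "max_tokens": 128,
    },
    timeout=60,
)
print(response.json())
```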
Most LLM inference engines offer limited options for token sampling. While current LLMs work well with deterministic settings (no sampling), sampling makes a big difference in creative tasks such as writing and roleplay.
Gremory lets users customize the flow of the sampling process by passing the sampling parameters as a list.
Here's an example of a sampling parameter setting:
    [
        {
            "type": "min_p",
            "value": 0.1
        },
        {
            "type": "DRY",
            "multiplier": 0.85,
            "base": 1.75,
            "sequence_breakers": ["\n"]
        },
        {
            "type": "temperature",
            "value": 1.1
        }
    ]
In this configuration, the sampling process flows in the following order:
- Min P
- DRY
- Temperature
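To make the ordering concrete, here is a rough numpy sketch of two of those stages (Min P, then Temperature) applied one after another. This is only an illustration of the chained behavior, not Gremory's actual implementation; the DRY step is omitted for brevity.

```python
import numpy as np

def apply_min_p(logits: np.ndarray, p: float) -> np.ndarray:
    """Mask out tokens whose probability is below p * (max token probability)."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    filtered = logits.copy()
    filtered[probs < p * probs.max()] = -np.inf
    return filtered

def apply_temperature(logits: np.ndarray, t: float) -> np.ndarray:
    """Scale logits by 1/t; t > 1 flattens the distribution."""
    return logits / t

# The order of the list decides the order of application:
rng = np.random.default_rng(0)
logits = rng.normal(size=32000)
logits = apply_min_p(logits, 0.1)        # 1. Min P
# 2. DRY (omitted here for brevity)
logits = apply_temperature(logits, 1.1)  # 3. Temperature
```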
Currently, Gremory supports the samplers listed below:
- Temperature
- Top P
- Top K
- TFS
- Min P
- DRY
- XTC
- Unified Sampler
You can also implement your own sampling algorithms by adding a custom `LogitsProcessor` in `src/GremoryServer/modules/sampling.py`.
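As a sketch, a custom processor can be as small as a callable that rewrites the logits array. The class below assumes the `(input_ids, scores) -> scores` callable interface used by llama-cpp-python logits processors; the class name and the bias behavior are purely illustrative.

```python
import numpy as np

class LogitBiasProcessor:
    """Illustrative custom processor: adds a fixed offset to chosen token logits."""

    def __init__(self, bias: dict[int, float]):
        self.bias = bias  # maps token id -> logit offset

    def __call__(self, input_ids: np.ndarray, scores: np.ndarray) -> np.ndarray:
        for token_id, offset in self.bias.items():
            scores[token_id] += offset
        return scores
```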
When the last message has the `assistant` role, Gremory will continue from that message instead of adding another `assistant` message. Gremory also has an `add_generation_prompt` parameter, which lets you force prefill even when the last message is not from the `assistant`.
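For example, a request whose message list ends with an `assistant` turn acts as a prefill. The sketch below only shows the `messages` part of the payload; the rest of the request schema is unchanged from a normal chat request.

```python
# Because the final message has the assistant role, Gremory continues
# the reply from "Once upon a time" instead of starting a new message.
messages = [
    {"role": "user", "content": "Tell me a short story."},
    {"role": "assistant", "content": "Once upon a time"},
]
```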
- Chat Template Support
- Prefill Response
- Update llama-cpp-python to 0.3.1
- GremoryUI (Currently building with Svelte 5 & shadcn-svelte)
- API wiki
- Quantized KV-Cache
- Tests