An LLM inference server built with llama-cpp-python and LitServe.
This library aims to be an optimized, inference-only alternative to text-generation-webui. The frontend (Gremory UI) is currently in development and will be released as a separate library.
- Install `uv`.
- If `uv` is already installed, check that it is the latest version (>=0.4.18). Update it with `uv self update`.
- Open cmd/bash in the project folder and run `uv sync` to automatically set up the virtual environment.
- Activate the venv (`.venv\Scripts\activate` or `source .venv/bin/activate`).
- Rename `config.yaml.example` to `config.yaml` and add your local LLM's absolute path to `model_path`.
- Start the server with `uv run server.py`.
`uv run server.py` will run the server on http://localhost:9052. It has a single endpoint (`/v1/chat/completion`), which is compatible with the OpenAI spec. To test it, try `uv run client.py`.
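Because the endpoint follows the OpenAI chat-completions request format, you can also call it directly with any HTTP client. Below is a minimal sketch using `requests`; the payload fields other than `messages` (the model name, `max_tokens`) are illustrative placeholders, not values required by Gremory.

```python
import requests

# Minimal sketch: POST an OpenAI-style chat request to the local Gremory server.
# "model" and "max_tokens" are placeholder values for illustration only.
response = requests.post(
    "http://localhost:9052/v1/chat/completion",
    json={
        "model": "local-model",
        "messages": [
            {"role": "user", "content": "Hello! Who are you?"}
        ],
        "max_tokens": 128,
    },
    timeout=60,
)
print(response.json())
```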
Most LLM inference engines offer limited options for token sampling. While current LLMs work well with deterministic settings (no sampling), sampling makes a big difference in creative tasks such as writing and roleplay.
Gremory lets users customize the flow of the sampling process by passing the sampling parameters as a list.
Here's an example of a sampling parameter setting:
    [
        {
            "type": "min_p",
            "value": 0.1
        },
        {
            "type": "DRY",
            "multiplier": 0.85,
            "base": 1.75,
            "sequence_breakers": ["\n"]
        },
        {
            "type": "temperature",
            "value": 1.1
        }
    ]
In this configuration, the sampling process flows in the following order:
- Min P
- DRY
- Temperature
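To make the ordering concrete, here is a rough numpy sketch of two of those stages (Min P, then Temperature) applied one after another. This is only an illustration of the chained behavior, not Gremory's actual implementation; the DRY step is omitted for brevity.

```python
import numpy as np

def apply_min_p(logits: np.ndarray, p: float) -> np.ndarray:
    """Mask out tokens whose probability is below p * (max token probability)."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    filtered = logits.copy()
    filtered[probs < p * probs.max()] = -np.inf
    return filtered

def apply_temperature(logits: np.ndarray, t: float) -> np.ndarray:
    """Scale logits by 1/t; t > 1 flattens the distribution."""
    return logits / t

# The order of the list decides the order of application:
rng = np.random.default_rng(0)
logits = rng.normal(size=32000)
logits = apply_min_p(logits, 0.1)        # 1. Min P
# 2. DRY (omitted here for brevity)
logits = apply_temperature(logits, 1.1)  # 3. Temperature
```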
Currently, Gremory supports the samplers listed below:
- Temperature
- Top P
- Top K
- TFS
- Min P
- DRY
- XTC
- Unified Sampler
You can also implement your own sampling algorithms by adding a custom `LogitsProcessor` in `src/GremoryServer/modules/sampling.py`.
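As a sketch, a custom processor can be as small as a callable that rewrites the logits array. The class below assumes the `(input_ids, scores) -> scores` callable interface used by llama-cpp-python logits processors; the class name and the bias behavior are purely illustrative.

```python
import numpy as np

class LogitBiasProcessor:
    """Illustrative custom processor: adds a fixed offset to chosen token logits."""

    def __init__(self, bias: dict[int, float]):
        self.bias = bias  # maps token id -> logit offset

    def __call__(self, input_ids: np.ndarray, scores: np.ndarray) -> np.ndarray:
        for token_id, offset in self.bias.items():
            scores[token_id] += offset
        return scores
```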
When the last message has the `assistant` role, Gremory will continue from that message instead of adding another `assistant` message. Gremory also has an `add_generation_prompt` parameter, which lets you force prefill even when the last message is not from the `assistant`.
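For example, a request whose message list ends with an `assistant` turn acts as a prefill. The sketch below only shows the `messages` part of the payload; the rest of the request schema is unchanged from a normal chat request.

```python
# Because the final message has the assistant role, Gremory continues
# the reply from "Once upon a time" instead of starting a new message.
messages = [
    {"role": "user", "content": "Tell me a short story."},
    {"role": "assistant", "content": "Once upon a time"},
]
```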
- Chat Template Support
- Prefill Response
- Update llama-cpp-python to 0.3.1
- GremoryUI (Currently building with Svelte 5 & shadcn-svelte)
- API wiki
- Quantized KV-Cache
- Tests