
Gremory

An LLM inference server built with llama-cpp-python and LitServe.

This library aims to be an optimized, inference-only version of text-generation-webui.

Work is currently underway on the frontend (Gremory UI), which will be a separate library.

Installation

  1. Install uv.
  2. If uv is already installed, make sure it is the latest version (>=0.4.18); update it with uv self update.
  3. Open cmd/bash in the project folder and run uv sync to create the virtual environment and install dependencies.
  4. Activate the venv (.venv\Scripts\activate on Windows, source .venv/bin/activate on Linux/macOS).
  5. Rename config.yaml.example to config.yaml and set model_path to the absolute path of your local LLM.
  6. Start the server with uv run server.py.

How to use

uv run server.py starts the server on http://localhost:9052. It exposes a single endpoint (/v1/chat/completion) that is compatible with the OpenAI API spec.

To test, try out uv run client.py.
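
For reference, here is a minimal sketch of what such a request could look like (a hypothetical stand-in for client.py; the exact payload fields the server accepts may differ):

# Hypothetical example client, not part of the repository.
# Sends an OpenAI-style chat completion request to the local Gremory server.
import requests

response = requests.post(
    "http://localhost:9052/v1/chat/completion",
    json={
        "messages": [
            {"role": "user", "content": "Write a haiku about inference servers."}
        ],
        "max_tokens": 128,
    },
    timeout=60,
)
print(response.json())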

Features

OpenAI Compatible

Sampler API

Most LLM inference engines offer only limited options for token sampling. While current LLMs work well with deterministic settings (no sampling), sampling makes a big difference in creative tasks such as writing and roleplay.

Gremory lets users customize the flow of the sampling process by passing the sampling parameters as a list.

Here's an example of a sampling parameter setting:

[
    {
        "type": "min_p",
        "value": 0.1
    },
    {
        "type": "DRY",
        "multiplier": 0.85,
        "base": 1.75,
        "sequence_breakers": ["\n"]
    },
    {
        "type": "temperature",
        "value": 1.1
    }
]

In this configuration, the sampling process flows in the following order:

  1. Min P
  2. DRY
  3. Temperature
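
Conceptually, the list is applied as a pipeline: each sampler transforms the logits left by the previous one. A simplified sketch of that idea (not the actual server code, which lives in src/GremoryServer/modules/sampling.py):

# Simplified illustration of sequential sampler application; the real
# implementation in Gremory may structure this differently.
def apply_sampler_chain(logits, processors):
    for process in processors:
        # The order of the list is the order of the pipeline.
        logits = process(logits)
    return logits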

Supported Samplers

Currently, Gremory supports the samplers listed below:

You can also implement your own sampling algorithms by adding a custom LogitsProcessor in src/GremoryServer/modules/sampling.py.
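
As an illustration (not code from the repository), a logits processor in the llama-cpp-python style is simply a callable that takes the token ids generated so far and the current logits, and returns modified logits; the exact base class or registration mechanism expected in sampling.py may differ:

# Hypothetical custom sampler sketch in the llama-cpp-python callable style.
# The interface expected by src/GremoryServer/modules/sampling.py may differ.
import numpy as np

class TopKLogitsProcessor:
    """Keep only the k highest-scoring tokens; mask out the rest."""

    def __init__(self, k: int = 40):
        self.k = k

    def __call__(self, input_ids: np.ndarray, scores: np.ndarray) -> np.ndarray:
        if self.k >= scores.shape[-1]:
            return scores
        # Value of the k-th largest logit; everything below it is masked.
        threshold = np.partition(scores, -self.k)[-self.k]
        return np.where(scores < threshold, -np.inf, scores)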

Prefill Response

What is Prefill Response?

When the last message has the assistant role, Gremory continues generating from that message instead of appending a new assistant message. Gremory also has an add_generation_prompt parameter, which can force prefill behavior even when the last message is not from the assistant.
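
For example, a request whose last message already carries the assistant role would continue from that text rather than start a new turn (a hypothetical sketch; field names follow the OpenAI chat format):

# Hypothetical prefill example: the final assistant message acts as the prefix
# that generation continues from, rather than starting a fresh assistant turn.
import requests

payload = {
    "messages": [
        {"role": "user", "content": "Tell me a short story."},
        {"role": "assistant", "content": "Once upon a time"},
    ],
}
response = requests.post("http://localhost:9052/v1/chat/completion", json=payload)
print(response.json())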

TODO

  • Chat Template Support
  • Prefill Response
  • Update llama-cpp-python to 0.3.1
  • GremoryUI (currently being built with Svelte 5 & shadcn-svelte)
  • API wiki
  • Quantized KV-Cache
  • Tests
