NB: This is an experiment, but I welcome feedback if it's useful for you!
This is a stateless server that can compute a "sentence embedding" from given English words or sentences; the request/response model uses a simple JSON-based HTTP API.
A sentence embedding is a way of encoding words or phrases into a vector space, such that
semantically similar inputs are "close together" as defined by the
distance metric of the vector space. For example, the terms "iPhone" and
"popular smartphone" can each be transformed into a real-valued vector of N
entries; in such a system, they will be considered "close together", while
"iPhone" and "Roku" would be further apart. This is useful for certain forms of
semantic search, for instance.
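To make "close together" concrete, here is a minimal sketch using the sentence-transformers library directly (the same library this server uses internally); the model name matches one of the server's supported models:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(["iPhone", "popular smartphone", "Roku"])

# Cosine similarity is the usual distance metric here; a higher score
# means "closer together" in the vector space.
print(util.cos_sim(vectors[0], vectors[1]))  # iPhone vs. "popular smartphone"
print(util.cos_sim(vectors[0], vectors[2]))  # iPhone vs. "Roku" (lower)
```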
Internally this server is written in Python, and uses the sentence-transformers library and (transitively) PyTorch to compute embeddings. The API is served over FastAPI by way of hypercorn. The system is packaged and developed with Nix.
Because this server is completely stateless, it can be scaled horizontally with more workers, though Python will likely always imply some level of compute/latency overhead versus a more optimized solution. However, it is simple to deploy, simple to extend, and simple to understand.
There are probably other various clones and/or copies of this idea; but this one is mine.
You have two options to run the server, in general:

- Nix with `nix run` (hacking, quick ease of use)
- Docker/podman/OCI runtime (probably everywhere else)

```bash
docker run --rm -it -p 5000:5000 ghcr.io/thoughtpolice/embedding-server:latest
# OR
nix run --tarball-ttl 0 github:thoughtpolice/embedding-server
```
The server is now bound to port 5000 (the default). Now you can query the model list, and then encode two independent words with one request:
```bash
http get localhost:5000/v1/models
http post localhost:5000/v1/embeddings \
    model=all-MiniLM-L6-v2 \
    input:='["iPhone","Samsung"]'
```
To monitor the server, use the `/metrics` endpoint.
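For example, with httpie:

```bash
http get localhost:5000/metrics
```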
The API is loosely inspired by the OpenAI Embeddings API, as used with `text-embedding-ada-002`.
A request to `v1/models` is a GET request, with no body. The response is a JSON object like the following, listing all possible models you can use with the `v1/embeddings` endpoint:
```json
{
  "data": [
    "all-MiniLM-L6-v2",
    "nomic-embed-text-v1",
    "nomic-embed-text-v1.5"
  ],
  "object": "list"
}
```
A request to `v1/embeddings` is a POST request, with a JSON object body containing two fields:

- `input: list[string]`
- `model: string`
The `input` can simply be a list of words or phrases; the `model` is the text embedding model to use, which must be one of the options returned from `v1/models`.
Given a JSON request:
```json
{
  "model": "all-MiniLM-L6-v2",
  "input": ["iPhone", "Samsung"]
}
```
The response JSON will look like the following:
```json
{
  "data": [
    {
      "dims": 384,
      "embedding": [
        -0.02878604456782341,
        0.024760600179433823,
        0.06652576476335526,
        ...
      ],
      "index": 0,
      "object": "embedding"
    },
    {
      "dims": 384,
      "embedding": [
        -0.13341815769672394,
        0.049686793237924576,
        0.043825067579746246,
        ...
      ],
      "index": 1,
      "object": "embedding"
    }
  ],
  "model": "all-MiniLM-L6-v2",
  "object": "list"
}
```
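As a sketch, the same request can be made from Python with the requests library (assuming the server is running locally on the default port):

```python
import requests

# Request embeddings for two inputs from a locally running server.
resp = requests.post(
    "http://localhost:5000/v1/embeddings",
    json={"model": "all-MiniLM-L6-v2", "input": ["iPhone", "Samsung"]},
)
resp.raise_for_status()

# Each entry in "data" carries the vector, its dimensionality, and the
# index of the input it corresponds to.
for item in resp.json()["data"]:
    print(item["index"], item["dims"], item["embedding"][:3])
```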
This is fairly self-explanatory, and effectively the only possible response, though the `object` fields will help the schema evolve in the future. The `data` field is a list of objects, each containing the dimensions of the vector (`dims`) as well as the `index` referring to which input the embedding is for.
The server supports "Matryoshka" embeddings, which remain valid even when truncated to a smaller dimension: for example, if the server responds with a 768-dimensional vector, you can safely drop the last 384 values to get a 384-dimensional vector that is still valid, but with reduced accuracy.
This feature allows you to use the same model for different applications with different size/accuracy tradeoffs. For example, search queries may be able to get by with ~10% of the original dimensionality (~78 dimensions), while retaining 90% or more of the accuracy of the full-sized vector.
Currently the only supported Matryoshka model is `nomic-embed-text-v1.5`, which by default acts identically to the `nomic-embed-text-v1` model; there is no way in an API request to ask for a specific vector dimensionality, so the caller is responsible for truncating the relevant vectors to the desired dimension afterwards if they need it.
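For illustration, client-side truncation is a few lines of NumPy; the helper name here is hypothetical, and the re-normalization step is a common convention so that cosine similarities stay well-scaled after truncation:

```python
import numpy as np

def truncate_matryoshka(embedding, dims):
    """Truncate a Matryoshka embedding to `dims` dimensions and
    re-normalize it to unit length."""
    v = np.asarray(embedding[:dims], dtype=np.float32)
    return v / np.linalg.norm(v)

# e.g., shrink a 768-dimensional nomic-embed-text-v1.5 vector:
# small = truncate_matryoshka(item["embedding"], 78)
```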
Note that the Nomic `v1.5` model, at the default dimensionality (768), has negligible performance reduction compared to `v1`, so you are free to use `v1.5` by default if you are unsure which you want.
Install direnv into your shell of choice, move into this repository, and `direnv allow` the `.envrc` file; then you can run the `./embedding-server.py` script directly as you wish.
The `flake.nix` file describes the actual Nix packages that are exported/built from this repo. In short:

```bash
nix build '.#embedding-server'
nix build '.#docker-image'
```
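Depending on how the image is constructed, the `docker-image` output can then be loaded into a local daemon, e.g. with `docker load < result` for a standard image tarball (this assumes the usual Nix dockerTools layout; check the flake for the exact output type).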
You need `impure-derivations` enabled in `experimental-features` and, practically speaking, Nix 2.15 or later, since that's what I test with.
This package tries using `impure-derivations` to package model data. This feature allows us to write a Nix expression which, in its body, downloads the model data from huggingface.co over the network; this data is then "purified" with a fixed-output derivation.
This lets the `embedding-server.py` script itself act as the single source of truth for all model data, so we don't have to manually replicate the downloads of each model inside a `.nix` file. However, we do have to update the hash of the fixed-output derivation, and it isn't clear if the Hugging Face libraries can be configured to download stable model versions. We may have to use another approach, eventually.
MIT or Apache-2.0; see the corresponding `LICENSE-*` files for details.