Llamaxing is a modular API gateway that allows you to control traffic between users/applications and the (Azure) OpenAI API. It comes with a number of modules, but can easily be extended to suit your particular needs. The main features are:
- Custom authentication/authorization flows
- Built-in support for API key and JWT authentication
- Sidecar mode with built-in support for API key and Microsoft Entra ID authentication
- Logging of all request/responses for auditing and analytics
- Built-in support for MongoDB
- Integration with observability platforms
- Built-in support for Langfuse with automatic creation of traces/generations for requests sent to Llamaxing
- Load balancing across multiple OpenAI API deployments
The documentation for this project is still quite limited. For an overview of Llamaxing's functionality, take a look at the examples below.
Llamaxing needs a list of models provided as a JSON file (by default called models.json
). Examples are given here.
The file should contain a list of JSON objects with the following schema:
Parameter | Type | Description | Options |
---|---|---|---|
id |
string | Model name used when calling the API endpoints | |
aliases |
string | List of alternative names for the model | |
capabilities |
list of strings | Type of model. Used to match models with API endpoints | chat_completions , completions , embeddings , images_generations |
instances |
list of objects | List of API deployments for the model. If more than one deployment is provided, Llamaxing will load balance across all deployments for the model. |
The instance object has the following schema:
Parameter | Type | Description | Options |
---|---|---|---|
id |
string | ID of model/deployment combination. Must be unique | |
provider |
string | API provider | azure , openai |
openai_organization |
string | Organization ID passed to OpenAI API | null or any string |
openai_api_key |
string | OpenAI API key | Literal API key or environment variable (see notes) |
azure_endpoint |
string | Azure OpenAI deployment endpoint | |
azure_deployment |
string | Azure OpenAI deployment name | Literal name or environment variable (see notes) |
azure_api_key |
string | Azure OpenAI API key | Literal API key or environment variable (see notes) |
azure_api_version |
string | Azure OpenAI API version |
Notes:
- Parameters prepended with
openai_
should only be included if the provider is OpenAI, likewise for Azure. - For parameters where it is noted, you can reference an environment
variable rather than specifying the actual value, e.g.
"${OPENAI_API_KEY}"
. - It is possible to mix Azure and OpenAI when specifying model API deployments.
If authentication is enabled, Llamaxing needs a list the identities that are authorized to access the API. These identities can represent both users and applications. The list is provided as a JSON file (by default called identities.json
).
The file should contain a list of JSON objects with the following schema:
Parameter | Type | Description |
---|---|---|
id |
string | ID of the identity. Typically this would be application name, username or email address |
auth_key |
string | Key used to identify the identity. Depends on the chosen auth. method. See below |
name |
strings | Name of the identity |
info |
object | Any additional information about the identity to be included in logging. Not used in Llamaxing |
observability |
object | Parameters passed to observability client |
The only observability platform currently supported is Langfuse. To enable it for an identity, add the following parameters to the observability object:
Parameter | Type | Description |
---|---|---|
langfuse_public_key |
string | Langfuse public key |
langfuse_secret_key |
string | Langfuse secret key |
Settings for Llamaxing are defined here and
can be set using environment variables or a .env
file. The main settings are:
Parameter | Type | Description | Options | Default |
---|---|---|---|---|
app_name |
string | Application name | llamaxing |
|
app_mode |
string | Application mode | gateway , sidecar |
gateway |
app_requests_timeout |
int | Timeout limit when sending requests upstream | 300 | |
debug_level |
int | Controls how much is logged to the console | 0: No debugging information. 1: Requests, responses, and auth. details. 2: Raw streaming responses. 3: Sensitive information such as access tokens, use with caution! | 0 |
auth_method |
string | Authentication method | none , apikey , jwt |
none |
identity_store |
string | Identity store | none , json |
none |
logging_client |
string | Logging client | none , mongodb |
none |
observability_client |
string | Observability client | none , langfuse |
none |
For each parameter that defaults to none
, there are additional parameters that should be set, if you change it to a different value.
For example, if you set auth_method
to jwt
, then there are a number of parameters (all starting with auth_method_jwt_
) that you need to consider. See settings.py for a full list.
A good place to start is the simplest example: 01-simple-noauth. Here all modules (including authentication) are disabled.
To run it, you need to create a .env
file in the folder with your API key based on either of the samples (depending on whether you use OpenAI or Azure) and bring up the container using docker compose:
docker compose up
You can now navigate to http://localhost:8000/ to access the API docs, or try out one of the demos.
In this example, 02-simple-apikey, we add API key authentication. In this scenario you have a number of users/applications that need access to OpenAI, but you do not want to simply share your API key with all of them. Instead you assign each of them a unique API key and have them access OpenAI through Llamaxing.
Compared to the previous example we now set AUTH_METHOD=apikey
and IDENTITY_STORE=json
and create an identities.json file with a single identity. Note that the identity is identified by the API key it has been provided (1234
in this example)
which is set using the auth_key
parameter in identities.json
. If you bring up the container with docker compose and access the documentation, you will see
that you now need to provide the API key, and you will see in the output from Llamaxing that the identity of the caller is logged when requests are made.
In the third example, 03-azure-jwt-oauth2-proxy, authentication is now done using JSON Web Tokens (JWT). Llamaxing can be configured to verify the signature of the JWTs (it isn't be default, but should be in a production environment), and it will reject requests from unknown identities. However, typically you would deploy an authenticating proxy in front of Llamaxing that would have already verified the JWT before passing it on to Llamaxing. In this example we use OAuth2 Proxy, but often this would be handled by the ingress controller on your Kubernetes cluster.
The example uses Microsoft Entra ID as identity provider and assumes that you already have an Azure subscription.
In order to run it, you will need to create an app registration,
and fill in all the relevant values in the .env
file based on one of the sample files. All values that
need to be filled in are written in brackets, e.g. <tenant id>
.
You can follow these steps to set up the app registration:
- Note down your Tenant ID and the Client ID of your app registration.
- Go to Certificates & secrets, create a client secret and note it down.
- Go to Authentication -> Add platform -> Web and set the Redirect URI to
http://localhost:4180/oauth2/callback
. - Go to API permissions -> Add a permission -> Microsoft Graph -> Delegated permissions and select
email
,offline_access
,openid
,profile
and click Add permissions. - Go to API permissions -> Grant admin consent for Default Directory.
- Go to Manifest -> Edit the manifest to have
"accessTokenAcceptedVersion": 2
and save.
Based on this you now have most of the values needed to populate the .env
file.
However, you also need to generate a cookie secret for OAuth2 Proxy. This can be done using the following:
python -c 'import os,base64; print(base64.urlsafe_b64encode(os.urandom(32)).decode())'
Since we now use Microsoft Entra ID, we will use the Object ID of users/applications to define
our identities. In the Azure portal go to Microsoft Entra ID -> Users,
find your user and note down the Object ID and fill it in as the value for auth_key
in identities.json
(using
sample-identities.json
as a template).
Note, that while Llamaxing uses Object ID by default, this can be changed using the auth_method_jwt_id_key
parameter.
You should now be able to bring up Llamaxing using docker compose up
and access it through OAuth2 Proxy by navigating to
http://localhost:4180/.
This example, 04-azure-jwt-full-stack, builds on the one above. We now add logging to MongoDB and observability using Langfuse. Furthermore, we include an instance of Llamaxing running in sidecar mode to show how authentication can be off-loaded if your application doesn't support your authentication scheme.
In order to support these additions, we need to modify the app registration created above. On top of being used for single sign-on (SSO) in OAuth2 Proxy, as above, we will also use it for SSO in Langfuse, and it will provide the identity needed to authenticate from the sidecar. Note, that it is purely for simplicity that we are using the same app registration for all these purposes.
Go to the Azure portal, navigate to your app registration and do the following:
- Go to Expose an API -> Application ID URI / Add -> Save. Note down the Application ID URI which should be of the form:
api://<client id>
. - Go to Authentication -> Web / Add URI, add
http://localhost:3000/api/auth/callback/azure-ad
and save. - Go to Overview -> Managed application in local directory -> click on the name of the app registration. Note down the Object ID, this is the Enterprise Application Object ID.
Together with the values from the previous example you can now create a .env
file based on one of the samples.
Langfuse needs two generated secrets, these can be created as in the previous example.
You also need to create an identities.json
file based on the sample file, but before doing this you will need
to generate API keys for Langfuse:
- Bring up the stack with
docker compose up
. - Navigate to http://localhost:3000/ and log in to Langfuse.
- Create a new project, go to Settings -> API keys -> Create new API keys. Note down the keys.
You now have all the details needed to populate identities.json
. Once this is done, restart Llamaxing
to reload the configuration.
You now have a number of services running:
- As in the previous example you can access Llamaxing through OAuth2 Proxy at http://localhost:4180/, and if you make a request you will see in the console output that you are identified by your Microsoft Entra ID user.
- Llamaxing is also running in sidecar mode at http://localhost:8000/. If you try to access it you will see that you do not need to authenticate, but if you make a request you'll see in the console output that the request is authenticated using the identity of the app registration. This mode is useful if you have an application that needs access to OpenAI but isn't compatible with your authentication scheme. See here for more on this architectural pattern.
- An instance of MongoDB is running and all requests will be logged to it.
- As already mentioned, Langfuse is running at http://localhost:3000/. Since the API keys
for Langfuse have been added to both identities in
identities.json
all request will also be visible in Langfuse. By default a request will be added to Langfuse as a trace with a single generation. The ID of the identity that made the request will be added (as User ID) and model details will be filled in. If you want to add more details and group multiple requests then you need to send the appropriate information in a custom field calledobservation_metadata
. See demo-langfuse.py for an example.