Evaluating the capabilities of large language models (LLMs) typically requires testing a variety of models from different vendors with different APIs. Golem provides a simple, consistent command-line interface to models from OpenAI, Anthropic, Google, Microsoft Azure and Ollama.
A key problem in LLM benchmarking is that "instance by instance [LLM] results ... are rarely made available" (Burnell et al. 2023). Golem can be used interactively, but is primarily designed for batch-mode operation. Golem grew out of the work by Cohn and Blackwell (2024), where we had a large number of questions that we wanted to test on multiple LLMs. We wanted to log all LLM requests and responses in JSONL format, ensuring full traceability of prompts, models, timestamps and results.
Golem is alpha quality software and a work in progress; I support it on a reasonable endeavours basis. Pull requests and user feedback are welcome.
Golem is written in Python and designed to work best in a Unix command line environment, for example Linux, macOS, or Windows Subsystem for Linux. You can probably use it from within a Windows Command Prompt, but Windows command prompts are rarely a happy experience. These instructions assume that you already have a working system with Python (version 3.8 or later) and Git installed. You may also wish to install jq, a lightweight and flexible command-line JSON processor.
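If jq is not already installed, it is usually available from your system's package manager, for example:
sudo apt-get install jq   # Debian / Ubuntu
brew install jq           # macOS with Homebrew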
- Clone the repository
First, clone the repository to your local machine using git:
git clone https://github.com/RobBlackwell/golem.git
cd golem
- Set up a virtual environment (optional but recommended)
It's a good practice to create a virtual environment to manage Python dependencies.
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
- Install dependencies
Use the requirements.txt file to install the necessary Python packages:
pip install -r requirements.txt
- Install the package
After installing the dependencies, you can install the golem package itself:
pip install .
- Run the golem tool
Now that the software is installed, you can test the command line tool by running:
golem -h
Golem supports a range of APIs and models, but you typically need to register with an API provider to obtain access credentials.
Having registered with OpenAI, you can specify your key as part of the command line, e.g.:
golem --provider openai --key "YOUR-KEY-HERE" "Why is the sky blue?"
Alternatively, if no key is provided, Golem will look for a shell variable called OPENAI_API_KEY. In bash you can set this as follows:
export OPENAI_API_KEY="YOUR-KEY-HERE"
golem --provider openai "Why is the sky blue?"
Having registered with Azure and deployed a model as part of the Azure OpenAI API service, you should have an endpoint and a key. These can be specified with the --url and --key command line arguments respectively, or you can set the corresponding shell variables AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY.
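For example, assuming an existing Azure OpenAI deployment (the endpoint below is a placeholder), you might set the variables and run:
export AZURE_OPENAI_ENDPOINT="https://YOUR-RESOURCE.openai.azure.com/"
export AZURE_OPENAI_API_KEY="YOUR-KEY-HERE"
golem --provider azure --model gpt-35-turbo-0125 "Why is the sky blue?"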
Having deployed an Azure AI model, you should have an endpoint and a key. These can be specified with the --url and --key command line arguments respectively, or you can set the corresponding shell variables AZUREAI_ENDPOINT_URL and AZUREAI_ENDPOINT_KEY.
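As a rough sketch using the command line arguments instead (the provider name and endpoint URL here are assumptions for illustration; check golem -h for the exact provider value):
# Provider name "azureai" and the endpoint URL are assumptions, not confirmed values.
golem --provider azureai --url "https://YOUR-DEPLOYMENT.models.ai.azure.com" --key "YOUR-KEY-HERE" "Why is the sky blue?"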
For Google Vertex you need to install the gcloud client, because Golem uses gcloud auth print-access-token to obtain credentials. You must also set CLOUDSDK_COMPUTE_REGION to the cloud region you are using (perhaps "europe-west2" for London) and CLOUDSDK_CORE_PROJECT to the name of your gcloud project.
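For example (the project name below is a placeholder):
export CLOUDSDK_COMPUTE_REGION="europe-west2"
export CLOUDSDK_CORE_PROJECT="my-gcloud-project"
gcloud auth login                            # authenticate if you have not already
gcloud auth print-access-token > /dev/null   # check that credentials are available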
Assuming you have a running Ollama instance on your local machine (see the Ollama documentation for installation instructions) on port 11434 (the default), then Ollama should just work with Golem, e.g.:
golem --provider ollama --model llama3 "Why is the sky blue?"
However, you can override the endpoint by passing --url explicitly if your Ollama instance is running on a different port or URL.
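For example, if Ollama were listening on port 12345, a sketch of the override might look like this (the full chat endpoint shown mirrors the default URL visible in the log output below):
golem --provider ollama --model llama3 --url "http://localhost:12345/api/chat" "Why is the sky blue?"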
Here is a simple example:
% golem "What is one plus plus?"
{"id": 1, "provider": "ollama", "model": "llama3", "timestamp": "2024-09-16T12:49:45.171742+00:00", "request": {"url": "http://localhost:11434/api/chat", "headers": {}, "json": {"model": "llama3", "messages": [{"role": "user", "content": "What is one plus plus?"}], "stream": false, "options": {}}}, "response": {"model": "llama3", "created_at": "2024-09-16T12:49:45.170587Z", "message": {"role": "assistant", "content": "A clever question!\n\nIn mathematics, \"one plus\" can be written as 1+. This is often used as a shorthand to represent the expression \"one plus some quantity\", where the quantity is implied but not explicitly stated.\n\nSo, in this case, \"one plus plus\" would mean... (drumroll please)... 2!"}, "done_reason": "stop", "done": true, "total_duration": 2436901916, "load_duration": 19158375, "prompt_eval_count": 16, "prompt_eval_duration": 353868000, "eval_count": 69, "eval_duration": 2063010000}, "answer": "A clever question!\n\nIn mathematics, \"one plus\" can be written as 1+. This is often used as a shorthand to represent the expression \"one plus some quantity\", where the quantity is implied but not explicitly stated.\n\nSo, in this case, \"one plus plus\" would mean... (drumroll please)... 2!", "repeat": 0}
Note that no provider or model was specified, so Golem has selected ollama and llama3 respectively as defaults.
There is quite a lot of information here, including the timestamp and model-specific outputs. Because each model has its own format, Golem always includes a copy of the answer under the "answer" key, so that the output is consistent whatever the model. If you just want the answer, you can use jq to query the output like this:
% golem "What is one plus two?" | jq -r .answer
One plus two is... 3!
It's useful to pipe the output to a JSONL file:
% golem "What is one plus three?" > answers.jsonl
You can then query answers.jsonl using jq, for example, to find timestamps:
jq .timestamp answers.jsonl
"2024-09-16T13:02:28.968544+00:00"
So far, we have included the LLM prompt on the command line, but a more typical use case is to take a list of prompts from a JSONL file and record the results in another JSONL file:
golem --provider openai -f prompts.jsonl > answers.jsonl
The prompts.jsonl file is expected to have id and messages fields, e.g.:
{"id": "1", "messages": [{"role": "user", "content": "If a storm is moving from north to south and you are facing the storm, which directions are you facing?"}]}
{"id": "2", "messages": [{"role": "user", "content": "If you are watching the sun rise, which direction are you facing?"}]}
A more comprehensive example based on these kinds of questions is included in the example directory.
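Once the run completes, the id field lets you match each answer back to its prompt. For example, assuming answers.jsonl was produced as above, you can tabulate ids and answers with jq:
jq -r '[.id, .answer] | @tsv' answers.jsonl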
Suppose you want to repeat an experiment ten times to assess variability:
golem --provider azure --model gpt-35-turbo-0125 --repeat "0:10" -f prompts.jsonl > answers.jsonl
Suppose you want to vary temperature from 0.0 to 1.0 in steps of 0.2:
golem --provider azure --model gpt-35-turbo-0125 --temperature "0.0:1.1:0.2" -f prompts.jsonl > answers.jsonl
You can also vary top_p similarly, and combine these options with repeat.
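For example, combining --repeat and --temperature to sweep the temperature range on each of five repeats:
golem --provider azure --model gpt-35-turbo-0125 --repeat "0:5" --temperature "0.0:1.1:0.2" -f prompts.jsonl > answers.jsonl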
Additional help and documentation can be found by typing:
golem -h
If you encounter a bug, please report it as a GitHub issue. It can be useful to include detailed debugging information by using the verbose flag, e.g.:
golem -v "Why is the sky blue"
A Golem is an artificial human being described in Hebrew folklore and endowed with life. The word is often used to mean an automaton.
The JSON Lines format is convenient to work with in data science languages such as R, Python and Julia.
jq is a lightweight and flexible command-line JSON processor. jq makes it easy to inspect and query files at the command line.
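For example, assuming an answers.jsonl produced as above, you can filter records by model and print just the answers:
jq -r 'select(.model == "llama3") | .answer' answers.jsonl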
JSONL files are easily editable using text editors.
You can keep all your credentials in a tokens.sh file in a private directory and just consult it before using Golem by typing source ~/tokens.sh.
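A minimal tokens.sh might look like this (the values are placeholders; include only the variables for the providers you use):
export OPENAI_API_KEY="YOUR-KEY-HERE"
export AZURE_OPENAI_ENDPOINT="https://YOUR-RESOURCE.openai.azure.com/"
export AZURE_OPENAI_API_KEY="YOUR-KEY-HERE"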
Note that any and all credentials are marked as REDACTED in the log, making it safe to share outputs.
You might also consider using prompto, which offers asynchronous querying and is supported by the Alan Turing Institute Research Engineering team. Prompto's design philosophy is more Pythonic and pipeline oriented, whilst Golem is designed to play well with the Unix shell and the jq JSON processor. Golem uses HTTP REST APIs directly rather than Python libraries, and is easier to hack on and trace for one-off experiments.
Blackwell, R.E. and Cohn, A.G., 2024. Golem - a command line interface for benchmarking Large Language Models. Zenodo. https://doi.org/10.5281/zenodo.14035711
Burnell, R., et al., 2023. Rethink reporting of evaluation results in AI. Science, 380(6641), pp. 136-138. https://doi.org/10.1126/science.adf6369
Blackwell, R.E., Barry, J. and Cohn, A.G., 2024. Towards Reproducible LLM Evaluation: Quantifying Uncertainty in LLM Benchmark Scores. arXiv preprint. https://arxiv.org/abs/2410.03492
Cohn, A.G. and Blackwell, R.E., 2024. Evaluating the Ability of Large Language Models to Reason About Cardinal Directions (Short Paper). In 16th International Conference on Spatial Information Theory (COSIT 2024). Schloss Dagstuhl–Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPIcs.COSIT.2024.28