feat: major eval revamp, openrouter support, removed --llm in favor of `--model <provider>/<model>`
ErikBjare committed Aug 9, 2024
1 parent 746c733 commit 6fa8016
Showing 13 changed files with 383 additions and 175 deletions.
7 changes: 4 additions & 3 deletions README.md
@@ -230,9 +230,10 @@ Options:
--name TEXT Name of conversation. Defaults to generating
a random name. Pass 'ask' to be prompted for
a name.
--llm [openai|anthropic|azure|local]
LLM provider to use.
--model TEXT Model to use.
--model TEXT Model to use, e.g. openai/gpt-4-turbo,
anthropic/claude-3-5-sonnet-20240620. If
only provider is given, the default model
for that provider is used.
--stream / --no-stream Stream responses
-v, --verbose Verbose output.
-y, --no-confirm Skips all confirmation prompts.
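The single `--model` flag now carries both provider and model. A minimal sketch of how such a value could be split, assuming a fallback table of per-provider defaults (the function and table below are hypothetical, not gptme's actual code):

```python
# Hypothetical sketch -- not gptme's actual implementation.
DEFAULT_MODELS = {
    "openai": "gpt-4o",
    "anthropic": "claude-3-5-sonnet-20240620",
}


def parse_model_flag(value: str) -> tuple[str, str]:
    """Split '<provider>/<model>' into (provider, model).

    If only the provider is given, fall back to its default model.
    The model part may itself contain slashes (e.g. OpenRouter model ids).
    """
    provider, _, model = value.partition("/")
    return provider, model or DEFAULT_MODELS[provider]


assert parse_model_flag("anthropic") == ("anthropic", "claude-3-5-sonnet-20240620")
assert parse_model_flag("openrouter/meta-llama/llama-3.1-70b-instruct") == (
    "openrouter",
    "meta-llama/llama-3.1-70b-instruct",
)
```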
52 changes: 18 additions & 34 deletions docs/providers.md
@@ -3,6 +3,16 @@ Providers

We support several LLM providers, including OpenAI, Anthropic, Azure, and any OpenAI-compatible server (e.g. `ollama`, `llama-cpp-python`).

To select a provider and model, run `gptme` with the `--model` flag set to `<provider>/<model>`, for example:

```sh
gptme --model openai/gpt-4o "hello"
gptme --model anthropic "hello" # if model part unspecified, will fall back to the provider default
gptme --model openrouter/meta-llama/llama-3.1-70b-instruct "hello"
```

On first startup, if `--model` is not set and no API key is found in the config or environment, you will be prompted for one. The provider is then auto-detected from the key, and the key is saved in the configuration file.
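The auto-detection step can be pictured as a prefix check on the pasted key. A rough sketch, using the prefixes these providers commonly issue (gptme's real detection logic may differ):

```python
# Illustrative only -- gptme's actual detection may differ.
def detect_provider(api_key: str) -> str:
    """Guess the provider from the shape of an API key."""
    if api_key.startswith("sk-ant-"):  # Anthropic keys
        return "anthropic"
    if api_key.startswith("sk-or-"):   # OpenRouter keys
        return "openrouter"
    if api_key.startswith("sk-"):      # check the generic OpenAI prefix last
        return "openai"
    raise ValueError("could not detect provider from the given key")
```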

## OpenAI

To use OpenAI, set your API key:
@@ -11,8 +21,6 @@ To use OpenAI, set your API key:
export OPENAI_API_KEY="your-api-key"
```

If no key is set, it will be prompted for and saved in the configuration file.

## Anthropic

To use Anthropic, set your API key:
@@ -21,11 +29,17 @@ To use Anthropic, set your API key:
export ANTHROPIC_API_KEY="your-api-key"
```

If no key is set, it will be prompted for and saved in the configuration file.
## OpenRouter

To use OpenRouter, set your API key:

```sh
export OPENROUTER_API_KEY="your-api-key"
```
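OpenRouter exposes an OpenAI-compatible API, so outside of gptme it can also be reached with the standard `openai` Python client pointed at OpenRouter's base URL. A minimal sketch (the model id is just an example, and `openai>=1.0` is assumed):

```python
import os

from openai import OpenAI  # assumes the openai Python package, v1+

# OpenRouter speaks the OpenAI chat-completions protocol at this base URL.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)
resp = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "hello"}],
)
print(resp.choices[0].message.content)
```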

## Local

There are several ways to run local LLM models in a way that exposes an OpenAI API-compatible server, here we will cover two:
There are several ways to run local LLM models in a way that exposes an OpenAI API-compatible server, here we will cover:

### ollama + litellm

@@ -39,33 +53,3 @@ ollama serve
litellm --model ollama/mistral
export OPENAI_API_BASE="http://localhost:8000"
```
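Before pointing gptme at the proxy, it can be worth a quick check that the OpenAI-compatible endpoint responds. A small sketch, assuming the litellm proxy started above on port 8000 (litellm does not require an API key by default, but the client wants a non-empty string):

```python
import os

from openai import OpenAI  # assumes the openai Python package, v1+

client = OpenAI(
    base_url=os.environ.get("OPENAI_API_BASE", "http://localhost:8000"),
    api_key="not-needed-locally",  # placeholder; the local proxy ignores it
)

# List the models the proxy exposes to confirm it is reachable.
for model in client.models.list():
    print(model.id)
```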

### llama-cpp-python

Here's how to use `llama-cpp-python`.

You first need to install and run the [llama-cpp-python][llama-cpp-python] server. To ensure you get the most out of your hardware, make sure you build it with [the appropriate hardware acceleration][hwaccel]. For macOS, you can find detailed instructions [here][metal].

```sh
MODEL=~/ML/wizardcoder-python-13b-v1.0.Q4_K_M.gguf
poetry run python -m llama_cpp.server --model $MODEL --n_gpu_layers 1 # Use `--n_gpu_layers 1` if you have an M1/M2 chip
export OPENAI_API_BASE="http://localhost:8000/v1"
```

### Usage

Now, simply run `gptme` with the `--llm` flag set to `local`:

```sh
gptme --llm local "hello"
```

### How well does it work?

I've had mixed results. They are not nearly as good as GPT-4 and often struggle with the tools laid out in the system prompt. However, I haven't tested with models larger than 7B/13B.

I'm hoping future models, trained better for tool-use and interactive coding (where outputs are fed back), can remedy this, even at 7B/13B model sizes. Perhaps we can fine-tune a model on (GPT-4) conversation logs to create a purpose-fit model that knows how to use the tools.

[llama-cpp-python]: https://github.com/abetlen/llama-cpp-python
[hwaccel]: https://github.com/abetlen/llama-cpp-python#installation-with-hardware-acceleration
[metal]: https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md
4 changes: 1 addition & 3 deletions eval/agents.py
@@ -10,8 +10,7 @@


class Agent:
def __init__(self, llm: str, model: str):
self.llm = llm
def __init__(self, model: str):
self.model = model

@abstractmethod
@@ -42,7 +41,6 @@ def act(self, files: Files | None, prompt: str):
[Message("user", prompt)],
[prompt_sys],
f"gptme-evals-{store.id}",
llm=self.llm,
model=self.model,
no_confirm=True,
interactive=False,
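With the `llm` parameter gone, an agent is constructed from the combined model string alone. A toy subclass to illustrate the new constructor shape (everything except `Agent` itself is invented for this example, and the import path is assumed):

```python
from eval.agents import Agent  # import path assumed; base class changed in this commit


class EchoAgent(Agent):
    """Toy agent that just echoes the prompt; for illustration only."""

    def act(self, files, prompt: str):
        print(f"[{self.model}] {prompt}")
        return files


# Previously: EchoAgent(llm="openai", model="gpt-4o")
agent = EchoAgent(model="openai/gpt-4o")
agent.act(None, "hello")
```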
56 changes: 43 additions & 13 deletions eval/evals.py
@@ -3,16 +3,48 @@
if TYPE_CHECKING:
from main import ExecTest


def correct_output_hello(ctx):
return ctx.stdout == "Hello, human!\n"


def correct_file_hello(ctx):
return ctx.files["hello.py"].strip() == "print('Hello, human!')"


def check_prime_output(ctx):
return "541" in ctx.stdout.split()


def check_clean_exit(ctx):
return ctx.exit_code == 0


def check_clean_working_tree(ctx):
return "nothing to commit, working tree clean" in ctx.stdout


def check_main_py_exists(ctx):
return "main.py" in ctx.files


def check_commit_exists(ctx):
return "No commits yet" not in ctx.stdout


def check_output_hello_ask(ctx):
return "Hello, Erik!" in ctx.stdout


tests: list["ExecTest"] = [
{
"name": "hello",
"files": {"hello.py": "print('Hello, world!')"},
"run": "python hello.py",
"prompt": "Change the code in hello.py to print 'Hello, human!'",
"expect": {
"correct output": lambda ctx: ctx.stdout == "Hello, human!\n",
"correct file": lambda ctx: ctx.files["hello.py"].strip()
== "print('Hello, human!')",
"correct output": correct_output_hello,
"correct file": correct_file_hello,
},
},
{
@@ -21,9 +53,8 @@
"run": "python hello.py",
"prompt": "Patch the code in hello.py to print 'Hello, human!'",
"expect": {
"correct output": lambda ctx: ctx.stdout == "Hello, human!\n",
"correct file": lambda ctx: ctx.files["hello.py"].strip()
== "print('Hello, human!')",
"correct output": correct_output_hello,
"correct file": correct_file_hello,
},
},
{
@@ -33,7 +64,7 @@
# TODO: work around the "don't try to execute it" part by improving gptme such that it just gives EOF to stdin in non-interactive mode
"prompt": "modify hello.py to ask the user for their name and print 'Hello, <name>!'. don't try to execute it",
"expect": {
"correct output": lambda ctx: "Hello, Erik!" in ctx.stdout,
"correct output": check_output_hello_ask,
},
},
{
@@ -42,7 +73,7 @@
"run": "python prime.py",
"prompt": "write a script prime.py that computes and prints the 100th prime number",
"expect": {
"correct output": lambda ctx: "541" in ctx.stdout.split(),
"correct output": check_prime_output,
},
},
{
@@ -51,11 +82,10 @@
"run": "git status",
"prompt": "initialize a git repository, write a main.py file, and commit it",
"expect": {
"clean exit": lambda ctx: ctx.exit_code == 0,
"clean working tree": lambda ctx: "nothing to commit, working tree clean"
in ctx.stdout,
"main.py exists": lambda ctx: "main.py" in ctx.files,
"we have a commit": lambda ctx: "No commits yet" not in ctx.stdout,
"clean exit": check_clean_exit,
"clean working tree": check_clean_working_tree,
"main.py exists": check_main_py_exists,
"we have a commit": check_commit_exists,
},
},
# Fails, gets stuck on interactive stuff
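Named check functions, unlike lambdas, can be pickled, which matters if the eval harness runs tests in subprocesses; they also give readable names in reports. A rough sketch of how a runner could apply a test's `expect` dict to an execution context, reusing `correct_output_hello` and `correct_file_hello` from the file above (the real runner lives in `eval/main.py` and is not shown here, so the context class and runner names are assumptions):

```python
from dataclasses import dataclass, field


@dataclass
class ExecContext:
    """Minimal stand-in for the ctx object the checks receive."""

    stdout: str = ""
    exit_code: int = 0
    files: dict[str, str] = field(default_factory=dict)


def run_checks(test: dict, ctx: ExecContext) -> dict[str, bool]:
    """Apply every named check in a test's `expect` dict to the context."""
    return {name: bool(check(ctx)) for name, check in test["expect"].items()}


example = {
    "name": "hello",
    "expect": {
        "correct output": correct_output_hello,  # defined in eval/evals.py above
        "correct file": correct_file_hello,
    },
}
ctx = ExecContext(stdout="Hello, human!\n", files={"hello.py": "print('Hello, human!')"})
print(run_checks(example, ctx))  # -> {'correct output': True, 'correct file': True}
```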
