From 5a0c47e7f8e3032866ae051459d968241a87943e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Erik=20Bj=C3=A4reholt?=
Date: Fri, 23 Aug 2024 18:37:51 +0200
Subject: [PATCH] docs: added basic docs for evals

---
 docs/evals.rst | 39 +++++++++++++++++++++++++++++++++++++++
 docs/index.rst |  1 +
 2 files changed, 40 insertions(+)
 create mode 100644 docs/evals.rst

diff --git a/docs/evals.rst b/docs/evals.rst
new file mode 100644
index 00000000..20938cd1
--- /dev/null
+++ b/docs/evals.rst
@@ -0,0 +1,39 @@
+Evals
+=====
+
+gptme provides LLMs with a wide variety of tools, but how well do models make use of them? Which tasks can they complete, and which ones do they struggle with? How far can they get on their own, without any human intervention?
+
+To answer these questions, we have created an evaluation suite that tests the capabilities of LLMs on a wide variety of tasks.
+
+.. note::
+    The evaluation suite is still under development, but the eval harness is mostly complete.
+
+You can run the simple ``hello`` eval with gpt-4o like this:
+
+.. code-block:: bash
+
+    gptme-eval hello --model openai/gpt-4o
+
+However, we recommend running it in Docker to improve isolation and reproducibility:
+
+.. code-block:: bash
+
+    make build-docker
+    docker run \
+        -e "OPENAI_API_KEY=" \
+        -v $(pwd)/eval_results:/app/gptme/eval_results \
+        gptme --timeout 60 $@
+
+
+Example run
+-----------
+
+Here's the output from a run of the eval suite: TODO
+
+
+Other evals
+-----------
+
+We have considered running gptme on other evals, such as SWE-Bench, but have not yet done so.
+
+If you are interested in running gptme on other evals, drop a comment in the issues!
diff --git a/docs/index.rst b/docs/index.rst
index e6d7a6a9..bb7ef344 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -28,6 +28,7 @@ See the `README `_ fil
    tools
    providers
    webui
+   evals
    finetuning
    cli
    api
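
A note on the Docker invocation in the patch above: the trailing ``$@`` only expands to arguments when it runs inside a shell script, so the snippet is presumably meant to be saved as a small wrapper and called with the same arguments you would otherwise pass to ``gptme-eval``. Below is a minimal sketch of such a wrapper, assuming the image built by ``make build-docker`` is tagged ``gptme`` as in the snippet; the script name and the API-key pass-through are illustrative assumptions, not part of this patch:

.. code-block:: bash

    #!/usr/bin/env bash
    # eval-docker.sh -- hypothetical wrapper around the dockerized eval harness.
    # Usage: ./eval-docker.sh hello --model openai/gpt-4o
    set -euo pipefail

    # Forward the host's API key into the container (assumes it is set in your
    # shell), and mount eval_results so results written inside the container
    # persist on the host.
    docker run \
        -e "OPENAI_API_KEY=$OPENAI_API_KEY" \
        -v "$(pwd)/eval_results:/app/gptme/eval_results" \
        gptme --timeout 60 "$@"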