docs: added basic docs for evals
ErikBjare committed Aug 23, 2024
1 parent e30bd08 commit 5a0c47e
Showing 2 changed files with 40 additions and 0 deletions.
39 changes: 39 additions & 0 deletions docs/evals.rst
@@ -0,0 +1,39 @@
Evals
=====

gptme provides LLMs with a wide variety of tools, but how well do models make use of them? Which tasks can they complete, and which ones do they struggle with? How far can they get on their own, without any human intervention?

To answer these questions, we have created an evaluation suite that tests the capabilities of LLMs on a wide variety of tasks.

.. note::

   The evaluation suite is still under development, but the eval harness is mostly complete.

You can run the simple ``hello`` eval with gpt-4o like this:

.. code-block:: bash

   gptme-eval hello --model openai/gpt-4o

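To compare how different models fare on the same eval, the command can be looped over several model identifiers. A minimal sketch (the Anthropic model name below is an assumption, not taken from the docs; the ``echo`` makes this a dry run that only prints the commands):

```shell
# Dry-run sketch: print a gptme-eval command per model.
# Model names other than openai/gpt-4o are assumptions.
for model in openai/gpt-4o anthropic/claude-3-5-sonnet-20240620; do
    echo gptme-eval hello --model "$model"  # drop 'echo' to actually run
done
```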
However, we recommend running it in Docker to improve isolation and reproducibility:

.. code-block:: bash

   make build-docker
   docker run \
       -e "OPENAI_API_KEY=<your api key>" \
       -v $(pwd)/eval_results:/app/gptme/eval_results \
       gptme --timeout 60 $@

Example run
-----------

Here's the output from a run of the eval suite: TODO


Other evals
-----------

We have considered running gptme on other evals, such as SWE-Bench, but have not yet done so.

If you are interested in running gptme on other evals, drop a comment in the issues!
1 change: 1 addition & 0 deletions docs/index.rst
@@ -28,6 +28,7 @@ See the `README <https://github.com/ErikBjare/gptme/blob/master/README.md>`_ fil
tools
providers
webui
evals
finetuning
cli
api
