docs: add "live" eval result output to docs
ErikBjare committed Sep 28, 2024
1 parent 9481c91 commit 03f8972
Showing 3 changed files with 15 additions and 14 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -4,7 +4,7 @@

 projects
 demos
-eval_results
+eval[_-]results*

 # logs
 *.log
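The new pattern `eval[_-]results*` is broader than the old `eval_results` entry: it also covers names such as `eval-results`, which the Makefile change in this commit refers to. The sketch below approximates the match with Python's `fnmatch`, which treats `[...]` and `*` the same way for simple names like these. It is only an illustration, not how git evaluates `.gitignore`, and the `is_ignored` helper and candidate names are made up for the example.

```python
from fnmatch import fnmatchcase

PATTERN = "eval[_-]results*"  # the pattern added to .gitignore in this commit


def is_ignored(name: str) -> bool:
    """Approximate gitignore matching for a single file or directory name.

    Hypothetical helper: fnmatch is a stand-in for git's own matcher.
    """
    return fnmatchcase(name, PATTERN)


# Illustrative names only; they are not taken from the repository.
candidates = [
    "eval_results",
    "eval-results",
    "eval_results.csv",
    "evalresults",
    "my_eval_results",
]
matches = {name: is_ignored(name) for name in candidates}
```

Under this approximation the first three names match, while `evalresults` (no separator) and `my_eval_results` (extra prefix) do not.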
8 changes: 8 additions & 0 deletions Makefile
@@ -60,6 +60,14 @@ docs/.clean: docs/conf.py
 	touch docs/.clean

 docs: docs/conf.py docs/*.rst docs/.clean
+	if [ ! -e eval_results ]; then \
+		if [ -e eval-results/eval_results ]; then \
+			ln -s eval-results/eval_results .; \
+		else \
+			git fetch origin eval-results; \
+			git checkout origin/eval-results -- eval_results; \
+		fi \
+	fi
 	poetry run make -C docs html SPHINXOPTS="-W --keep-going"

 .PHONY: site
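The shell block added to the `docs` target picks one of three paths before building: do nothing if `eval_results` already exists, symlink it from a local `eval-results` directory if that is present, or otherwise fetch it from the `eval-results` branch. A minimal Python sketch of that decision follows. The function name `plan_eval_results` and its return labels are hypothetical and exist only for illustration; the function reports which branch would run and does not execute any of the commands.

```python
from pathlib import Path


def plan_eval_results(root: Path) -> str:
    """Report which branch of the Makefile snippet would run (hypothetical helper).

    Path.exists() follows symlinks, like `[ -e ... ]` in the shell.
    """
    if (root / "eval_results").exists():
        # Outer `if [ ! -e eval_results ]` is false: nothing to do.
        return "nothing"
    if (root / "eval-results" / "eval_results").exists():
        # Corresponds to: ln -s eval-results/eval_results .
        return "symlink"
    # Corresponds to: git fetch origin eval-results;
    #                 git checkout origin/eval-results -- eval_results
    return "fetch"
```

For a fresh clone with neither directory present, `plan_eval_results(Path("."))` returns `"fetch"`, which corresponds to the `git fetch` / `git checkout` branch of the recipe.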
19 changes: 6 additions & 13 deletions docs/evals.rst
@@ -28,21 +28,14 @@ However, we recommend running it in Docker to improve isolation and reproducibil
     gptme-eval hello --model openai/gpt-4o
-Example run
------------
-
-Here's the output from a run of the eval suite:
-
-.. code-block::
+Results
+-------

-    $ gptme-eval eval_results/20240917_172916/eval_results.csv
-    === Model Comparison ===
-    Model                                 init-git    init-rust    hello      hello-patch    hello-ask    init-react    prime100
-    ------------------------------------  ----------  -----------  ---------  -------------  -----------  ------------  ----------
-    openai/gpt-4o                         ✅ 7.74s    ✅ 9.62s     ✅ 5.02s   ✅ 5.06s       ✅ 4.69s     ❌ timeout    ✅ 7.48s
-    openai/o1-mini                        ✅ 18.44s   ✅ 21.63s    ✅ 21.20s  ✅ 27.39s      ❌ timeout   ❌ 42.65s     ✅ 17.99s
-    anthropic/claude-3-5-sonnet-20240620  ❌ timeout  ❌ timeout   ✅ 8.77s   ✅ 7.09s       ✅ 8.08s     ❌ timeout    ✅ 11.26s
+Here are the results of the evals we have run so far:

+.. command-output:: gptme-eval eval_results/*/eval_results.csv
+   :cwd: ..
+   :shell:

 We are working on making the evals more robust, informative, and challenging.

