Showing 2 changed files with 40 additions and 0 deletions.
@@ -0,0 +1,39 @@
Evals
=====

gptme provides LLMs with a wide variety of tools, but how well do models make use of them? Which tasks can they complete, and which ones do they struggle with? How far can they get on their own, without any human intervention?

To answer these questions, we have created an evaluation suite that tests the capabilities of LLMs on a wide variety of tasks.

.. note::
    The evaluation suite is still under development, but the eval harness is mostly complete.

You can run the simple ``hello`` eval with gpt-4o like this:

.. code-block:: bash

    gptme-eval hello --model openai/gpt-4o

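To compare several models on the same eval, the command above can be wrapped in a small loop. The sketch below is one way to do that and is not part of gptme itself: ``run_eval_for_models`` and the ``EVAL_CMD`` override are hypothetical names introduced here, and only ``openai/gpt-4o`` is taken from the example above.

```shell
# Sketch only: run one eval against each model passed as an argument.
# run_eval_for_models and EVAL_CMD are hypothetical names, not part of gptme.
# Usage: run_eval_for_models <eval-name> <model> [<model> ...]
run_eval_for_models() (
    eval_name="$1"
    shift
    # EVAL_CMD defaults to the real CLI; set EVAL_CMD=echo to print the
    # commands instead of running them.
    eval_cmd="${EVAL_CMD:-gptme-eval}"
    failures=0
    for model in "$@"; do
        echo "== ${eval_name} on ${model} =="
        if ! "$eval_cmd" "$eval_name" --model "$model"; then
            echo "eval failed for ${model}" >&2
            failures=$((failures + 1))
        fi
    done
    # The exit status is non-zero if any run failed.
    [ "$failures" -eq 0 ]
)
```

For example, ``run_eval_for_models hello openai/gpt-4o`` issues the same command as above; further model names can be appended as extra arguments.
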
However, we recommend running it in Docker to improve isolation and reproducibility:

.. code-block:: bash

    # Build the Docker image
    make build-docker

    # Mount ./eval_results into the container so results persist on the host.
    # "$@" forwards any extra arguments when this is run from a script.
    docker run \
        -e "OPENAI_API_KEY=<your api key>" \
        -v "$(pwd)/eval_results:/app/gptme/eval_results" \
        gptme --timeout 60 "$@"

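The ``$@`` in the command above expands to the arguments of the enclosing script or function, so the command is presumably meant to be wrapped. A minimal sketch of such a wrapper follows. ``run_eval_docker``, ``DOCKER`` and ``RESULTS_DIR`` are hypothetical names, while the image name, mount target and ``--timeout 60`` are copied from the command above. As a variation, it forwards ``OPENAI_API_KEY`` from the host environment with ``-e OPENAI_API_KEY`` instead of writing the key on the command line.

```shell
# Sketch only: wrap the docker invocation above in a reusable function.
# run_eval_docker, DOCKER and RESULTS_DIR are hypothetical names.
run_eval_docker() (
    # DOCKER defaults to the real binary; set DOCKER=echo for a dry run
    # that prints the arguments instead of starting a container.
    docker_bin="${DOCKER:-docker}"
    # Host directory that is mounted into the container for results.
    results_dir="${RESULTS_DIR:-$(pwd)/eval_results}"
    # "-e OPENAI_API_KEY" without a value forwards the variable from the host.
    "$docker_bin" run \
        -e OPENAI_API_KEY \
        -v "${results_dir}:/app/gptme/eval_results" \
        gptme --timeout 60 "$@"
)
```

Assuming the image's entrypoint accepts the same arguments as ``gptme-eval`` (this is not stated above), a call might look like ``run_eval_docker hello --model openai/gpt-4o``.
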
Example run
-----------

Here's the output from a run of the eval suite: TODO

Other evals
-----------

We have considered running gptme on other evals, such as SWE-bench, but have not yet done so.

If you are interested in running gptme on other evals, leave a comment in the issue tracker!