- 2025/1/23: Added gpt-4o, Qwen2.5-72B-Instruct, Qwen2.5-7B-Instruct, Qwen2-1.5B-Instruct, Qwen2-0.5B-Instruct, Llama-3.3-70B-Instruct, Llama-3.1-8B-Instruct, and Internllm2_5-7B to the leaderboard.
- 2025/1/07: The Open Agent Leaderboard is released.
This project aims to provide a fair comparison of various agents by evaluating their performance on different datasets and LLMs. Built on top of the OmAgent framework, it allows for simple, quick, and accurate assessments of agents.
Supported benchmark datasets:
- gsm8k
- AQuA
Supported algorithms:
- IO: Input-Output Direct Prompting (Baseline)
- CoT: Chain-of-thought prompting elicits reasoning in large language models; Large language models are zero-shot reasoners
- SC-CoT: Self-Consistency Improves Chain of Thought Reasoning in Language Models
- PoT: Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks
- ReAct: ReAct: Synergizing Reasoning and Acting in Language Models
Supported LLMs:
- gpt-3.5-turbo
- gpt-4o
- Doubao-lite-32k
- Qwen2.5-72B-Instruct
- Qwen2.5-7B-Instruct
- Qwen2-1.5B-Instruct
- Qwen2-0.5B-Instruct
- Llama-3.3-70B-Instruct
- Llama-3.1-8B-Instruct
- Internllm2_5-7B
Math tasks
Rank | Algorithm | LLM | Eval Date | Avg Score | gsm8k-Score | gsm8k-Cost($) | AQuA-Score | AQuA-Cost($) |
---|---|---|---|---|---|---|---|---|
1 | CoT | Qwen2.5-72B-Instruct | 2025/1/22 | 89.55 | 92.87 | 0.7195 | 86.22 | 0.0808 |
2 | SC-CoT | Qwen2.5-72B-Instruct | 2025/1/22 | 89.45 | 93.86 | 5.9858 | 85.04 | 1.0348 |
3 | CoT | Llama-3.3-70B-Instruct | 2025/1/22 | 88.7 | 93.93 | 0.687 | 83.46 | 0.0927 |
4 | SC-CoT | Llama-3.3-70B-Instruct | 2025/1/22 | 88.68 | 95.07 | 6.2005 | 82.28 | 1.0756 |
5 | SC-CoT | gpt-4o | 2025/1/22 | 88.46 | 90.3 | 31.0542 | 86.61 | 8.1485 |
6 | CoT | gpt-4o | 2025/1/22 | 88.39 | 94.09 | 4.5367 | 82.68 | 1.0417 |
7 | IO | Llama-3.3-70B-Instruct | 2025/1/22 | 87.48 | 92.27 | 0.4709 | 82.68 | 0.0798 |
8 | CoT | Doubao-lite-32k | 2025/1/7 | 86 | 89.31 | 0.0558 | 82.68 | 0.0066 |
9 | SC-CoT | Qwen2.5-7B-Instruct | 2025/1/22 | 85.53 | 91.13 | 0 | 79.92 | 0 |
10 | IO | Qwen2.5-72B-Instruct | 2025/1/22 | 85.42 | 86.58 | 0.4899 | 84.25 | 0.0742 |
11 | SC-CoT | Doubao-lite-32k | 2025/1/7 | 84.18 | 87.26 | 0.2083 | 81.1 | 0.0519 |
12 | PoT | gpt-4o | 2025/1/22 | 84.15 | 93.1 | 4.2166 | 75.2 | 1.6087 |
13 | PoT | Qwen2.5-72B-Instruct | 2025/1/22 | 83.77 | 92.34 | 0.7054 | 75.2 | 0.1645 |
14 | ReAct-Pro* | Llama-3.3-70B-Instruct | 2025/1/22 | 83.39 | 87.64 | 10.1124 | 79.13 | 0.768 |
15 | CoT | Qwen2.5-7B-Instruct | 2025/1/22 | 83.19 | 85.67 | 0 | 80.71 | 0 |
16 | IO | gpt-4o | 2025/1/22 | 82 | 88.4 | 3.3463 | 75.59 | 1.1453 |
17 | ReAct-Pro* | Doubao-lite-32k | 2025/1/7 | 81.58 | 85.6 | 0.2512 | 77.56 | 0.0445 |
18 | ReAct-Pro* | Qwen2.5-72B-Instruct | 2025/1/22 | 80.25 | 87.26 | 10.5479 | 73.23 | 0.3177 |
19 | ReAct-Pro* | Qwen2.5-7B-Instruct | 2025/1/22 | 78.64 | 82.87 | 0 | 74.41 | 0 |
20 | PoT | Llama-3.3-70B-Instruct | 2025/1/22 | 76.31 | 73.09 | 0.9736 | 79.53 | 0.1746 |
21 | PoT | Doubao-lite-32k | 2025/1/7 | 75.63 | 79.61 | 0.0576 | 71.65 | 0.0147 |
22 | IO | Doubao-lite-32k | 2025/1/7 | 75.58 | 72.02 | 0.0354 | 79.13 | 0.0058 |
23 | SC-CoT | gpt-3.5-turbo | 2025/1/7 | 73.03 | 79.91 | 3.3938 | 66.14 | 0.7888 |
24 | CoT | gpt-3.5-turbo | 2025/1/7 | 69.86 | 78.7 | 0.6788 | 61.02 | 0.0957 |
25 | ReAct-Pro* | gpt-3.5-turbo | 2025/1/7 | 69.74 | 74.91 | 3.4633 | 64.57 | 0.4928 |
26 | PoT | gpt-3.5-turbo | 2025/1/7 | 68.17 | 76.88 | 0.6902 | 59.45 | 0.1748 |
27 | CoT | Llama-3.1-8B-Instruct | 2025/1/22 | 68.04 | 75.44 | 0 | 60.63 | 0 |
28 | IO | Qwen2.5-7B-Instruct | 2025/1/22 | 67.99 | 57.24 | 0 | 78.74 | 0 |
29 | SC-CoT | Llama-3.1-8B-Instruct | 2025/1/22 | 66.46 | 73.46 | 0 | 59.45 | 0 |
30 | CoT | Internllm2_5-7B | 2025/1/22 | 65.24 | 77.71 | 0 | 52.76 | 0 |
31 | PoT | Qwen2.5-7B-Instruct | 2025/1/22 | 63.47 | 58.83 | 0 | 68.11 | 0 |
32 | ReAct-Pro* | Llama-3.1-8B-Instruct | 2025/1/22 | 61.65 | 67.78 | 0 | 55.51 | 0 |
33 | ReAct-Pro* | gpt-4o | 2025/1/22 | 60.4 | 63.31 | 39.0751 | 57.48 | 2.304 |
34 | IO | Llama-3.1-8B-Instruct | 2025/1/22 | 54.17 | 57.16 | 0 | 51.18 | 0 |
35 | CoT | Qwen2-1.5B-Instruct | 2025/1/22 | 48.03 | 55.5 | 0 | 40.55 | 0 |
36 | SC-CoT | Internllm2_5-7B | 2025/1/22 | 43.8 | 48.22 | 0 | 39.37 | 0 |
37 | IO | gpt-3.5-turbo | 2025/1/7 | 38.41 | 37.83 | 0.3328 | 38.98 | 0.038 |
38 | PoT | Llama-3.1-8B-Instruct | 2025/1/22 | 37.64 | 38.67 | 0 | 36.61 | 0 |
39 | PoT | Internllm2_5-7B | 2025/1/22 | 37.41 | 38.21 | 0 | 36.61 | 0 |
40 | ReAct-Pro* | Internllm2_5-7B | 2025/1/22 | 37.23 | 33.51 | 0 | 40.94 | 0 |
41 | CoT | Qwen2-0.5B-Instruct | 2025/1/22 | 34.51 | 35.94 | 0 | 33.07 | 0 |
42 | IO | Internllm2_5-7B | 2025/1/22 | 29.62 | 11.6 | 0 | 47.64 | 0 |
43 | ReAct-Pro* | Qwen2-1.5B-Instruct | 2025/1/22 | 25.23 | 24.87 | 0 | 25.59 | 0 |
44 | PoT | Qwen2-1.5B-Instruct | 2025/1/22 | 24.61 | 18.5 | 0 | 30.71 | 0 |
45 | IO | Qwen2-1.5B-Instruct | 2025/1/22 | 22.91 | 16.68 | 0 | 29.13 | 0 |
46 | IO | Qwen2-0.5B-Instruct | 2025/1/22 | 20.94 | 14.71 | 0 | 27.17 | 0 |
47 | SC-CoT | Qwen2-1.5B-Instruct | 2025/1/22 | 17.69 | 11.75 | 0 | 23.62 | 0 |
48 | ReAct-Pro* | Qwen2-0.5B-Instruct | 2025/1/22 | 15.84 | 7.66 | 0 | 24.02 | 0 |
49 | PoT | Qwen2-0.5B-Instruct | 2025/1/22 | 13.47 | 9.62 | 0 | 17.32 | 0 |
50 | SC-CoT | Qwen2-0.5B-Instruct | 2025/1/22 | 12.25 | 1.67 | 0 | 22.83 | 0 |
Evaluation details can be found in the Evaluation Details section and the Hugging Face leaderboard.
- IO (Input-Output) is the baseline method that directly prompts the model with the question and expects an answer without any intermediate reasoning steps. It represents the most basic way of using language models and serves as a reference point for evaluating the effectiveness of other algorithms (see the prompt sketch after this list).
- ReAct-Pro*: We modified ReAct to ReAct-Pro, following the Reflexion repository. A comparison with the original ReAct repo can be found in the Compare to ReAct section.
- Clone the repository:
git clone https://github.com/om-ai-lab/open-agent-leaderboard.git
cd open-agent-leaderboard
- Install dependencies:
pip install -r requirements.txt
Step 1. Implement your agent in the OmAgent repository
Navigate to the agent repository:
git clone https://github.com/om-ai-lab/OmAgent.git
cd OmAgent
Set up the environment:
pip install -e omagent-core
Implement your agent in the OmAgent repository; check the examples/cot folder for a reference implementation.
Run the inference script (cot as an example):
cd examples/cot
python eval_demo.py --model_id your_model_id --dataset_name your_dataset_name --dataset_path your_dataset_path --output_path your_output_path --output_name your_output_name --cot_method your_cot_method
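For example, a CoT run on gsm8k with gpt-3.5-turbo might look like the following (the dataset path, output names, and cot_method value here are illustrative):

python eval_demo.py --model_id gpt-3.5-turbo --dataset_name gsm8k --dataset_path data/gsm8k.jsonl --output_path outputs/ --output_name gsm8k_results_cot --cot_method few_shot_cot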
The output results are saved in JSON format and include the following fields:
- `id`: The unique identifier of the sample.
- `question`: The input question provided to the model.
- `last_output`: The raw output generated by the model.
- `output_postprocess` (optional): The processed output after cleansing.
- `ground_truth` (optional): The correct answer for the sample.
- `prompt_tokens`: The number of tokens in the input prompt.
- `completion_tokens`: The number of tokens in the model's output.
Example of an output JSON file:
```json
{
  "dataset": "gsm8k",
  "model_id": "gpt-3.5-turbo",
  "alg": "COT",
  "model_result": [
    {
      "id": 1,
      "question": "Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today.....",
      "last_output": "There are 15 trees originally. After the workers are done there are 21 trees, so they planted 21 - 15 = 6 trees. The answer is 6.",
      "output_postprocess": "6",
      "ground_truth": "6",
      "prompt_tokens": 10,
      "completion_tokens": 5
    }
  ]
}
```
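Given a results file in this format, the headline numbers are easy to recompute. Below is a minimal sketch (ours, not the leaderboard's scoring code) that derives the score, the pass rate, and token totals from the fields documented above:

```python
import json

def summarize(path: str) -> None:
    """Recompute score, pass rate, and token totals from a results file."""
    with open(path) as f:
        results = json.load(f)["model_result"]

    # Score: exact match between the cleaned output and the ground truth.
    correct = sum(
        1 for r in results
        if r.get("ground_truth") is not None
        and r.get("output_postprocess") == r.get("ground_truth")
    )
    # Pass rate: a prediction is valid if it is neither empty nor null.
    valid = sum(1 for r in results if r.get("last_output"))
    prompt_tokens = sum(r["prompt_tokens"] for r in results)
    completion_tokens = sum(r["completion_tokens"] for r in results)

    print(f"score:     {100 * correct / len(results):.2f}")
    print(f"pass rate: {100 * valid / len(results):.2f}")
    print(f"tokens:    {prompt_tokens:,} input / {completion_tokens:,} output")

summarize("example/gsm8k_results_cot.json")
```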
Step 2. Run the main script to perform evaluations:
python main.py --dataset <dataset_name> --model <model_name> --method <method_name> --output_dir <output_directory>
Parameters:
- `--random_seed`: Random seed, default is 1.
- `--dataset`: Dataset to use, options are `aqua`, `gsm8k`.
- `--minibatch_size`: Minibatch size, default is 1.
- `--max_num_worker`: Maximum number of workers for the data loader, default is 4.
- `--model`: Model used for decoding, options are `gpt-4o-mini`, `gpt-4o`, `gpt-3.5-turbo`.
- `--method`: Method, options are `zero_shot`, `zero_shot_cot`, `few_shot`, `few_shot_cot`.
- `--cot_trigger_no`: Trigger sentence number for chain of thought, default is 1.
- `--max_length`: Maximum length of model output, default is 2048.
- `--max_length_direct`: Maximum length of direct model answer, default is 32.
- `--limit_dataset_size`: Whether to limit the test dataset size, default is 0 (no limit).
- `--output_dir`: Output directory, default is `./outputs/`.
- `--output_path`: Output path, default is empty.
- `--agent`: Agent used for the experiment, options are `cot`, `pot`, `sc_cot`, `react`.
- `--system_prompt`: System prompt, default is empty.
- `--openai_api_key`: OpenAI API key, default is empty.
- `--openai_url`: OpenAI API URL, default is `https://api.openai.com/v1`.
Example:
python main.py --output_path example/gsm8k_results_cot.json --dataset gsm8k --method few_shot_cot
Algorithm | Dataset | Eval Date | LLM | Score | Pass rate | X-shot | Parameters | Samples | Total input tokens | Average input tokens | Total output tokens | Average output tokens | All tokens | Cost($) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
IO | gsm8k | 2025/1/7 | gpt-3.5-turbo | 37.83 | 99.92 | 8 | - | 1319 | 546,990 | 415 | 39,563 | 30 | 586,553 | 0.3328
IO | gsm8k | 2025/1/7 | Doubao-lite-32k | 72.02 | 99.92 | 8 | - | 1319 | 617,377 | 468 | 123,106 | 93 | 740,483 | 0.0354
IO | gsm8k | 2025/1/22 | gpt-4o | 88.4 | 100 | 8 | - | 1319 | 542,416 | 411 | 199,030 | 151 | 741,446 | 3.3463
IO | gsm8k | 2025/1/22 | Qwen2.5-72B-Instruct | 86.58 | 100 | 8 | - | 1319 | 555,340 | 421 | 313,720 | 238 | 869,060 | 0.4899
IO | gsm8k | 2025/1/22 | Llama-3.3-70B-Instruct | 92.27 | 100 | 8 | - | 1319 | 583,916 | 443 | 251,359 | 191 | 835,275 | 0.4709
IO | gsm8k | 2025/1/22 | Qwen2.5-7B-Instruct | 57.24 | 100 | 8 | - | 1319 | 596,229 | 452 | 291,684 | 221 | 887,913 | 0.0000
IO | gsm8k | 2025/1/22 | Llama-3.1-8B-Instruct | 57.16 | 99.55 | 8 | - | 1319 | 550,941 | 418 | 1,194,488 | 906 | 1,745,429 | 0.0000
IO | gsm8k | 2025/1/22 | Internllm2_5-7B | 11.6 | 97.95 | 8 | - | 1319 | 679,302 | 515 | 434,426 | 329 | 1,113,728 | 0.0000
IO | gsm8k | 2025/1/22 | Qwen2-1.5B-Instruct | 16.68 | 100 | 8 | - | 1319 | 568,530 | 431 | 168,466 | 128 | 736,996 | 0.0000
IO | gsm8k | 2025/1/22 | Qwen2-0.5B-Instruct | 14.71 | 100 | 8 | - | 1319 | 568,116 | 431 | 266,781 | 202 | 834,897 | 0.0000
ReAct-Pro* | gsm8k | 2025/1/7 | gpt-3.5-turbo | 74.91 | 99.39 | 8 | max_steps=10 | 1319 | 6,506,164 | 4,933 | 140,122 | 106 | 6,646,286 | 3.4633 |
ReAct-Pro* | gsm8k | 2025/1/7 | Doubao-lite-32k | 85.6 | 99.62 | 8 | max_steps=10 | 1319 | 5,862,016 | 4,444 | 136,623 | 104 | 5,998,639 | 0.2512 |
ReAct-Pro* | gsm8k | 2025/1/22 | gpt-4o | 63.31 | 99.55 | 8 | max_steps=10 | 1319 | 14,411,173 | 10,926 | 304,714 | 231 | 14,715,887 | 39.0751 |
ReAct-Pro* | gsm8k | 2025/1/22 | Qwen2.5-72B-Instruct | 87.26 | 100 | 8 | max_steps=10 | 1319 | 18,160,983 | 13,769 | 549,454 | 417 | 18,710,437 | 10.5479 |
ReAct-Pro* | gsm8k | 2025/1/22 | Llama-3.3-70B-Instruct | 87.64 | 99.92 | 8 | max_steps=10 | 1319 | 17,038,928 | 12,918 | 898,936 | 682 | 17,937,864 | 10.1124 |
ReAct-Pro* | gsm8k | 2025/1/22 | Qwen2.5-7B-Instruct | 82.87 | 100 | 8 | max_steps=10 | 1319 | 14,355,752 | 10,884 | 495,162 | 375 | 14,850,914 | 0.0000 |
ReAct-Pro* | gsm8k | 2025/1/22 | Llama-3.1-8B-Instruct | 67.78 | 98.56 | 8 | max_steps=10 | 1319 | 21,044,978 | 15,955 | 1,790,789 | 1,358 | 22,835,767 | 0.0000 |
ReAct-Pro* | gsm8k | 2025/1/22 | Internllm2_5-7B | 33.51 | 97.95 | 8 | max_steps=10 | 1319 | 30,120,070 | 22,836 | 5,549,919 | 4,208 | 35,669,989 | 0.0000 |
ReAct-Pro* | gsm8k | 2025/1/22 | Qwen2-1.5B-Instruct | 24.87 | 80.21 | 8 | max_steps=10 | 1319 | 9,133,603 | 6,925 | 694,398 | 526 | 9,828,001 | 0.0000 |
ReAct-Pro* | gsm8k | 2025/1/22 | Qwen2-0.5B-Instruct | 7.66 | 95.22 | 8 | max_steps=10 | 1319 | 52,431,343 | 39,751 | 2,961,268 | 2,245 | 55,392,611 | 0.0000 |
PoT | gsm8k | 2025/1/7 | gpt-3.5-turbo | 76.88 | 99.24 | 8 | - | 1319 | 1,090,418 | 827 | 96,662 | 73 | 1,187,080 | 0.6902
PoT | gsm8k | 2025/1/7 | Doubao-lite-32k | 79.61 | 92.57 | 8 | - | 1319 | 1,170,038 | 887 | 118,017 | 89 | 1,288,055 | 0.0576
PoT | gsm8k | 2025/1/22 | gpt-4o | 93.1 | 99.77 | 8 | - | 1319 | 1,101,672 | 835 | 146,240 | 111 | 1,247,912 | 4.2166
PoT | gsm8k | 2025/1/22 | Qwen2.5-72B-Instruct | 92.34 | 99.39 | 8 | - | 1319 | 1,106,682 | 839 | 144,528 | 110 | 1,251,210 | 0.7054
PoT | gsm8k | 2025/1/22 | Llama-3.3-70B-Instruct | 73.09 | 79.61 | 8 | - | 1319 | 1,126,025 | 854 | 601,019 | 456 | 1,727,044 | 0.9736
PoT | gsm8k | 2025/1/22 | Qwen2.5-7B-Instruct | 58.83 | 70.51 | 8 | - | 1319 | 1,145,390 | 868 | 217,432 | 165 | 1,362,822 | 0.0000
PoT | gsm8k | 2025/1/22 | Llama-3.1-8B-Instruct | 38.67 | 55.42 | 8 | - | 1319 | 1,147,538 | 870 | 243,573 | 185 | 1,391,111 | 0.0000
PoT | gsm8k | 2025/1/22 | Internllm2_5-7B | 38.21 | 48.9 | 8 | - | 1319 | 1,136,843 | 862 | 188,106 | 143 | 1,324,949 | 0.0000
PoT | gsm8k | 2025/1/22 | Qwen2-1.5B-Instruct | 18.5 | 31.01 | 8 | - | 1319 | 1,151,528 | 873 | 175,994 | 133 | 1,327,522 | 0.0000
PoT | gsm8k | 2025/1/22 | Qwen2-0.5B-Instruct | 9.62 | 16.91 | 8 | - | 1319 | 1,151,528 | 873 | 237,607 | 180 | 1,389,135 | 0.0000
CoT | gsm8k | 2025/1/7 | gpt-3.5-turbo | 78.7 | 100 | 8 | - | 1319 | 953,242 | 723 | 134,799 | 102 | 1,088,041 | 0.6788
CoT | gsm8k | 2025/1/7 | Doubao-lite-32k | 89.31 | 100 | 8 | - | 1319 | 1,042,095 | 790 | 159,725 | 121 | 1,201,820 | 0.0558
CoT | gsm8k | 2025/1/22 | gpt-4o | 94.09 | 100 | 8 | - | 1319 | 948,668 | 719 | 216,498 | 164 | 1,165,166 | 4.5367
CoT | gsm8k | 2025/1/22 | Qwen2.5-72B-Instruct | 92.87 | 100 | 8 | - | 1319 | 1,005,119 | 762 | 271,133 | 206 | 1,276,252 | 0.7195
CoT | gsm8k | 2025/1/22 | Llama-3.3-70B-Instruct | 93.93 | 100 | 8 | - | 1319 | 990,168 | 751 | 228,497 | 173 | 1,218,665 | 0.6870
CoT | gsm8k | 2025/1/22 | Qwen2.5-7B-Instruct | 85.67 | 100 | 8 | - | 1319 | 1,046,008 | 793 | 244,797 | 186 | 1,290,805 | 0.0000
CoT | gsm8k | 2025/1/22 | Llama-3.1-8B-Instruct | 75.44 | 99.92 | 8 | - | 1319 | 990,168 | 751 | 258,161 | 196 | 1,248,329 | 0.0000
CoT | gsm8k | 2025/1/22 | Internllm2_5-7B | 77.71 | 99.7 | 8 | - | 1319 | 968,163 | 734 | 234,000 | 177 | 1,202,163 | 0.0000
CoT | gsm8k | 2025/1/22 | Qwen2-1.5B-Instruct | 55.5 | 100 | 8 | - | 1319 | 1,032,818 | 783 | 185,707 | 141 | 1,218,525 | 0.0000
CoT | gsm8k | 2025/1/22 | Qwen2-0.5B-Instruct | 35.94 | 99.92 | 8 | - | 1319 | 1,032,818 | 783 | 190,641 | 145 | 1,223,459 | 0.0000
SC-CoT | gsm8k | 2025/1/7 | gpt-3.5-turbo | 79.91 | 99.92 | 8 | temperature=1, path_num=5 | 1319 | 2,740,652 | 2,078 | 1,348,960 | 1,023 | 4,089,612 | 3.3938 |
SC-CoT | gsm8k | 2025/1/7 | Doubao-lite-32k | 87.26 | 99.92 | 8 | temperature=1, path_num=5 | 1319 | 2,691,714 | 2,041 | 1,197,099 | 908 | 3,888,813 | 0.2083 |
SC-CoT | gsm8k | 2025/1/22 | gpt-4o | 90.3 | 99.92 | 8 | temperature=1, path_num=5 | 1319 | 3,590,336 | 2,722 | 2,207,837 | 1,674 | 5,798,173 | 31.0542 |
SC-CoT | gsm8k | 2025/1/22 | Qwen2.5-72B-Instruct | 93.86 | 100 | 8 | temperature=1, path_num=5 | 1319 | 8,136,223 | 6,168 | 2,481,785 | 1,882 | 10,618,008 | 5.9858 |
SC-CoT | gsm8k | 2025/1/22 | Llama-3.3-70B-Instruct | 95.07 | 100 | 8 | temperature=1, path_num=5 | 1319 | 8,413,717 | 6,379 | 2,585,077 | 1,960 | 10,998,794 | 6.2005 |
SC-CoT | gsm8k | 2025/1/22 | Qwen2.5-7B-Instruct | 91.13 | 100 | 8 | temperature=1, path_num=5 | 1319 | 8,586,888 | 6,510 | 2,554,097 | 1,936 | 11,140,985 | 0.0000 |
SC-CoT | gsm8k | 2025/1/22 | Llama-3.1-8B-Instruct | 73.46 | 99.55 | 8 | temperature=1, path_num=5 | 1319 | 8,630,514 | 6,543 | 3,148,202 | 2,387 | 11,778,716 | 0.0000 |
SC-CoT | gsm8k | 2025/1/22 | Internllm2_5-7B | 48.22 | 98.41 | 8 | temperature=1, path_num=5 | 1319 | 10,678,792 | 8,096 | 3,847,639 | 2,917 | 14,526,431 | 0.0000 |
SC-CoT | gsm8k | 2025/1/22 | Qwen2-1.5B-Instruct | 11.75 | 91.89 | 8 | temperature=1, path_num=5 | 1319 | 9,066,115 | 6,873 | 3,345,827 | 2,537 | 12,411,942 | 0.0000 |
SC-CoT | gsm8k | 2025/1/22 | Qwen2-0.5B-Instruct | 1.67 | 94.69 | 8 | temperature=1, path_num=5 | 1319 | 11,019,864 | 8,355 | 5,445,856 | 4,129 | 16,465,720 | 0.0000 |
IO | AQuA | 2025/1/7 | gpt-3.5-turbo | 38.98 | 100 | 0 | - | 254 | 25,701 | 101 | 16,770 | 66 | 42,471 | 0.0380
IO | AQuA | 2025/1/7 | Doubao-lite-32k | 79.13 | 100 | 0 | - | 254 | 33,058 | 130 | 54,684 | 215 | 87,742 | 0.0058
IO | AQuA | 2025/1/22 | gpt-4o | 75.59 | 97.24 | 0 | - | 254 | 25,631 | 101 | 108,121 | 426 | 133,752 | 1.1453
IO | AQuA | 2025/1/22 | Qwen2.5-72B-Instruct | 84.25 | 99.61 | 0 | - | 254 | 25,397 | 100 | 106,207 | 418 | 131,604 | 0.0742
IO | AQuA | 2025/1/22 | Llama-3.3-70B-Instruct | 82.68 | 99.21 | 0 | - | 254 | 32,809 | 129 | 108,758 | 428 | 141,567 | 0.0798
IO | AQuA | 2025/1/22 | Qwen2.5-7B-Instruct | 78.74 | 98.43 | 0 | - | 254 | 33,271 | 131 | 104,500 | 411 | 137,771 | 0.0000
IO | AQuA | 2025/1/22 | Llama-3.1-8B-Instruct | 51.18 | 98.82 | 0 | - | 254 | 26,459 | 104 | 106,647 | 420 | 133,106 | 0.0000
IO | AQuA | 2025/1/22 | Internllm2_5-7B | 47.64 | 90.94 | 0 | - | 254 | 50,232 | 198 | 134,809 | 531 | 185,041 | 0.0000
IO | AQuA | 2025/1/22 | Qwen2-1.5B-Instruct | 29.13 | 97.64 | 0 | - | 254 | 27,937 | 110 | 43,110 | 170 | 71,047 | 0.0000
IO | AQuA | 2025/1/22 | Qwen2-0.5B-Instruct | 27.17 | 98.82 | 0 | - | 254 | 27,937 | 110 | 82,478 | 325 | 110,415 | 0.0000
CoT | AQuA | 2025/1/7 | gpt-3.5-turbo | 61.02 | 93.7 | 0 | - | 254 | 25,447 | 100 | 55,346 | 218 | 80,793 | 0.0957
CoT | AQuA | 2025/1/7 | Doubao-lite-32k | 82.68 | 97.24 | 0 | - | 254 | 27,978 | 110 | 66,599 | 262 | 94,577 | 0.0066
CoT | AQuA | 2025/1/22 | gpt-4o | 82.68 | 98.03 | 0 | - | 254 | 25,123 | 99 | 97,894 | 385 | 123,017 | 1.0417
CoT | AQuA | 2025/1/22 | Qwen2.5-72B-Instruct | 86.22 | 99.21 | 0 | - | 254 | 25,143 | 99 | 118,146 | 465 | 143,289 | 0.0808
CoT | AQuA | 2025/1/22 | Llama-3.3-70B-Instruct | 83.46 | 98.43 | 0 | - | 254 | 32,555 | 128 | 131,834 | 519 | 164,389 | 0.0927
CoT | AQuA | 2025/1/22 | Qwen2.5-7B-Instruct | 80.71 | 99.61 | 0 | - | 254 | 33,017 | 130 | 116,719 | 460 | 149,736 | 0.0000
CoT | AQuA | 2025/1/22 | Llama-3.1-8B-Instruct | 60.63 | 100 | 0 | - | 254 | 32,555 | 128 | 111,880 | 440 | 144,435 | 0.0000
CoT | AQuA | 2025/1/22 | Internllm2_5-7B | 52.76 | 89.37 | 0 | - | 254 | 26,610 | 105 | 100,910 | 397 | 127,520 | 0.0000
CoT | AQuA | 2025/1/22 | Qwen2-1.5B-Instruct | 40.55 | 98.82 | 0 | - | 254 | 30,477 | 120 | 79,563 | 313 | 110,040 | 0.0000
CoT | AQuA | 2025/1/22 | Qwen2-0.5B-Instruct | 33.07 | 98.82 | 0 | - | 254 | 30,477 | 120 | 86,862 | 342 | 117,339 | 0.0000
PoT | AQuA | 2025/1/7 | gpt-3.5-turbo | 59.45 | 100 | 0 | - | 254 | 225,162 | 886 | 41,492 | 163 | 266,654 | 0.1748
PoT | AQuA | 2025/1/7 | Doubao-lite-32k | 71.65 | 96.85 | 0 | - | 254 | 259,863 | 1,023 | 49,573 | 195 | 309,436 | 0.0147
PoT | AQuA | 2025/1/22 | gpt-4o | 75.2 | 100 | 0 | - | 254 | 222,717 | 877 | 105,191 | 414 | 327,908 | 1.6087
PoT | AQuA | 2025/1/22 | Qwen2.5-72B-Instruct | 75.2 | 100 | 0 | - | 254 | 249,215 | 981 | 42,549 | 168 | 291,764 | 0.1645
PoT | AQuA | 2025/1/22 | Llama-3.3-70B-Instruct | 79.53 | 99.21 | 0 | - | 254 | 240,735 | 948 | 69,064 | 272 | 309,799 | 0.1746
PoT | AQuA | 2025/1/22 | Qwen2.5-7B-Instruct | 68.11 | 100 | 0 | - | 254 | 264,517 | 1,041 | 49,211 | 194 | 313,728 | 0.0000
PoT | AQuA | 2025/1/22 | Llama-3.1-8B-Instruct | 36.61 | 96.85 | 0 | - | 254 | 240,613 | 947 | 50,301 | 198 | 290,914 | 0.0000
PoT | AQuA | 2025/1/22 | Internllm2_5-7B | 36.61 | 98.82 | 0 | - | 254 | 233,505 | 919 | 68,457 | 270 | 301,962 | 0.0000
PoT | AQuA | 2025/1/22 | Qwen2-1.5B-Instruct | 30.71 | 96.46 | 0 | - | 254 | 246,560 | 971 | 51,915 | 204 | 298,475 | 0.0000
PoT | AQuA | 2025/1/22 | Qwen2-0.5B-Instruct | 17.32 | 92.13 | 0 | - | 254 | 258,867 | 1,019 | 63,414 | 250 | 322,281 | 0.0000
SC-CoT | AQuA | 2025/1/7 | gpt-3.5-turbo | 66.14 | 99.21 | 0 | temperature=1, path_num=5 | 254 | 482,192 | 1,898 | 365,143 | 1,438 | 847,335 | 0.7888 |
SC-CoT | AQuA | 2025/1/7 | Doubao-lite-32k | 81.1 | 97.24 | 0 | temperature=1, path_num=5 | 254 | 503,751 | 1,983 | 382,235 | 1,505 | 885,986 | 0.0519 |
SC-CoT | AQuA | 2025/1/22 | gpt-4o | 86.61 | 98.82 | 0 | temperature=1, path_num=5 | 254 | 744,478 | 2,931 | 628,728 | 2,475 | 1,373,206 | 8.1485 |
SC-CoT | AQuA | 2025/1/22 | Qwen2.5-72B-Instruct | 85.04 | 99.21 | 0 | temperature=1, path_num=5 | 254 | 1,051,218 | 4,139 | 784,451 | 3,088 | 1,835,669 | 1.0348 |
SC-CoT | AQuA | 2025/1/22 | Llama-3.3-70B-Instruct | 82.28 | 99.21 | 0 | temperature=1, path_num=5 | 254 | 1,135,251 | 4,469 | 772,673 | 3,042 | 1,907,924 | 1.0756 |
SC-CoT | AQuA | 2025/1/22 | Qwen2.5-7B-Instruct | 79.92 | 100 | 0 | temperature=1, path_num=5 | 254 | 1,098,280 | 4,324 | 747,052 | 2,941 | 1,845,332 | 0.0000 |
SC-CoT | AQuA | 2025/1/22 | Llama-3.1-8B-Instruct | 59.45 | 97.24 | 0 | temperature=1, path_num=5 | 254 | 971,003 | 3,823 | 680,330 | 2,678 | 1,651,333 | 0.0000 |
SC-CoT | AQuA | 2025/1/22 | Internllm2_5-7B | 39.37 | 98.03 | 0 | temperature=1, path_num=5 | 254 | 1,420,494 | 5,592 | 875,728 | 3,448 | 2,296,222 | 0.0000 |
SC-CoT | AQuA | 2025/1/22 | Qwen2-1.5B-Instruct | 23.62 | 96.46 | 0 | temperature=1, path_num=5 | 254 | 1,034,362 | 4,072 | 740,973 | 2,917 | 1,775,335 | 0.0000 |
SC-CoT | AQuA | 2025/1/22 | Qwen2-0.5B-Instruct | 22.83 | 97.24 | 0 | temperature=1, path_num=5 | 254 | 1,246,929 | 4,909 | 968,162 | 3,812 | 2,215,091 | 0.0000 |
ReAct-Pro* | AQuA | 2025/1/7 | gpt-3.5-turbo | 64.57 | 98.03 | 0 | max_steps=10 | 254 | 862,614 | 3,396 | 40,973 | 161 | 903,587 | 0.4928 |
ReAct-Pro* | AQuA | 2025/1/7 | Doubao-lite-32k | 77.56 | 96.06 | 0 | max_steps=10 | 254 | 977,890 | 3,850 | 54,951 | 216 | 1,032,841 | 0.0445 |
ReAct-Pro* | AQuA | 2025/1/22 | gpt-4o | 57.48 | 97.24 | 0 | max_steps=10 | 254 | 615,589 | 2,424 | 76,507 | 301 | 692,096 | 2.3040 |
ReAct-Pro* | AQuA | 2025/1/22 | Qwen2.5-72B-Instruct | 73.23 | 100 | 0 | max_steps=10 | 254 | 441,765 | 1,739 | 121,838 | 480 | 563,603 | 0.3177 |
ReAct-Pro* | AQuA | 2025/1/22 | Llama-3.3-70B-Instruct | 79.13 | 99.61 | 0 | max_steps=10 | 254 | 1,119,143 | 4,406 | 243,236 | 958 | 1,362,379 | 0.7680 |
ReAct-Pro* | AQuA | 2025/1/22 | Qwen2.5-7B-Instruct | 74.41 | 99.21 | 0 | max_steps=10 | 254 | 564,165 | 2,221 | 131,679 | 518 | 695,844 | 0.0000 |
ReAct-Pro* | AQuA | 2025/1/22 | Llama-3.1-8B-Instruct | 55.51 | 96.85 | 0 | max_steps=10 | 254 | 3,764,723 | 14,822 | 576,098 | 2,268 | 4,340,821 | 0.0000 |
ReAct-Pro* | AQuA | 2025/1/22 | Internllm2_5-7B | 40.94 | 96.85 | 0 | max_steps=10 | 254 | 3,592,039 | 14,142 | 836,762 | 3,294 | 4,428,801 | 0.0000 |
ReAct-Pro* | AQuA | 2025/1/22 | Qwen2-1.5B-Instruct | 25.59 | 96.06 | 0 | max_steps=10 | 254 | 4,555,858 | 17,936 | 516,146 | 2,032 | 5,072,004 | 0.0000 |
ReAct-Pro* | AQuA | 2025/1/22 | Qwen2-0.5B-Instruct | 24.02 | 96.85 | 0 | max_steps=10 | 254 | 6,344,167 | 24,977 | 825,920 | 3,252 | 7,170,087 | 0.0000 |
Default settings:
temperature = 0
LLM prices:
- gpt-3.5-turbo:
  - $0.50 / 1M tokens (input)
  - $1.50 / 1M tokens (output)
- Doubao-lite-32k (1 USD = 7.3249 CNY):
  - $0.04096 / 1M tokens (input)
  - $0.08200 / 1M tokens (output)
- gpt-4o-2024-08-06:
  - $2.50 / 1M tokens (input)
  - $10.00 / 1M tokens (output)
- Qwen2.5-72B-Instruct and Llama-3.3-70B-Instruct:
  - Prices can be found at https://cloud.siliconflow.cn/.
- Other open-source LLMs:
  - Deployed locally; see the OmAgent repository for details.
  - Their cost is not counted in the leaderboard.
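The costs in the tables above follow directly from these prices and the token counts. A minimal sketch of the computation (the function name is ours), checked against the IO / gsm8k / gpt-3.5-turbo row:

```python
def run_cost(input_tokens: int, output_tokens: int,
             input_price: float, output_price: float) -> float:
    """Dollar cost of a run; prices are in $ per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1e6

# IO on gsm8k with gpt-3.5-turbo: 546,990 input and 39,563 output tokens
# at $0.50 (input) / $1.50 (output) per 1M tokens.
print(round(run_cost(546_990, 39_563, 0.50, 1.50), 4))  # 0.3328, matching the table
```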
Pass Rate*: the percentage of predictions that are valid, where a prediction counts as valid if it is neither empty nor null.
Algorithm | Dataset | Eval Time | LLM | Framework | Score |
---|---|---|---|---|---|
CoT | gsm8k | 2025/1/7 | gpt-3.5-turbo | Original repo | 79.23 |
CoT | gsm8k | 2025/1/7 | gpt-3.5-turbo | OmAgent | 78.70 |
CoT | AQuA | 2025/1/7 | gpt-3.5-turbo | Original repo | 60.63 |
CoT | AQuA | 2025/1/7 | gpt-3.5-turbo | OmAgent | 61.02 |
PoT | gsm8k | 2025/1/7 | gpt-4o-mini | Original repo | 86.35 |
PoT | gsm8k | 2025/1/7 | gpt-4o-mini | OmAgent | 88.25 |
ReAct | AQuA | 2025/1/7 | gpt-3.5-turbo | Original repo | 35.04 |
ReAct | AQuA | 2025/1/7 | gpt-3.5-turbo | OmAgent | 34.25 |
ReAct | HotpotQA | 2025/1/8 | gpt-3.5-turbo | Original repo | 28.00 |
ReAct | HotpotQA | 2025/1/8 | gpt-3.5-turbo | OmAgent | 27.40 |
Note:
- The original repo is the official repository of the agent implementation.
- OmAgent is the implementation of the agent in this project.
- There is no official implementation of SC-CoT.
Algorithm | Dataset | Eval Time | LLM | Score | Pass Rate |
---|---|---|---|---|---|
ReAct | gsm8k | 2025/1/7 | gpt-3.5-turbo | 38.13 | 100.00 |
ReAct-Pro | gsm8k | 2025/1/7 | gpt-3.5-turbo | 74.91 | 99.39 |
ReAct | AQuA | 2025/1/7 | gpt-3.5-turbo | 34.25 | 97.64 |
ReAct-Pro | AQuA | 2025/1/7 | gpt-3.5-turbo | 64.57 | 98.03 |
Open Agent Leaderboard is built on top of the OmAgent repository.
Acknowledgments
We extend our deepest gratitude to the authors and contributors of the following datasets: gsm8k and AQuA; agent algorithms: CoT, SC-CoT, PoT, and ReAct; and LLMs: gpt-3.5-turbo and Doubao-lite-32k.
If you find our repository beneficial, please cite it:
@misc{open-agent-leaderboard,
title={Open Agent Leaderboard},
author={Om AI Lab},
year={2025},
publisher={GitHub},
howpublished={\url{https://github.com/om-ai-lab/open-agent-leaderboard}}
}
You can follow us on X and Discord for more updates and discussions.
Feel free to submit issues and pull requests.
This project is licensed under the MIT License.