Commit
Merge pull request #8 from zhimin-z/main
Update leaderboard
Showing 2 changed files with 14 additions and 24 deletions.

```diff
@@ -20,7 +20,7 @@ To use learn how to use the tools directly jump to [Tools Overview](#tools-overv
 - [Loading the Model](#loading-the-model)
 - [Loading the Dataset](#loading-the-dataset)
 - [Selecting a Metric](#selecting-a-metric)
-- [Observations](#observations)
+- [Leaderboard](#leaderboard)
 - [Tools Overview](#tools-overview)
 - [Getting Started](#getting-started)
 - [Usage](#usage)
```

```diff
@@ -104,23 +104,22 @@ The most common metrics used for evaluation are:
 - Subjectivity: Human evaluation can be subjective and prone to biases, making it challenging to establish consistent benchmarks
 - Resource-Intensive: Requires human annotators, time, and resources, making it less feasible for large-scale evaluations

-# <a id="observations"></a> Observations
+# <a id="leaderboard"></a> Leaderboard

 We provide [LLM Benchmarker Suite Leaderboard](https://llm-evals.formula-labs.com/) for the community to rank all public models and API models. If you would like to join the evaluation, please provide the model repository URL or a standard API interface to the email address `[email protected]`.

-| Benchmark (Higher is Better) | MPT (7B) | Falcon (7B) | Llama-2 (7B) | Llama-2 (13B) | MPT (30B) | Falcon (40B) | Llama-1 (65B) | Llama-2 (70B) |
-|:---------------------------:|:--------:|:-----------:|:------------:|:-------------:|:---------:|:------------:|:-------------:|:-------------:|
-| *MMLU* | 26.8 | 26.2 | 45.3 | 54.8 | 46.9 | 55.4 | 63.4 | **68.9** |
-| *TriviaQA* | 59.6 | 56.8 | 68.9 | 77.2 | 71.3 | 78.6 | 84.5 | **85** |
-| *Natural Questions* | 17.8 | 18.1 | 22.7 | 28.0 | 23.0 | 29.5 | 31.0 | **33.0** |
-| *GSM8K* | 6.8 | 6.8 | 14.6 | 28.7 | 15.2 | 19.6 | 50.9 | **56.8** |
-| *HumanEval* | 18.3 | N/A | 12.8 | 18.3 | 25.0 | N/A | 23.7 | **29.9** |
-| *AGIEval* | 23.5 | 21.2 | 29.3 | 39.1 | 33.8 | 37.0 | 47.6 | **54.2** |
-| *BoolQ* | 75.0 | 67.5 | 77.4 | 81.7 | 79.0 | 83.1 | **85.3** | 85.0 |
-| *HellaSwag* | 76.4 | 74.1 | 77.2 | 80.7 | 79.9 | 83.6 | 84.2 | **85.3** |
-| *OpenBookQA* | 51.4 | 51.6 | 58.6 | 57.0 | 52.0 | 56.6 | **60.2** | **60.2** |
-| *QuAC* | 37.7 | 18.8 | 39.7 | 44.8 | 41.1 | 43.3 | 39.8 | **49.3** |
-| *Winogrande* | 68.3 | 66.3 | 69.2 | 72.8 | 71.0 | 76.9 | 77.0 | **80.2** |
+*To access the locally created metrics dashboard, initiate a server by navigating to the `static` directory and running the command `python3 -m http.server 8000`. Afterward, open your web browser and visit `http://<Host IP>:8000/`. Alternatively, you can compare your evaluations with ours on the [LLM Benchmarker Suite Leaderboard](https://llm-evals.formula-labs.com/).*
+
+| Model | MMLU | TriviaQA | Natural Questions | GSM8K | HumanEval | AGIEval | BoolQ | HellaSWAG | OpenBookQA | QuAC | Winogrande |
+|:------|-----:|---------:|------------------:|------:|----------:|--------:|------:|----------:|-----------:|-----:|-----------:|
+| MPT (7B) | 26.8 | 59.6 | 17.8 | 6.8 | 18.3 | 23.5 | 75 | 76.4 | 51.4 | 37.7 | 68.3 |
+| Falcon (7B) | 26.2 | 56.8 | 18.1 | 6.8 | nan | 21.2 | 67.5 | 74.1 | 51.6 | 18.8 | 66.3 |
+| LLaMA-2 (7B) | 45.3 | 68.9 | 22.7 | 14.6 | 12.8 | 29.3 | 77.4 | 77.2 | 58.6 | 39.7 | 69.2 |
+| Llama-2 (13B) | 54.8 | 77.2 | 28 | 28.7 | 18.3 | 39.1 | 81.7 | 80.7 | 57 | 44.8 | 72.8 |
+| MPT (30B) | 46.9 | 71.3 | 23 | 15.2 | 25 | 33.8 | 79 | 79.9 | 52 | 41.1 | 71 |
+| Falcon (40B) | 55.4 | 78.6 | 29.5 | 19.6 | nan | 37 | 83.1 | 83.6 | 56.6 | 43.3 | 76.9 |
+| LLaMA-1 (65B) | 63.4 | 84.5 | 31 | 50.9 | 23.7 | 47.6 | 85.3 | 84.2 | 60.2 | 39.8 | 77 |
+| LLaMA-2 (70B) | 68.9 | 85 | 33 | 56.8 | 29.9 | 54.2 | 85 | 85.3 | 60.2 | 49.3 | 80.2 |

 A quick look at the above table reveals several interesting observations:
 - Larger models generally perform better across the board. LLaMA-2 70B has the top score on most of the benchmarks.
```
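
The dashboard note added in this hunk relies on Python's built-in `http.server` module. As a rough illustration of what `python3 -m http.server 8000` does when run from the `static` directory, here is a minimal sketch that serves the same directory programmatically; the directory name and port come from the note above, everything else is an assumption:

```python
# Minimal sketch of what `python3 -m http.server 8000` (run inside the
# `static` directory) does: serve that directory over HTTP on port 8000.
# Assumes a `static/` directory exists in the current working directory.
import http.server
import os
import socketserver

os.chdir("static")  # serve files from the static directory
PORT = 8000         # same port as in the README note

handler = http.server.SimpleHTTPRequestHandler
with socketserver.TCPServer(("", PORT), handler) as httpd:
    print(f"Dashboard available at http://<Host IP>:{PORT}/")
    httpd.serve_forever()
```

Once it is running, visiting `http://<Host IP>:8000/` in a browser shows the generated metrics dashboard, exactly as the note describes.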

````diff
@@ -242,9 +241,6 @@ Static evaluation offers controlled and comparable assessments of a model's perf
 python opencompass/run.py configs/eval_demo.py -w outputs/demo
 ```

-![leaderboard](static/assets/table.png)
-*To access the locally created metrics dashboard, initiate a server by navigating to the `static` directory and running the command `python3 -m http.server 8000`. Afterward, open your web browser and visit `http://<Host IP>:8000/`. Alternatively, you can compare your evaluations with ours on the [LLM Benchmarker Suite Leaderboard](https://llm-evals.formula-labs.com/).*
-
 - **Comprehensive support for models and datasets**: Out-of-box support for 20+ HuggingFace and API models with a model evaluation scheme of 50+ datasets with about 300,000 questions, comprehensively evaluating the capabilities of the models in five dimensions.

 - **Efficient distributed evaluation**: Single line command to implement task division and distributed evaluation, completing the full evaluation of billion-scale models in just a few hours.
````

```diff
@@ -502,12 +498,6 @@ print("RACE dataset:", race_data)

 - Ensure that you have internet access to download Hugging Face models and access the proprietary language model APIs.

-## <a id="leaderboard"></a> Leaderboard
-
-We provide [LLM Benchmarker Suite Leaderboard](https://llm-evals.formula-labs.com/) for community to rank all public models and API models. If you would like to join the evaluation, please provide the model repository URL or a standard API interface to the email address `[email protected]`.
-
-
 ## <a id="dataset-support"></a> Dataset Support

 <table align="center">
```
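
The hunk above sits next to the README's dataset-loading walkthrough (note the `print("RACE dataset:", race_data)` context line and the reminder about internet access for Hugging Face downloads). A hedged sketch of how such a `race_data` object could be obtained with the Hugging Face `datasets` library follows; the `"race"` / `"high"` identifiers and the split are assumptions, since the actual loading code lies outside this hunk:

```python
# Hedged sketch: one way to populate `race_data`, assuming the Hugging Face
# `datasets` library and the public "race" dataset (config "high").
# The README's real loading code is outside the diff shown above.
from datasets import load_dataset

race_data = load_dataset("race", "high", split="validation")
print("RACE dataset:", race_data)  # mirrors the context line in the hunk
```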
Binary file not shown.