Turning FastChat upside down to benchmark open-source LLMs

Winning percentage in an all-against-all competition. Each model was asked 70 questions, and the answers were evaluated by GPT-3.5 (via the API). Shown are the mean and standard deviation of 3 replicates per model. Control: GPT-3.5 answers “shifted”, so that each answer is unrelated to the question asked. Bottom: human preference as Elo ratings, as assessed in the LMSYS Chatbot Arena.
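The arithmetic behind such a figure can be sketched as follows. This is a minimal illustration, not the code in `reviews_winrate_pairwise.py`: the tuple layout, the function names `win_percentages` and `summarize`, and the choice to count a tie as a non-win for both models are all assumptions made for the example.

```python
from statistics import mean, stdev


def win_percentages(results):
    """Winning percentage per model for one replicate.

    `results` is a list of (model_a, model_b, winner) tuples, where
    `winner` is model_a, model_b, or None for a tie. Counting a tie as
    a non-win for both sides is an assumption of this sketch.
    """
    wins, games = {}, {}
    for model_a, model_b, winner in results:
        for model in (model_a, model_b):
            games[model] = games.get(model, 0) + 1
            wins.setdefault(model, 0)
        if winner is not None:
            wins[winner] += 1
    return {m: 100.0 * wins[m] / games[m] for m in games}


def summarize(replicates):
    """Mean and sample std. dev. of each model's winning percentage.

    `replicates` is a list of result lists, one per replicate; at least
    two replicates are needed for the standard deviation.
    """
    per_replicate = [win_percentages(r) for r in replicates]
    models = sorted(per_replicate[0])
    return {
        m: (
            mean([p[m] for p in per_replicate]),
            stdev([p[m] for p in per_replicate]),
        )
        for m in models
    }
```

Calling `summarize` with three replicate lists returns one `(mean, std_dev)` pair per model, which corresponds to a bar and its error bar in a plot of this kind.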

https://medium.com/@geronimo7/open-source-chatbots-in-the-wild-9a44d7a41a48

# review a pair of answer files with GPT-3.5
export APIKEY="YOURKEY"
python3 get_review.py -r "gpt-3.5-pairwise" -a table/answer/answer_vicuna-13b.jsonl table/answer/answer_pythia-12b-sft-v8-7k-steps.jsonl -o table/review_pairwise/review_vicuna-13b_pythia-12b-sft-v8-7k-steps -dr 3 -k "$APIKEY"

# calculate winning percentage
python3 reviews_winrate_pairwise.py -r table/reviews_pairwise

# generate HTML
python3 generate_webpage_data_from_table.py -r table/reviews_pairwise
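For an all-against-all comparison, the review command has to be run once per pair of models. The sketch below only prints the command lines; it does not call the API. It reuses the flags and file-name pattern from the example command above. The helper name `review_commands`, its parameters, and the choice to review each unordered pair once are assumptions of this example, and the `-dr` value is copied from the example without interpreting it.

```python
from itertools import combinations
import shlex


def review_commands(models, answer_dir="table/answer",
                    out_dir="table/review_pairwise", dr=3):
    """Build one get_review.py command line per unordered pair of models.

    File names follow the pattern of the example command:
    answer_<model>.jsonl as input and review_<a>_<b> as output.
    """
    commands = []
    for a, b in combinations(models, 2):
        args = [
            "python3", "get_review.py",
            "-r", "gpt-3.5-pairwise",
            "-a", f"{answer_dir}/answer_{a}.jsonl",
            f"{answer_dir}/answer_{b}.jsonl",
            "-o", f"{out_dir}/review_{a}_{b}",
            "-dr", str(dr),
        ]
        # The key is left as a shell variable so it never appears in the output.
        commands.append(" ".join(shlex.quote(x) for x in args) + ' -k "$APIKEY"')
    return commands


if __name__ == "__main__":
    for command in review_commands(["vicuna-13b", "pythia-12b-sft-v8-7k-steps"]):
        print(command)
```

The printed lines can be inspected and then pasted into a shell in which `APIKEY` has been exported. With n models this yields n(n-1)/2 commands.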

Files in the repository:

table
webpage
.gitignore
.pylintrc
README.md
fix_parseWinner.py
generate_webpage_data_from_table.py
get_model_answer.py
get_review.py
get_review_helper.py
randomize_model_answer.py
reviews_winrate_pairwise.py

g588928812/FastChat_eval
