Efficient multi-prompt evaluation of LLMs

Welcome to the PromptEval GitHub repository! Here you will find our implementation of PromptEval and the datasets introduced in:

Maia Polo, Felipe, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson Flavio Melo de Oliveira, Yuekai Sun, and Mikhail Yurochkin. "Efficient multi-prompt evaluation of LLMs." arXiv preprint arXiv:2405.17202 (2024).

Overview

Most popular benchmarks for comparing LLMs rely on a limited set of prompt templates, which may not fully capture the LLMs’ abilities and can affect the reproducibility of results on leaderboards. This repository introduces our implementation of PromptEval, a method for estimating performance across a large set of prompts by borrowing strength across prompts and examples to produce accurate estimates under practical evaluation budgets.
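To make the setting concrete, here is a minimal sketch (not PromptEval itself) that computes a naive per-prompt accuracy from a partially observed correctness matrix. The toy matrix `Y`, the helper `per_prompt_accuracy`, and the use of `None` for unevaluated pairs are illustrative assumptions and not part of this repository.

```python
from statistics import mean, median


def per_prompt_accuracy(Y):
    """Naive baseline: average the observed 0/1 outcomes of each prompt.

    Y[i][j] is 1 or 0 if prompt i was evaluated on example j,
    and None if that (prompt, example) pair was not evaluated.
    """
    accuracies = []
    for row in Y:
        observed = [y for y in row if y is not None]
        accuracies.append(mean(observed) if observed else None)
    return accuracies


# Hand-made toy data (hypothetical): 4 prompt templates x 5 examples.
Y = [
    [1, 1, 0, 1, None],
    [0, None, 0, 1, 0],
    [1, 1, 1, None, 1],
    [None, 0, 1, 1, 0],
]

accs = per_prompt_accuracy(Y)
print("per-prompt accuracy:", accs)
print("median across prompts:", median(accs))
```

With a small budget, each naive average rests on only a few observed entries. That is the situation in which sharing information across prompts and examples, as PromptEval does, is meant to help.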

Quick start

Please check our demo to see how to use PromptEval on your own data.

Repository Structure

data/: Contains the evaluation data used in the experiments.
prompteval/: Source code for the PromptEval method and utilities.
notebooks/: Jupyter notebooks used to create plots for the PromptEval paper.
results/: Results from the experiments conducted in the paper.
mmlu_data/: Contains code for gathering evaluation data.

Installation

To use the code in this repository, clone the repo and install the required dependencies:

git clone https://github.com/felipemaiapolo/prompteval.git
cd prompteval
pip install -e .

Reproducing the main results of the paper

To reproduce the results in our paper, follow these steps after cloning the repo and installing the dependencies:

Download the BBH and LMentry data, produced by the authors of "State of What Art? A Call for Multi-Prompt LLM Evaluation", from here. Place the unzipped folder "raw open-source model responses with gold and auto validation values" inside the data directory;
Process the data by running ./prompteval/create_data.py;
Run the main experiments with ./prompteval/dist_evaluation.py. Example: python ./prompteval/dist_evaluation.py --bench 'BBH' --random_seeds 5;
Run the best prompt identification experiments with ./prompteval/bai_evaluation.py. Example: python ./prompteval/bai_evaluation.py --bench 'BBH' --random_seeds 5;
Create the plots using the notebooks in the notebooks directory.
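The scripted steps above can be chained from a small driver, sketched below. It only uses the script paths and the `--bench` and `--random_seeds` flags shown in the examples; the helper names `build_commands` and `run_pipeline` and the `dry_run` switch are hypothetical conveniences, and the sketch assumes `create_data.py` needs no arguments.

```python
import subprocess
import sys


def build_commands(bench="BBH", random_seeds=5):
    """Return the three script invocations in the order listed above."""
    py = sys.executable  # use the interpreter of the active environment
    seeds = str(random_seeds)
    return [
        [py, "./prompteval/create_data.py"],
        [py, "./prompteval/dist_evaluation.py", "--bench", bench, "--random_seeds", seeds],
        [py, "./prompteval/bai_evaluation.py", "--bench", bench, "--random_seeds", seeds],
    ]


def run_pipeline(bench="BBH", random_seeds=5, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands(bench, random_seeds):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

Run it from the repository root so the relative paths resolve, and set `dry_run=False` once the raw data is in place.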

Fine-tuning embeddings

To fine-tune BERT representations run the following:

python ./prompteval/ft_representations.py --model_name "bert-base-uncased" \
                             --lr 2e-05 \
                             --weight_decay 1e-06 \
                             --gamma .99995 \
                             --bs 96 \
                             --n_epochs 5 \
                             --warmup_steps 200 \
                             --bench "BBH"

Note that this requires the file ./data/Ys.pickle to contain correctness data for the respective benchmark, as produced by the create_data.py script. Add --push_to_hub to automatically push the resulting model to your namespace on the Hugging Face Hub (remember to run huggingface-cli login before training).
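Before starting a fine-tuning run, it can help to confirm that the pickle file exists and loads. The sketch below makes no assumption about the internal structure of `Ys.pickle` and only reports the type of the loaded object; the helper name `load_correctness_data` is hypothetical.

```python
import pickle
from pathlib import Path


def load_correctness_data(path="./data/Ys.pickle"):
    """Load the pickle written by create_data.py, failing early if it is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; run ./prompteval/create_data.py first"
        )
    with path.open("rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    try:
        Ys = load_correctness_data()
        print("Loaded object of type:", type(Ys).__name__)
    except FileNotFoundError as err:
        print(err)
```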

LLM-as-a-judge experiment

To run the LLM-as-a-judge experiment, follow these steps:

Install AlpacaEval 2.0 using the command pip install alpaca-eval==0.6.4;
Run python ./prompteval/generate_prompts.py to generate prompt variations. Having a GPU will accelerate this step because we use SentenceTransformers to encode texts;
Move the directories ./prompteval/data/templates/AlpacaEval/configs and ./prompteval/data/templates/AlpacaEval/templates into the evaluators_configs folder of your AlpacaEval installation; for example, if you are using a Miniconda 3 (or Anaconda) environment, that folder should be inside the directory miniconda3/envs/{ENV_NAME}/lib/python{PYTHON_VERSION}/site-packages/alpaca_eval;
Open ./prompteval/evaluate.py and, at the top of the file, define a variable called evaluators_configs_path set to the path of the evaluators_configs directory; if you are using a Miniconda 3 (or Anaconda) environment, this directory should be home/miniconda3/envs/{ENV_NAME}/lib/python{PYTHON_VERSION}/site-packages/alpaca_eval/evaluators_configs;
Export your OpenAI API key following the instructions at https://pypi.org/project/alpaca-eval/0.6.4/ and run ./prompteval/evaluate.py to conduct the evaluation step;
Run the notebook ./notebooks/llm_judge_plots.ipynb to get the plots.
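Rather than typing the site-packages path by hand, the location of the evaluators_configs directory can be looked up from the installed package. This is a minimal sketch: the helper `package_subdir` is hypothetical, and it assumes that evaluators_configs sits directly inside the installed alpaca_eval package, as the paths above suggest.

```python
from importlib.util import find_spec
from pathlib import Path


def package_subdir(package, subdir):
    """Return <install dir of package>/<subdir>, or None if the package is not installed."""
    spec = find_spec(package)
    if spec is None or spec.origin is None:
        return None
    # spec.origin points at the package's __init__.py; its parent is the package folder.
    return Path(spec.origin).parent / subdir


evaluators_configs_path = package_subdir("alpaca_eval", "evaluators_configs")
print("evaluators_configs_path =", evaluators_configs_path)
```

If the printed path exists in your environment, it can be used as the value of evaluators_configs_path in ./prompteval/evaluate.py.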

MMLU Data

Our collected MMLU data is available on Hugging Face. It includes evaluations of 15 different SOTA LLMs across 100 different prompt templates.

Citing

@article{polo2024efficient,
title={Efficient multi-prompt evaluation of LLMs},
author={Polo, Felipe Maia and Xu, Ronald and Weber, Lucas and Silva, M{\'\i}rian and Bhardwaj, Onkar and Choshen, Leshem and de Oliveira, Allysson Flavio Melo and Sun, Yuekai and Yurochkin, Mikhail},
journal={arXiv preprint arXiv:2405.17202},
year={2024}
}
