# Evaluation Metrics

This folder contains code to compute evaluation metrics on LLM360 models.

In particular, see the subfolders described below, most notably harness/.

## Our Approach

We provide implementations of evaluations on a variety of benchmarks: conventional benchmarks such as MMLU, HellaSwag, and ARC; user-preference-aligned benchmarks such as MT-Bench; long-context evaluations such as LongEval; and additional studies on safety benchmarks covering truthfulness, toxicity, and bias. Moreover, we report results on preselected checkpoints from a suite of LLMs that were all trained on the same data, seen in the exact same order, so we can better observe and understand how our models develop and evolve over the training process. We also provide public access to all checkpoints, all code, and all wandb dashboards with detailed training and evaluation curves.
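For illustration, here is a minimal sketch of how one of these per-checkpoint evaluations might be run with the Python API of EleutherAI's lm-evaluation-harness (the `lm_eval` package, v0.4+). The revision tag and task choice are placeholder assumptions, and the actual scripts in harness/ may be organized differently:

```python
# A minimal sketch, assuming EleutherAI's lm-evaluation-harness (`lm_eval`)
# is installed. The revision tag and task below are illustrative
# placeholders, not the exact settings used in harness/.
import lm_eval

# Evaluate a single released checkpoint (one HF "revision") on HellaSwag
# with the 10-shot setting reported in the table below.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=LLM360/Amber,revision=ckpt_356",
    tasks=["hellaswag"],
    num_fewshot=10,
)

# `results["results"]` maps each task name to its metric dict.
print(results["results"]["hellaswag"])
```

Looping the same call over every released revision yields the per-checkpoint evaluation curves that the wandb dashboards track.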

## List of Analysis and Metrics

Below is a list of the main evaluation metrics we use, primarily implemented in the harness/ subfolder. For each model we release, we link to the specific wandb reports once an evaluation is done. Please refer to the model cards (Amber, CrystalCoder) for any terms or definitions, and stay tuned for upcoming changes!

An empty cell means the evaluation has not been reported for that model.

| Metrics/Analysis | Description | Amber | CrystalCoder |
|---|---|---|---|
| mmlu | A test to measure a text model's multitask accuracy, covering 57 tasks including elementary mathematics, US history, computer science, law, and more | 5-shot | 0-shot, 5-shot |
| race | A test to measure reading comprehension ability | 0-shot | 0-shot |
| arc_challenge | A set of grade-school science questions | 25-shot | 0-shot, 25-shot |
| boolq | A question-answering dataset of yes/no questions containing 15,942 examples | 0-shot | 0-shot |
| hellaswag | A test of commonsense inference | 10-shot | 0-shot, 10-shot |
| openbookqa | A question-answering dataset modeled after open-book exams for assessing human understanding of a subject | 0-shot | 0-shot |
| piqa | A test to measure physical commonsense and reasoning | 0-shot | 0-shot |
| siqa | A test to measure commonsense reasoning about social interactions | 0-shot | |
| winogrande | An adversarial and difficult Winograd benchmark at scale, for commonsense reasoning | 0-shot | 0-shot, 5-shot |
| crowspairs | A challenge set for evaluating language models' tendency to generate biased outputs | 0-shot | |
| truthfulqa | A test to measure a model's propensity to reproduce falsehoods commonly found online | 0-shot | 0-shot |
| pile | A test to measure a model's perplexity; we cover 18 of the 22 sub-datasets | perplexity | |
| drop | A reading comprehension benchmark requiring discrete reasoning over paragraphs | | 3-shot |
| mbpp | Around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers | | pass@1, pass@10 |
| humaneval | A test to measure functional correctness for synthesizing programs from docstrings | | pass@1, pass@10 |
| gsm8k | Diverse grade-school math word problems to measure a model's ability to solve multi-step mathematical reasoning problems | | 5-shot |
| copa | A test to assess progress in open-domain commonsense causal reasoning | | 0-shot |
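For the code benchmarks (mbpp, humaneval), the pass@k numbers above follow the standard unbiased estimator introduced with HumanEval (Chen et al., 2021). The sketch below restates that estimator for reference; the function name and example numbers are our own illustrations, not values from our results:

```python
# A reference sketch of the unbiased pass@k estimator (Chen et al., 2021)
# used for code benchmarks such as humaneval and mbpp. The function name
# is illustrative; harness/ may organize this differently.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes,
    given that c of n generated samples passed the unit tests.

    Computes 1 - C(n - c, k) / C(n, k) in a numerically stable way.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical example: 200 samples per problem, 37 passing -> pass@10.
print(round(pass_at_k(n=200, c=37, k=10), 4))
```

Per-problem estimates are averaged over the benchmark to produce the reported pass@1 and pass@10 scores.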