informativeBench

informativeBench is a comprehensive benchmark designed to evaluate language models and agents in information-asymmetric collaborative environments. It consists of three distinct datasets:

FriendsTV
Needle in the Persona
Schedule

Overview

Each dataset in informativeBench is carefully crafted to test different aspects of information retrieval, reasoning, and collaboration in scenarios where information is unevenly distributed among participants.

Importance of Dynamic Benchmarking

It's crucial to note that large language models can easily memorize benchmarks, leading to data contamination and leakage. This can result in artificially inflated performance metrics that don't accurately reflect a model's true capabilities.

To address this issue, we've open-sourced our data generation pipelines. This approach allows users to dynamically construct benchmarks tailored to their specific needs, ensuring a more accurate evaluation of their language models and agents in information-asymmetric collaborative environments.

Dataset Pipelines

Each dataset folder contains a detailed README with instructions on how to generate and customize the data:

FriendsTV Pipeline
Needle in the Persona Pipeline
Schedule Pipeline

By using these pipelines, you can create unique, contamination-free datasets to robustly evaluate your models and agents.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

informativeBench

Overview

Importance of Dynamic Benchmarking

Dataset Pipelines

Files

README.md

Latest commit

History

README.md

File metadata and controls

informativeBench

Overview

Importance of Dynamic Benchmarking

Dataset Pipelines