informativeBench is a comprehensive benchmark designed to evaluate language models and agents in information-asymmetric collaborative environments. It consists of three distinct datasets:
- FriendsTV
- Needle in the Persona
- Schedule
Each dataset in informativeBench is carefully crafted to test different aspects of information retrieval, reasoning, and collaboration in scenarios where information is unevenly distributed among participants.
A crucial caveat: large language models can easily memorize publicly released benchmarks, leading to data contamination and leakage. This can result in artificially inflated performance metrics that don't accurately reflect a model's true capabilities.
To address this issue, we've open-sourced our data generation pipelines. This approach allows users to dynamically construct benchmarks tailored to their specific needs, ensuring a more accurate evaluation of their language models and agents in information-asymmetric collaborative environments.
Each dataset folder contains a detailed README with instructions on how to generate and customize the data.
By using these pipelines, you can create unique, contamination-free datasets to robustly evaluate your models and agents.
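The dynamic-generation idea can be sketched in a few lines: a procedural generator keyed on a seed produces fresh task instances on demand, so evaluation items never appear verbatim in training data. The sketch below is a hypothetical stand-in, not informativeBench's actual pipeline; the function name, participant names, and slot layout are all illustrative assumptions.

```python
import random

def generate_schedule_task(seed: int, num_people: int = 4):
    """Generate one synthetic scheduling-style task instance.

    Hypothetical illustration only: the real informativeBench pipelines
    may structure tasks differently.
    """
    rng = random.Random(seed)
    people = [f"person_{i}" for i in range(num_people)]
    slots = list(range(9, 18))  # hourly slots, 9:00-17:00

    # Each participant privately knows only their own availability --
    # the information asymmetry the benchmark targets.
    availability = {
        p: sorted(rng.sample(slots, k=rng.randint(3, 6))) for p in people
    }

    # Ground-truth answer: slots where everyone is free.
    common = set(slots)
    for free in availability.values():
        common &= set(free)
    return {"availability": availability, "answer": sorted(common)}

# A fixed seed reproduces the same instance; fresh seeds typically
# yield instances a model cannot have memorized.
task = generate_schedule_task(seed=1)
```

Because instances are derived deterministically from the seed, a generated benchmark is both reproducible (share the seed) and contamination-resistant (pick a new seed for each evaluation run).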