
Benchmarking Agentic LLM and VLM Reasoning On Games


synth-laboratories/BALROG

 
 



BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games

BALROG is a novel benchmark evaluating agentic LLM and VLM capabilities on long-horizon interactive tasks using reinforcement learning environments. Check out how current models fare on our leaderboard. You can read more about BALROG in our paper.

Features

  • Comprehensive evaluation of agentic abilities
  • Support for both language and vision-language models
  • Integration with popular AI APIs and local deployment
  • Easy integration for custom agents, new environments and new models

Installation

We advise using conda for the installation:

conda create -n balrog python=3.10 -y
conda activate balrog

git clone https://github.com/balrog-ai/BALROG.git
cd BALROG
pip install -e .
balrog-post-install

Docker

We provide Docker images. Please see the relevant README.

⚡️ Evaluate using vLLM locally

We support running LLMs/VLMs locally using vLLM. You can spin up a vLLM server and evaluate your agent on BALROG against it in the following way:

pip install vllm numpy==1.23
vllm serve meta-llama/Llama-3.2-1B-Instruct --port 8080

python eval.py \
  agent.type=naive \
  agent.max_image_history=0 \
  agent.max_history=16 \
  eval.num_workers=32 \
  client.client_name=vllm \
  client.model_id=meta-llama/Llama-3.2-1B-Instruct \
  client.base_url=http://0.0.0.0:8080/v1

Check out vLLM for more options on how to serve your models fast and efficiently.
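If you launch evaluations from a script rather than by hand, the key=value overrides shown above can be assembled programmatically. The following is a minimal sketch, not part of BALROG: `build_eval_command` and `vllm_overrides` are hypothetical names introduced for illustration, and the override keys and values are copied from the command above.

```python
import shlex
import sys


def build_eval_command(overrides, script="eval.py"):
    """Return an argv list of the form: python eval.py key=value ..."""
    args = [sys.executable, script]
    for key, value in overrides.items():
        args.append(f"{key}={value}")
    return args


# Same overrides as the vLLM command above.
vllm_overrides = {
    "agent.type": "naive",
    "agent.max_image_history": 0,
    "agent.max_history": 16,
    "eval.num_workers": 32,
    "client.client_name": "vllm",
    "client.model_id": "meta-llama/Llama-3.2-1B-Instruct",
    "client.base_url": "http://0.0.0.0:8080/v1",
}

command = build_eval_command(vllm_overrides)
# Print the command so it can be inspected before running it.
print(shlex.join(command))
# To launch it, call subprocess.run(command, check=True) from the repository root.
```

Keeping the overrides in a dictionary makes it easy to change a single value, such as `eval.num_workers`, between runs without retyping the whole command.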

🛜 Evaluate using popular APIs

We support out-of-the-box clients for the OpenAI, Anthropic, and Google Gemini APIs. First set up the API key for the provider you plan to use:

export OPENAI_API_KEY=<KEY>
export ANTHROPIC_API_KEY=<KEY>
export GEMINI_API_KEY=<KEY>

Then run the evaluation with:

python eval.py \
  agent.type=naive \
  agent.max_image_history=0 \
  eval.num_workers=64 \
  client.client_name=openai \
  client.model_id=gpt-4o-mini-2024-07-18
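A long evaluation that fails on its first request because of an unset key wastes time, so a small pre-flight check may help. The sketch below is illustrative and not part of BALROG: `missing_api_key` is a hypothetical helper, and while the environment variable names come from the export step above, only the `openai` client name appears in this README; `anthropic` and `gemini` are assumed names.

```python
import os

# Environment variable names are from the export step above.
# Only "openai" is a confirmed client name; the other two keys are assumptions.
KEY_FOR_CLIENT = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
}


def missing_api_key(client_name, env=None):
    """Return the name of the unset variable, or None if nothing is missing."""
    env = os.environ if env is None else env
    var = KEY_FOR_CLIENT.get(client_name)
    if var is None:
        # Client not in the table (for example a local vLLM server): nothing to check.
        return None
    return None if env.get(var) else var


if __name__ == "__main__":
    problem = missing_api_key("openai")
    print("ready" if problem is None else f"set {problem} before running eval.py")
```

Passing `env` explicitly keeps the helper easy to test without touching the real environment.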

Documentation

We welcome contributions! Please see our Contributing Guidelines for details.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use BALROG in any of your work, please cite:

@article{paglieri2024balrog,
  title={Benchmarking Agentic LLM and VLM Reasoning On Games},
  author={Paglieri, Davide and Cupia{\l}, Bart{\l}omiej and Coward, Sam and Piterbarg, Ulyana and Wo{\l}czyk, Maciej and Khan, Akbir and Pignatelli, Eduardo and Kuci{\'n}ski, {\L}ukasz and Pinto, Lerrel and Fergus, Rob and Foerster, Jakob Nicolaus and Parker-Holder, Jack and Rockt{\"a}schel, Tim},
  journal={arXiv preprint arXiv:2411.13543},
  year={2024}
}
