Tree Search for Language Model Agents

[Website] [Paper]

Overview

We propose an inference-time tree search algorithm to enable language model agents to perform exploration and multi-step planning in interactive web environments. This repository demonstrates how to run our method on the VisualWebArena and WebArena benchmarks.

TODOs

  • Add other options besides gpt-4o for the value function

News

  • [07/24/2024]: Released trajectories of the gpt-4o agent.
  • [06/19/2024]: GitHub repo released.

Install

# Python 3.10 or 3.11 recommended
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
playwright install
pip install -e .

End-to-end Evaluation on (V)WA

  1. Set up the standalone environments. Please check out this page for details.

  2. Configure the URLs for each website. First, export the DATASET to be visualwebarena:

export DATASET=visualwebarena

Then, set the URL for each website:

export CLASSIFIEDS="<your_classifieds_domain>:9980"
export CLASSIFIEDS_RESET_TOKEN="4b61655535e7ed388f0d40a93600254c"  # Default reset token for classifieds site, change if you edited its docker-compose.yml
export SHOPPING="<your_shopping_site_domain>:7770"
export REDDIT="<your_reddit_domain>:9999"
export WIKIPEDIA="<your_wikipedia_domain>:8888"
export HOMEPAGE="<your_homepage_domain>:4399"

If you want to run on the WebArena tasks instead, make sure to also set up the CMS, GitLab, and map environments, and then set their respective environment variables:

export DATASET=webarena
export SHOPPING_ADMIN="<your_e_commerce_cms_domain>:7780/admin"
export GITLAB="<your_gitlab_domain>:8023"
export MAP="<your_map_domain>:3000"

  3. Generate config files for each test example:

python scripts/generate_test_data.py

You will see *.json files generated in the config_files folder. Each file contains the configuration for one test example.

  4. Obtain and save the auto-login cookies for all websites:

bash prepare.sh

  5. Set up API keys.

If using OpenAI models, set a valid OpenAI API key (starting with sk-) as the environment variable:

export OPENAI_API_KEY=your_key

  6. Launch the evaluation. For example, to reproduce our GPT-4o + Search agent, you can run the script provided:

bash scripts/run_vwa_shopping_search.sh

This script runs the search agent with the default hyperparameters from our paper on the full set of VWA shopping tasks. Note that the baselines that include a captioning model run on GPU by default (e.g., BLIP-2-T5XL as the captioning model will take up approximately 12GB of GPU VRAM). Similarly, the other bash scripts in scripts/ reproduce the results on the other VWA sites and the text-only WA environment.

By default, the scripts run the agents with search enabled. If you wish to reproduce the baseline results (without search), set --agent_type prompt when executing run.py.
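
Since the steps above rely on several environment variables, a small preflight check can catch a missing one before a long evaluation starts. The sketch below is not part of this repository: the function name `missing_env_vars` is hypothetical, the variable names are copied from the export lines above, and the grouping per dataset (in particular, that WebArena needs the shared sites but not the classifieds site) is an assumption. `OPENAI_API_KEY` is included as an extra because the steps above set it for OpenAI models.

```python
import os
from typing import Mapping, Sequence

# Variable names come from the export lines in this README. Which ones each
# dataset needs is an ASSUMPTION for illustration, not read from the codebase.
REQUIRED_VARS = {
    "visualwebarena": [
        "CLASSIFIEDS", "CLASSIFIEDS_RESET_TOKEN",
        "SHOPPING", "REDDIT", "WIKIPEDIA", "HOMEPAGE",
    ],
    "webarena": [
        "SHOPPING", "REDDIT", "WIKIPEDIA", "HOMEPAGE",
        "SHOPPING_ADMIN", "GITLAB", "MAP",
    ],
}


def missing_env_vars(
    env: Mapping[str, str] = os.environ,
    extra: Sequence[str] = ("OPENAI_API_KEY",),
) -> list[str]:
    """Return the names of expected variables that are unset or empty.

    If DATASET itself is unset or unrecognized, only ["DATASET"] is returned,
    because the remaining requirements depend on it.
    """
    dataset = env.get("DATASET", "")
    if dataset not in REQUIRED_VARS:
        return ["DATASET"]
    required = list(REQUIRED_VARS[dataset]) + list(extra)
    return [name for name in required if not env.get(name)]
```

Calling `missing_env_vars()` with no arguments inspects the current process environment; an empty list means every expected variable is set. Pass `extra=()` if you are not using OpenAI models.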

Running Llama-3 models

If you wish to run the Llama-3 models used in our paper, first set up a vLLM OpenAI-compatible server. Then, update the OPENAI_BASE_URL environment variable in scripts/run_llama_vwa_shopping_search.sh to point to the URL of the running vLLM server. This particular script shows how to run the Llama-3 agent on the VWA shopping environment; the scripts for the other environments are otherwise very similar to the OpenAI ones.
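
Before launching the Llama-3 script, it can help to confirm that the vLLM server is reachable at the URL you put in OPENAI_BASE_URL. The sketch below is not part of this repository. It assumes the base URL ends in `/v1` (for example `http://localhost:8000/v1`, a hypothetical address), that the server exposes the OpenAI-style model listing endpoint, and that it was started without an API key requirement. The helper names are hypothetical.

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL from an OpenAI-style base URL."""
    return base_url.rstrip("/") + "/models"


def list_served_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """Return the model IDs reported by the server at base_url.

    Raises urllib.error.URLError if the server cannot be reached.
    ASSUMPTION: no Authorization header is needed.
    """
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]
```

For example, `list_served_models(os.environ["OPENAI_BASE_URL"])` (after `import os`) should include the Llama-3 model name you pass to the agent; if it raises a connection error, the URL in the script likely needs updating.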

Agent Trajectories

We release the agent trajectories and results of the gpt-4o agent (with gpt-4o as the reward function) here. They are saved in the same format specified in run.py.

Citation

If you find our methods or code useful, please consider citing our paper:

@article{koh2024tree,
  title={Tree Search for Language Model Agents},
  author={Koh, Jing Yu and McAleer, Stephen and Fried, Daniel and Salakhutdinov, Ruslan},
  journal={arXiv preprint arXiv:2407.01476},
  year={2024}
}

Acknowledgements

Our code is based heavily on the VisualWebArena codebase and the WebArena codebase.