LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents

Jae-Woo Choi¹*, Youngwoo Yoon¹*, Hyobin Ong¹,², Jaehong Kim¹, Minsu Jang¹,² (*equal contribution)

¹Electronics and Telecommunications Research Institute, ²University of Science and Technology

We introduce a system for automatically quantifying the performance of task planners for home-service agents. Task planners are tested on two dataset-simulator pairs: 1) ALFRED and AI2-THOR, and 2) an extension of Watch-And-Help and VirtualHome. Using the proposed benchmark system, we perform extensive experiments with LLMs and prompts, and explore several extensions of the baseline planner.

Environment

Ubuntu 14.04+ is required. The scripts were developed and tested on Ubuntu 22.04 with Python 3.8.

You can use WSL-Ubuntu on Windows 10/11.

Install

  1. Clone the whole repo.

    $ git clone {repo_url}
  2. Set up a virtual environment.

    $ conda create -n {env_name} python=3.8
    $ conda activate {env_name}
  3. Install PyTorch (2.0.0) first (see https://pytorch.org/get-started/locally/).

    # exemplary install command for PyTorch 2.0.0 with CUDA 11.7
    $ pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 --index-url https://download.pytorch.org/whl/cu117
  4. Install the Python packages listed in requirements.txt (a quick sanity check is sketched after this list).

    $ pip install -r requirements.txt
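
To sanity-check the installation, you can run a short Python snippet. This is just a hedged helper (not part of the repo) that confirms the PyTorch build from step 3:

    # check_env.py -- quick sanity check, not part of this repo
    import torch

    print("PyTorch version:", torch.__version__)         # expect 2.0.0 (+cu117 for the CUDA build)
    print("CUDA available:", torch.cuda.is_available())  # should print True on a GPU machine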

Benchmarking on ALFRED

Download the ALFRED dataset.

$ cd alfred/data
$ sh download_data.sh json

Benchmarking

$ python src/evaluate.py --config-name=config_alfred

You can override configuration values on the command line; we use Hydra for configuration management. For example:

$ python src/evaluate.py --config-name=config_alfred planner.model=EleutherAI/gpt-neo-125M
$ python src/evaluate.py --config-name=config_alfred alfred.x_display='1'
$ python src/evaluate.py --config-name=config_alfred alfred.eval_portion_in_percent=100 prompt.num_examples=18
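
The same overrides can also be applied programmatically through Hydra's compose API. A minimal sketch, assuming the Hydra config files sit in a conf/ directory relative to the calling script (check the repo for the actual location and key names):

    # compose_config.py -- hedged sketch using Hydra's compose API
    from hydra import compose, initialize
    from omegaconf import OmegaConf

    # "conf" is an assumed config directory; adjust to where config_alfred actually lives
    with initialize(version_base=None, config_path="conf"):
        cfg = compose(
            config_name="config_alfred",
            overrides=["planner.model=EleutherAI/gpt-neo-125M", "prompt.num_examples=18"],
        )
    print(OmegaConf.to_yaml(cfg))  # inspect the fully resolved configuration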

Headless Server

Run the startx.py script before running ALFRED experiments on a headless server. The command below uses 1 as the X_DISPLAY id, but you can use a different id such as 0.

$ sudo python3 alfred/scripts/startx.py 1

Benchmarking on Watch-And-Help

Download the VirtualHome Simulator

  • Download the VirtualHome simulator v2.2.2 and extract it:
$ cd {project_root}/virtualhome/simulation/unity_simulator/
$ wget http://virtual-home.org//release/simulator/v2.0/v2.2.2/linux_exec.zip
$ unzip linux_exec.zip

Benchmarking on Watch-And-Help-NL

  • Open a new terminal and run the VirtualHome simulator.
$ cd {project_root}
$ ./virtualhome/simulation/unity_simulator/linux_exec.x86_64
  • Open another terminal and evaluate.
$ cd {project_root}
$ python src/evaluate.py --config-name=config_wah
  • You can override the configuration here as well (Hydra is used for configuration management).
$ cd {project_root}
$ python src/evaluate.py --config-name=config_wah planner.model_name=EleutherAI/gpt-neo-1.3B prompt.num_examples=10

Benchmarking on Watch-And-Help-NL Using a Headless PC

  • Open a new terminal and run the X server.
$ cd {project}/virtualhome
$ sudo python helper_scripts/startx.py $display_num
  • Open another terminal and run the Unity simulator.
$ cd {project}/virtualhome
$ DISPLAY=:$display_num ./simulation/unity_simulator/linux_exec.x86_64 -batchmode
  • Open another terminal and evaluate.
$ cd {project_root}
$ python src/evaluate.py --config-name=config_wah_headless

Extensions

In-context example selection

$ python src/evaluate.py --config-name=config_wah prompt.select_method=same_task
$ python src/evaluate.py --config-name=config_wah prompt.select_method=topk
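
The topk method suggests retrieving the in-context examples most similar to the current task instruction. Below is a minimal sketch of such top-k retrieval using sentence-transformers; the model name and scoring are illustrative assumptions, not the repo's actual implementation:

    # topk_examples.py -- hedged sketch of top-k in-context example selection
    from typing import List
    from sentence_transformers import SentenceTransformer, util

    def select_topk(query: str, example_pool: List[str], k: int = 10) -> List[str]:
        """Return the k pool examples most similar to the query instruction."""
        model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
        query_emb = model.encode(query, convert_to_tensor=True)
        pool_emb = model.encode(example_pool, convert_to_tensor=True)
        scores = util.cos_sim(query_emb, pool_emb)[0]    # cosine similarity to each example
        top_idx = scores.topk(min(k, len(example_pool))).indices
        return [example_pool[int(i)] for i in top_idx]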

Replanning

$ python src/evaluate.py --config-name=config_alfred planner.use_predefined_prompt=True
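
Replanning lets the planner recover when a generated step fails in the simulator. The sketch below shows the general shape of such a loop; planner.plan_next_step and simulator.execute are hypothetical stand-ins, not the actual interfaces in this repo:

    # replanning_loop.py -- hedged sketch; API names are illustrative
    def run_with_replanning(planner, simulator, instruction, max_steps=30):
        history = []  # executed (or failed) steps, fed back into the next prompt
        for _ in range(max_steps):
            step = planner.plan_next_step(instruction, history)  # hypothetical planner call
            if step == "done":
                break
            if simulator.execute(step):                          # hypothetical simulator call
                history.append(step)
            else:
                history.append(f"{step} -> failed")  # record the failure so the planner can replan
        return history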

Extract training samples from ALFRED for language model finetuning

Make sure you have the preprocessed data (run the ALFRED benchmark at least once).

$ python src/alfred/exmaine_alfred_data.py

The output text file resource/alfred_train_text_samples.txt can be used for finetuning.
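
Since the samples are plain text, one straightforward way to consume them is via the Hugging Face datasets library; a minimal sketch (tokenization and trainer setup are up to you):

    # load_finetune_data.py -- hedged sketch of loading the extracted samples
    from datasets import load_dataset

    # one sample per line in the extracted file
    ds = load_dataset("text", data_files="resource/alfred_train_text_samples.txt")
    print(ds["train"][0])  # inspect a sample before tokenizing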

WAH-NL Dataset

You can find the WAH-NL data, our extension of WAH, in the ./dataset folder.

FAQ

  • Running out of disk space for Hugging Face models

    • You can set the cache folder to be in another disk.
      $ export TRANSFORMERS_CACHE=/mnt/otherdisk/.hf_cache/
  • I encountered 'cannot find X server with xdpyinfo' when running ALFRED experiments.

    • Please try another x_display number (this should be a string; e.g., '1') in the config file.
      $ python src/evaluate.py --config-name=config_alfred alfred.x_display='1'

Citation

@inproceedings{choi2024lota,
  title={LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents},
  author={Choi, Jae-Woo and Yoon, Youngwoo and Ong, Hyobin and Kim, Jaehong and Jang, Minsu},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2024}
}
