Toolbox for querying vision-language models to guide task and motion planning

This is a guide to running the example scripts for the paper "Guiding Long-Horizon Task and Motion Planning with Vision Language Models" by Yang et al. See the project page or arXiv.

Set up dependencies

First, follow the main README in kitchen-worlds to install the dependencies.

Then run the following:

conda activate kitchen
pip install pddlgym anytree
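
As an optional sanity check, you can confirm the extra packages are importable in the activated environment:

python -c "import pddlgym, anytree; print('extra dependencies OK')"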

Set up pretrained VLM APIs

Set your OpenAI API key or Anthropic API key as an environment variable (e.g., in ~/.bashrc or ~/.zshrc):

export OPENAI_API_KEY=<openai_api_key>
export ANTHROPIC_API_KEY=<anthropic_api_key>

Alternatively, put them in text files inside the keys directory, which is ignored by git:

cd vlm_tools; mkdir keys
echo <openai_api_key> > keys/openai_api_key.txt
echo <anthropic_api_key> > keys/anthropic_api_key.txt
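
To check which key the scripts will see, you can run a quick check from inside vlm_tools/. This is a minimal sketch that assumes an environment variable takes precedence over the key file; the actual lookup order used by the toolbox may differ, so treat it as illustrative only:

# print the key from the environment if set, otherwise fall back to the text file
echo "${OPENAI_API_KEY:-$(cat keys/openai_api_key.txt 2>/dev/null)}"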

Run an example solving open-world kitchen problems

Ask GPT-4V to break down a high-level goal into a sequence of subgoals, then give them to PDDLStream one at a time.

cd pybullet_planning
python tutorials/test_vlm_tamp.py \
    --open_goal "make chicken soup" \
    --exp_subdir "test_fun" \
    --planning_mode "sequence"

Common args to the script (a combined example follows this list):

  • --open_goal: Natural-language description of the goal
  • --exp_subdir: Output will be saved in experiments/{exp_subdir}/{auto_datetime}_vlm-tamp/
  • --problem_name: Name of a Python class in vlm_tools/problems_vlm_tamp.py that initializes the scene and problem; it creates all objects required to solve the given open goal
  • --difficulty: Difficulty level of the task, which the scene builder function uses to determine how many movable and articulated obstacles to add, default=0
  • --dual_arm: action='store_true'; if set, use both arms of the PR2 robot instead of a single arm
  • --planning_mode: How the VLM output guides planning, e.g. 'sequence' gives the predicted subgoals to PDDLStream one at a time, choices=['sequence', 'actions', 'sequence-reprompt', 'actions-reprompt']
  • --load_llm_memory: Subpath inside kitchen-worlds/experiments/, in the format {exp_subdir}/{auto_datetime}_vlm-tamp/, where previously generated VLM responses are saved, e.g. test_run_vlm_tamp_pr2_chicken_soup/241106_212402_vlm-tamp
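
For example, a run combining several of these options might look like the following; the goal, subdirectory, difficulty level, and memory path are placeholders rather than values verified against the codebase:

cd pybullet_planning
python tutorials/test_vlm_tamp.py \
    --open_goal "make chicken soup" \
    --exp_subdir "test_fun_dual" \
    --difficulty 1 \
    --dual_arm \
    --planning_mode "sequence-reprompt" \
    --load_llm_memory "test_run_vlm_tamp_pr2_chicken_soup/241106_212402_vlm-tamp"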

The output logs of previous runs can be viewed at http://0.0.0.0:9000/ by running the following in a different terminal.

(cd experiments/; python -m http.server 9000)

After the server is launched, the log of the last run can be viewed at http://0.0.0.0:9000/latest_run/log/.

References

Please cite the following paper if you use this code in your research:

@misc{yang2024guidinglonghorizontaskmotion,
      title={Guiding Long-Horizon Task and Motion Planning with Vision Language Models}, 
      author={Zhutian Yang and Caelan Garrett and Dieter Fox and Tomás Lozano-Pérez and Leslie Pack Kaelbling},
      year={2024},
      eprint={2410.02193},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2410.02193}, 
}