This repository is the official implementation of *Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy*.
Generalizing language-conditioned robotic policies to new tasks remains a significant challenge, hampered by the lack of suitable simulation benchmarks. In this paper, we address this gap by introducing GemBench, a novel benchmark to assess generalization capabilities of vision-language robotic manipulation policies. As illustrated in the figure below, GemBench incorporates seven general action primitives and four levels of generalization, spanning novel placements, rigid and articulated objects, and complex long-horizon tasks.
We evaluate state-of-the-art approaches on GemBench and also introduce a new method. Our approach, 3D-LOTUS, leverages rich 3D information for action prediction conditioned on language. While 3D-LOTUS excels in both efficiency and performance on seen tasks, it struggles with novel tasks. To address this, we present 3D-LOTUS++ (see figure below), a framework that integrates the motion planning capabilities of 3D-LOTUS with the task planning capabilities of LLMs and the object grounding accuracy of VLMs. 3D-LOTUS++ achieves state-of-the-art performance on the novel tasks of GemBench, setting a new standard for generalization in robotic manipulation.
See INSTALL.md for detailed installation instructions.
The dataset can be downloaded from Dropbox. Put the dataset in the data/gembench folder.
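For example, assuming the Dropbox share provides one archive per split (the archive names below are illustrative):

```bash
mkdir -p data/gembench
cd data/gembench
# unzip the downloaded archives (names are illustrative; adjust to
# however the Dropbox share packages the files)
unzip train_dataset.zip
unzip val_dataset.zip
unzip test_dataset.zip
```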
Dataset structure is as follows:
- data
    - gembench
        - train_dataset
            - microsteps: 567M, initial configurations for each episode
            - keysteps_bbox: 160G, extracted keysteps data
            - keysteps_bbox_pcd: (used to train 3D-LOTUS)
                - voxel1m: 10G, processed point clouds
                - instr_embeds_clip.npy: instructions encoded by CLIP text encoder
            - motion_keysteps_bbox_pcd: (used to train 3D-LOTUS++ motion planner)
                - voxel1m: 2.8G, processed point clouds
                - action_embeds_clip.npy: action names encoded by CLIP text encoder
        - val_dataset
            - microsteps: 110M, initial configurations for each episode
            - keysteps_bbox_pcd:
                - voxel1m: 941M, processed point clouds
        - test_dataset
            - microsteps: 2.2G, initial configurations for each episode
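As a quick sanity check, the .npy embedding files can be inspected with NumPy. A minimal sketch, assuming the file stores a pickled dictionary mapping instruction strings to CLIP embedding arrays (the actual layout may differ):

```python
import numpy as np

# Load the CLIP-encoded instruction embeddings.
# Assumption: the .npy file stores a pickled dict mapping instruction
# strings to embedding arrays; check the dataloader code for the real layout.
instr_embeds = np.load(
    "data/gembench/train_dataset/keysteps_bbox_pcd/instr_embeds_clip.npy",
    allow_pickle=True,
).item()

print(len(instr_embeds), "instructions")
first_instr = next(iter(instr_embeds))
print(first_instr, np.asarray(instr_embeds[first_instr]).shape)
```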
The RLBench-18task dataset (PerAct) can be downloaded here; it follows the same structure as gembench.
Train the 3D-LOTUS policy end-to-end on the GemBench train split. Training takes about 14 hours on a single A100 GPU.
sbatch job_scripts/train_3dlotus_policy.sh
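The sbatch command submits the job to a SLURM cluster. If you are not on SLURM, the same script can usually be run directly with bash (assuming you export the environment variables that its SLURM header would otherwise set):

```bash
bash job_scripts/train_3dlotus_policy.sh
```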
The trained checkpoints are available here. You should put them in the folder data/experiments/gembench/3dlotus/v1.
# both validation and test splits
sbatch job_scripts/eval_3dlotus_policy.sh
The evaluation script evaluates the 3D-LOTUS policy on the validation (seed100) and test splits of the GemBench benchmark.
It skips any task whose results are already saved in data/experiments/gembench/3dlotus/v1/preds/, so clean that directory if you want to re-evaluate a task you have already evaluated.
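For example, to re-run everything from scratch (the per-task layout inside preds/ may also allow finer-grained cleanup):

```bash
# clear all cached evaluation results so the next eval run starts fresh
rm -rf data/experiments/gembench/3dlotus/v1/preds/
```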
We use the validation set to select the best checkpoint. The following script summarizes results on the validation split.
python scripts/summarize_val_results.py data/experiments/gembench/3dlotus/v1/preds/seed100/results.jsonl
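results.jsonl is a JSON-lines file with one record per evaluated episode. A minimal sketch of parsing it yourself, assuming each record carries fields like "task" and "sr" (the field names are assumptions; scripts/summarize_val_results.py is authoritative):

```python
import json
from collections import defaultdict

# Aggregate per-task success rates from the JSON-lines results file.
# Assumption: each line is a JSON object with "task" and "sr" fields.
per_task = defaultdict(list)
with open("data/experiments/gembench/3dlotus/v1/preds/seed100/results.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        per_task[rec["task"]].append(rec["sr"])

for task, srs in sorted(per_task.items()):
    print(f"{task}: {sum(srs) / len(srs):.2%}")
```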
The following script summarizes results on the test splits of the four generalization levels (here 150000 is the selected checkpoint step):
python scripts/summarize_tst_results.py data/experiments/gembench/3dlotus/v1/preds 150000
Train and evaluate the 3D-LOTUS policy on the RLBench-18task (PerAct) benchmark with:
sbatch job_scripts/train_3dlotus_policy_peract.sh
sbatch job_scripts/eval_3dlotus_policy_peract.sh
The trained checkpoints are available here. You should put them in the folder data/experiments/peract/3dlotus/v1.
Download the Llama3-8B model following the instructions here, and modify the configuration path in genrobo3d/configs/rlbench/robot_pipeline.yaml accordingly.
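A hypothetical sketch of the edit, assuming the YAML exposes a model-path field for the LLM (the key names below are illustrative, not the file's actual schema; check robot_pipeline.yaml for the real ones):

```yaml
# genrobo3d/configs/rlbench/robot_pipeline.yaml (illustrative excerpt)
llm_planner:
  model_path: /path/to/Meta-Llama-3-8B   # point this at your downloaded checkpoint
```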
Train the 3D-LOTUS++ motion planning policy on the GemBench train split. Training takes about 14 hours on a single A100 GPU.
sbatch job_scripts/train_3dlotusplus_motion_planner.sh
The trained checkpoints are available here. You should put them in the folder data/experiments/gembench/3dlotusplus/v1.
We have three evaluation modes:
- groundtruth task planner + groundtruth object grounding + automatic motion planner
- groundtruth task planner + automatic object grounding + automatic motion planner
- automatic task planner + automatic object grounding + automatic motion planner
See comments in the following scripts:
# both validation and test splits
sbatch job_scripts/eval_3dlotusplus_policy.sh
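A hypothetical sketch of switching between the three modes, assuming the script reads environment variables (the variable names below are illustrative, not the script's actual interface; the real options are documented as comments inside eval_3dlotusplus_policy.sh):

```bash
# Illustrative only: GT_TASK_PLANNER / GT_OBJECT_GROUNDING are hypothetical
# names; see the comments in eval_3dlotusplus_policy.sh for the real flags.
GT_TASK_PLANNER=false GT_OBJECT_GROUNDING=false \
    sbatch job_scripts/eval_3dlotusplus_policy.sh   # fully automatic pipeline
```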
If you use our GemBench benchmark or find our code helpful, please cite our work:
@inproceedings{garcia24gembench,
  author    = {Ricardo Garcia and Shizhe Chen and Cordelia Schmid},
  title     = {Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy},
  booktitle = {preprint},
  year      = {2024}
}