Jiajian Li* · Qi Wang* · Yunbo Wang · Xin Jin · Yang Li · Wenjun Zeng · Xiaokang Yang
⚡ Quick Start | 📥 Checkpoints Download | 📝 Citation
Training visual reinforcement learning agents in a high-dimensional open world presents significant challenges. While various model-based methods have improved sample efficiency by learning interactive world models, these agents tend to be "short-sighted", as they are typically trained on short snippets of imagined experiences. We argue that the primary challenge in open-world decision-making is improving the exploration efficiency across a vast state space, especially for tasks that demand consideration of long-horizon payoffs. In this paper, we present LS-Imagine, which extends the imagination horizon within a limited number of state transition steps, enabling the agent to explore behaviors that potentially lead to promising long-term feedback. The foundation of our approach is to build a long short-term world model. To achieve this, we simulate goal-conditioned jumpy state transitions and compute corresponding affordance maps by zooming in on specific areas within single images. This facilitates the integration of direct long-term values into behavior learning. Our method demonstrates significant improvements over state-of-the-art techniques in MineDojo.
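To give a rough intuition before the setup instructions, the sketch below illustrates, in heavily simplified form, how an imagined rollout might interleave ordinary one-step transitions with goal-conditioned jumpy transitions. This is a conceptual sketch only, not the actual LS-Imagine implementation; all names (`world_model.step`, `world_model.jump`, `jump_prob`) are illustrative assumptions.

```python
# Conceptual sketch only -- not the official LS-Imagine code.
# Assumed interfaces: world_model.step / world_model.jump / world_model.jump_prob.
def imagine_rollout(world_model, policy, state, horizon):
    """Roll out imagined latent states, occasionally taking a jumpy
    (long-term) transition toward a promising region instead of a
    single one-step transition."""
    trajectory = [state]
    for _ in range(horizon):
        action = policy(state)
        # Assumption: the world model predicts whether a long-horizon jump is
        # worthwhile from the current state (e.g. guided by an affordance map
        # that highlights a distant goal region in the observation).
        if world_model.jump_prob(state) > 0.5:
            state = world_model.jump(state)          # goal-conditioned jumpy transition
        else:
            state = world_model.step(state, action)  # ordinary short-term transition
        trajectory.append(state)
    return trajectory
```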
LS-Imagine is implemented and tested on Ubuntu 20.04 with python==3.9:
- Create an environment:

  ```bash
  conda create -n ls_imagine python=3.9
  conda activate ls_imagine
  ```

- Install Java: JDK `1.8.0_171`. Then install the MineDojo environment and MineCLIP following their official documents. During the installation of MineDojo, various errors may occur.
> **Note:** We provide the detailed installation process and solutions to common errors; please refer to here.
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Download the MineCLIP weight here and place it at `./weights/mineclip_attn.pth`.
- We provide two options for recording data during the training process: TensorBoard and Weights & Biases (wandb).
  - To use TensorBoard, set `use_wandb` to `False` in the `./config.yaml` file.
  - To use wandb (optional), set `use_wandb` to `True` in the `./config.yaml` file. Additionally, retrieve your wandb API key and set it in the `./config.yaml` file under the field `wandb_key: {your_wandb_api_key}`.
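For reference, the relevant part of `./config.yaml` might look like the sketch below. Only the `use_wandb` and `wandb_key` fields are described in this README; their exact placement within the file is an assumption.

```yaml
# Illustrative excerpt only; the surrounding structure of config.yaml may differ.
use_wandb: True                     # set to False to log with TensorBoard instead
wandb_key: {your_wandb_api_key}     # replace the placeholder with your actual API key
```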
We provide pretrained weights of LS-Imagine for the tasks mentioned in the paper. You can download them using the links in the table below and rename the downloaded file to `latest.pt`:

| Task Name | Weight File |
|---|---|
| harvest_log_in_plains | latest_log.pt |
| harvest_water_with_bucket | latest_water.pt |
| harvest_sand | latest_sand.pt |
| mine_iron_ore | latest_iron.pt |
| shear_sheep | latest_wool.pt |
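For example, after downloading the `harvest_log_in_plains` checkpoint (the destination directory below is arbitrary; use whatever path you later pass to the test script):

```bash
# Example only: the checkpoint can live anywhere, as long as it is named latest.pt.
mkdir -p ./checkpoints/harvest_log_in_plains
mv latest_log.pt ./checkpoints/harvest_log_in_plains/latest.pt
```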
To start an evaluation run from one of these checkpoints:

- Set up the task for evaluation (instructions here).

- Run the following command to test the success rate:

  ```bash
  sh ./scripts/test.sh /path/to/latest.pt 100 test_harvest_log_in_plains
  ```
LS-Imagine mainly consists of two stages: (1) fine-tuning a multimodal U-Net for generating affordance maps, and (2) learning the world model and behaviors.
You can either set up custom tasks in MineDojo (instructions here) or use the task setups mentioned in our paper. LS-Imagine allows you to start from any stage of the pipeline, as we provide corresponding checkpoint files for each stage to ensure flexibility.
- Download the pretrained U-Net weights from here and save them to `./affordance_map/pretrained_unet_checkpoint/swin_unet_checkpoint.pth`.

- Set up the task (instructions here) and run the following command to collect data:

  ```bash
  sh ./scripts/collect.sh your_task_name
  ```

- Annotate the collected data using a method based on sliding bounding box scanning and simulated exploration to generate the fine-tuning dataset (see the usage sketch after this list):

  ```bash
  sh ./scripts/affordance.sh your_task_name your_prompt
  ```

- Fine-tune the pretrained U-Net weights on the annotated dataset to generate task-specific affordance maps:

  ```bash
  sh ./scripts/finetune_unet.sh your_task_name
  ```

- After training, the fine-tuned multimodal U-Net weights for the specified task will be saved in `./affordance_map/model_out`.
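As a concrete illustration of the annotation step above, a run for the tree-chopping task might look like the following; the prompt string here is hypothetical, so check the task setup instructions for the prompt actually used for each task.

```bash
# Hypothetical example: task name taken from the tables in this README,
# prompt string invented for illustration only.
sh ./scripts/affordance.sh harvest_log_in_plains "chop a tree"
```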
Before starting the learning process for the world model and behavior, ensure you have obtained the multimodal U-Net weights. We provide the pretrained U-Net weights (link here) and the task-specific fine-tuned U-Net weights:
| Task Name | Weight File |
|---|---|
| harvest_log_in_plains | swin_unet_checkpoint_log.pth |
| harvest_water_with_bucket | swin_unet_checkpoint_water.pth |
| harvest_sand | swin_unet_checkpoint_sand.pth |
| mine_iron_ore | swin_unet_checkpoint_iron.pth |
| shear_sheep | swin_unet_checkpoint_wool.pth |

You can download these weights using the links provided in the table above and place them at `./affordance_map/finetune_unet/finetune_checkpoints/{task_name}/swin_unet_checkpoint.pth`.
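For instance, for the `harvest_log_in_plains` task, placing the fine-tuned weights would look like this (the downloaded `.pth` file is assumed to be in the current directory):

```bash
# Move the task-specific U-Net weights into the expected location, renaming the file.
mkdir -p ./affordance_map/finetune_unet/finetune_checkpoints/harvest_log_in_plains
mv swin_unet_checkpoint_log.pth \
   ./affordance_map/finetune_unet/finetune_checkpoints/harvest_log_in_plains/swin_unet_checkpoint.pth
```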
- Set up the task and correctly configure `unet_checkpoint_dir` to ensure the U-Net weights are properly located and loaded (instructions here; see the configuration sketch after this list).

- Run the following command to start training the world model and behavior:

  ```bash
  sh ./scripts/train.sh your_task_name
  ```
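A minimal sketch of the `unet_checkpoint_dir` setting, assuming it is set in `./config.yaml` and points to the directory containing the task-specific `swin_unet_checkpoint.pth` (both are assumptions; follow the linked instructions for the authoritative configuration):

```yaml
# Illustrative assumption: the exact key location and expected value may differ.
unet_checkpoint_dir: ./affordance_map/finetune_unet/finetune_checkpoints/harvest_log_in_plains
```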
After training completes, the agent's weight file `latest.pt` will be saved in the `./logdir` directory. You can then evaluate the performance of LS-Imagine as described here.
If you find this repo useful, please cite our paper:
```bibtex
@inproceedings{li2025open,
  title={Open-World Reinforcement Learning over Long Short-Term Imagination},
  author={Jiajian Li and Qi Wang and Yunbo Wang and Xin Jin and Yang Li and Wenjun Zeng and Xiaokang Yang},
  booktitle={ICLR},
  year={2025}
}
```
The code refers to the implementations of dreamerv3-torch and Swin-Unet. Thanks to the authors!