Accepted to ECCV 2022
Data privacy is a central problem for embodied agents that can perceive the environment, communicate with humans, and act in the real world. While helping humans complete tasks, the agent may observe and process sensitive information of users, such as house environments, human activities, etc. In this work, we introduce privacy-preserving embodied agent learning for the task of Vision-and-Language Navigation (VLN), where an embodied agent navigates house environments by following natural language instructions. We view each house environment as a local client, which shares nothing other than local model updates with the cloud server and other clients, and propose a novel Federated Vision-and-Language Navigation (FedVLN) framework to protect data privacy during both training and pre-exploration. Specifically, we propose a decentralized federated training strategy that limits the data of each client to its local model training, and a federated pre-exploration method that performs partial model aggregation to improve model generalizability to unseen environments. Extensive results on the R2R and RxR datasets show that decentralized federated training achieves comparable results to centralized training while protecting seen-environment privacy, and federated pre-exploration significantly outperforms centralized pre-exploration while preserving unseen-environment privacy.
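Concretely, in the decentralized federated training described above, each house environment keeps its data local, trains its own copy of the agent, and uploads only model updates; the server merges these updates by weighted averaging, optionally scaled by a global learning rate. The following is a minimal, hypothetical sketch of that aggregation step (assuming PyTorch-style models and state dicts, with made-up helper names), not the repository's actual implementation:

import copy

def fedavg_aggregate(global_model, client_states, client_weights, global_lr=1.0):
    # client_states: state_dicts returned by clients after local training
    # client_weights: relative local data sizes, summing to 1
    # global_lr: server-side step size applied to the averaged update
    global_state = copy.deepcopy(global_model.state_dict())
    for key, value in global_state.items():
        if not value.is_floating_point():
            continue  # skip integer buffers such as BatchNorm counters
        delta = sum(w * (cs[key] - value) for cs, w in zip(client_states, client_weights))
        global_state[key] = value + global_lr * delta
    global_model.load_state_dict(global_state)
    return global_model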
We release the reproducible code here.
Python requirements: Python 3.6 is required.
pip install -r python_requirements.txt
Please refer to this link to install the Matterport3D simulator:
Download the image features of the environments for the EnvDrop model:
mkdir img_features
wget https://www.dropbox.com/s/o57kxh2mn5rkx4o/ResNet-152-imagenet.zip -P img_features/
cd img_features
unzip ResNet-152-imagenet.zip
Please download the CLIP-ViT features for CLIP-ViL models with this link:
wget https://nlp.cs.unc.edu/data/vln_clip/features/CLIP-ViT-B-32-views.tsv -P img_features
Please download the pre-processed RxR data with the following link:
wget https://nlp.cs.unc.edu/data/vln_clip/RxR.zip -P tasks
unzip tasks/RxR.zip -d tasks/
To train the federated CLIP-ViL agent on the RxR dataset, please run:
name=agent_rxr_en_clip_vit_fedavg_new_glr2
flag="--attn soft --train listener
--featdropout 0.3
--angleFeatSize 128
--language en
--maxInput 160
--features img_features/CLIP-ViT-B-32-views.tsv
--feature_size 512
--feedback sample
--mlWeight 0.4
--subout max --dropout 0.5 --optim rms --lr 1e-4 --iters 400000 --maxAction 35
--if_fed True
--fed_alg fedavg
--global_lr 2
--comm_round 910
--local_epoches 5
--n_parties 60
"
mkdir -p snap/$name
CUDA_VISIBLE_DEVICES=2 python3 rxr_src/train.py $flag --name $name
Or you can simply run a script with the same content as above (we will use such scripts in the following); a short sketch of the federated training loop these flags configure follows the commands below:
bash run/agent_rxr_clip_vit_en_fedavg.bash
To train the RxR agent with ResNet-152 features instead, run:
bash agent_rxr_resnet152_fedavg.bash
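The federated flags in the command above map onto an outer loop roughly like the following sketch (hypothetical helper names and client attributes; the actual loop lives in rxr_src/train.py): the server runs --comm_round communication rounds, samples some or all of the --n_parties clients (house environments), lets each sampled client train for --local_epoches epochs on its own data, and merges the resulting updates with the server-side step size --global_lr, e.g. via the fedavg_aggregate sketch above.

import copy
import random

def run_federated_training(global_model, clients, local_train_fn, aggregate_fn,
                           comm_round, local_epoches, global_lr, sample_size=None):
    # clients: one object per house environment, each holding only its own data
    # local_train_fn(model, client, epochs): standard VLN agent training on one client
    # aggregate_fn: e.g. the fedavg_aggregate sketch above
    sample_size = sample_size or len(clients)
    for _ in range(comm_round):
        sampled = random.sample(clients, sample_size)
        total = sum(c.num_examples for c in sampled)
        client_states, client_weights = [], []
        for client in sampled:
            local_model = copy.deepcopy(global_model)           # start from current global weights
            local_train_fn(local_model, client, local_epoches)  # raw data never leaves the client
            client_states.append(local_model.state_dict())
            client_weights.append(client.num_examples / total)
        aggregate_fn(global_model, client_states, client_weights, global_lr)
    return global_model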
To train agents on the R2R dataset, first download the Room-to-Room navigation data:
bash ./tasks/R2R/data/download.sh
- Train the agent
Run the script:
bash run/agent_clip_vit_fedavg.bash
It will train the agent and save the snapshot under snap/agent/. Note that we tried a global learning rate scheduler, which may help training. The unseen success rate should be around 53%.
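The global learning rate scheduler mentioned above is not enabled in the released script; purely as an illustration (hypothetical function, made-up decay shape), such a server-side schedule could look like:

def global_lr_schedule(round_idx, total_rounds, base_lr, min_lr):
    # linearly decay the server-side (global) learning rate over communication rounds
    frac = round_idx / max(1, total_rounds - 1)
    return base_lr + frac * (min_lr - base_lr)

# e.g. pass global_lr=global_lr_schedule(rnd, comm_round, base_lr, min_lr) into the aggregation each round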
- Train the speaker
bash run/speaker_clip_vit_fedavg.bash
It will train the speaker and save the snapshot under snap/speaker/.
- Augmented training
After pre-training the speaker and the agent, run
bash run/bt_envdrop_clip_vit_fedavg.bash
It will load the pre-trained agent and train on augmented data with environmental dropout.
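Environmental dropout, from the EnvDrop approach this script builds on, differs from ordinary feature dropout in that the same feature channels are dropped for every viewpoint of an environment, so the agent effectively trains in a consistently altered "new" environment. A rough, hypothetical sketch of the idea (illustrative function name and drop rate; not the repository's implementation):

import torch

def environmental_dropout(env_features, drop_rate=0.3):
    # env_features: (num_viewpoints, num_views, feature_dim) features of one environment
    # one channel mask is shared by all viewpoints/views, scaled like standard dropout
    feature_dim = env_features.shape[-1]
    mask = (torch.rand(feature_dim) > drop_rate).float() / (1.0 - drop_rate)
    return env_features * mask  # broadcast the same mask across the whole environment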
For the ResNet-152 features, the corresponding R2R scripts are:
- Agent
bash run/agent_fedavg.bash
- Fed Speaker + Aug training
bash run/speaker_fedavg.bash
bash run/bt_envdrop_fedavg.bash
After training the CLIP-ViL speaker, run the pre-exploration script:
bash run/pre_explore_clip_vit_fedavg.bash
After training the ResNet-152 speaker, run the pre-exploration script:
bash run/pre_explore_fedavg.bash
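In federated pre-exploration, only part of the agent is aggregated across environments while the remaining parameters stay local to each client (the partial model aggregation mentioned in the abstract). Below is one hypothetical way such partial aggregation could be written; selecting the shared part by parameter-name prefix, and which submodule is actually shared, are assumptions for illustration rather than details taken from the code:

def partial_aggregate(global_model, client_states, shared_prefixes, global_lr=1.0):
    # only parameters whose names start with one of shared_prefixes are averaged;
    # everything else stays on the client and is never uploaded
    global_state = global_model.state_dict()
    for key, value in global_state.items():
        if not key.startswith(tuple(shared_prefixes)) or not value.is_floating_point():
            continue
        mean_delta = sum(cs[key] - value for cs in client_states) / len(client_states)
        global_state[key] = value + global_lr * mean_delta
    global_model.load_state_dict(global_state)
    return global_model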
If you use FedVLN in your research or wish to refer to the baseline results published here, please use the following BibTeX entry.
@article{zhou2022fedvln,
  title={FedVLN: Privacy-preserving Federated Vision-and-Language Navigation},
  author={Zhou, Kaiwen and Wang, Xin Eric},
  journal={arXiv preprint arXiv:2203.14936},
  year={2022}
}