Code for the ICLR 2023 paper "Understanding Embodied Reference with Touch-Line Transformer".
Authors: Yang Li, Xiaoxue Chen, Hao Zhao, Jiangtao Gong, Guyue Zhou, Federico Rossano, Yixin Zhu
Project_NAME/
├── process_masks_and_images_for_MAT.ipynb
├── main_ref.py
├── pretrained/
│   ├── 20_query_model.pth
│   ├── best_etf.pth
│   ├── best_arm.pth
│   ├── best_np.pth
│   └── best_ip.pth
├── predictions/
│   ├── arm.csv
│   ├── eye-to-fingertip.csv
│   ├── inpaint.csv
│   └── no_pose.csv
└── yourefit/
    ├── images/
    ├── pickle/
    ├── paf/
    ├── saliency/
    ├── inpaint_Place_using_expanded_masks/
    ├── eye_to_fingertip/
    │   ├── eye_to_fingertip_annotations_train.csv
    │   ├── eye_to_fingertip_annotations_valid.csv
    │   ├── train_names.txt
    │   └── valid_names.txt.txt
    └── arm/
pretrained: a directory that contains checkpoints.
pretrained/20_query_model.pth: a checkpoint obtained by slicing the checkpoint provided by the authors of MDETR from 100 queries down to 20 queries (a sketch of this slicing is shown after these notes).
yourefit: a directory that contains the downloaded YouRefIt dataset. This directory will also contain inpaintings produced by readers. (Refer to the "inpainting" section for how to produce inpaintings.)
yourefit/eye_to_fingertip: a directory containing annotations for eyes and fingertips.
yourefit/arm: a directory containing annotations for arms.
models/mdetr.py
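Readers do not need to redo the query slicing themselves (the sliced checkpoint is provided), but for reference, a minimal sketch of how such a slicing could be done is shown below. It assumes the DETR/MDETR convention of a "model" state dict containing a "query_embed.weight" tensor with one row per query; the exact key names and checkpoint file name are assumptions and may differ.

```python
import torch

# Minimal sketch (assumptions: the MDETR checkpoint stores its weights under a
# "model" key, and the learned object queries live in "query_embed.weight").
ckpt = torch.load("mdetr_checkpoint.pth", map_location="cpu")
state_dict = ckpt["model"]

# Keep only the first 20 of the original 100 object queries.
state_dict["query_embed.weight"] = state_dict["query_embed.weight"][:20]

ckpt["model"] = state_dict
torch.save(ckpt, "pretrained/20_query_model.pth")
```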
conda create --name nvvc python=3.8
conda activate nvvc
pip install -r requirements.txt
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
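Optionally, a quick sanity check (a minimal sketch, not part of the repo) can confirm that the CUDA-enabled PyTorch build is visible inside the new environment:

```python
# Run inside the "nvvc" environment to confirm the installation.
import torch

print(torch.__version__)          # should be a CUDA build matching cudatoolkit 11.3
print(torch.cuda.is_available())  # should print True on a machine with a working GPU driver
```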
- Download the YouRefIt images and annotations as yourefit.zip.
- Unzip yourefit.zip outside of this project to get a folder named "yourefit".
- Move or copy the "images", "pickle", "paf", and "saliency" folders from that external "yourefit" folder into the existing "yourefit" folder inside this project (see the sketch below for one way to do this).
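The copy step can be done with a file manager; as an alternative, here is a minimal Python sketch. The source path is an assumption and must point at wherever yourefit.zip was actually extracted.

```python
import shutil
from pathlib import Path

# Assumption: the extracted dataset sits next to (outside of) this project.
src_root = Path("../yourefit")   # folder produced by unzipping yourefit.zip
dst_root = Path("yourefit")      # the "yourefit" folder inside this project

for name in ["images", "pickle", "paf", "saliency"]:
    # Requires Python 3.8+ for dirs_exist_ok.
    shutil.copytree(src_root / name, dst_root / name, dirs_exist_ok=True)
```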
Download the checkpoints via the hyperlinks in the checkpoint column of the table below, and put them into the directory named "pretrained" under the project root (refer to the "project structure" section above).
Model | precision: IoU=0.25 | precision: IoU=0.50 | precision: IoU=0.75 | checkpoint
---|---|---|---|---
eye + fingertip | 0.7002 | 0.6251 | 0.3821 | best_etf.pth
elbow joint + wrist | 0.6787 | 0.5971 | 0.3477 | best_arm.pth
no explicit pose | 0.6371 | 0.5651 | 0.3621 | best_np.pth
inpainting | 0.5787 | 0.5092 | 0.3141 | best_ip.pth
MDETR | - | - | - | 20_query_model.pth
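For reference, the precision columns count a prediction as correct when the IoU between the predicted and ground-truth boxes is at least the stated threshold. The repo's evaluation code is the authoritative implementation; the snippet below is only a minimal sketch of box IoU for (x1, y1, x2, y2) boxes.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# A prediction counts toward "precision: IoU=0.25" when box_iou(pred, gt) >= 0.25.
```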
We provide Jupyter notebooks to expand the human masks required for inpainting. Readers need to generate the human masks themselves using F-RCNN, because the YouRefIt dataset does not include human masks. The mask generation process is straightforward; readers can refer to the GitHub repo created by the authors of F-RCNN for how to generate human masks. We only provide notebooks to expand and resize masks. Download the notebook by clicking the hyperlink below.
process_masks_and_images_for_MAT.ipynb
After processing the masks with the notebook, readers may or may not need to flip the values in the output masks (e.g., change 255 to 0 and 0 to 255), depending on how the human masks were generated with F-RCNN. After that, feed the masks and images to the MAT model for inpainting.
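If the flip turns out to be necessary, it is simple; below is a minimal sketch assuming single-channel 0/255 PNG masks (the directory names are placeholders):

```python
import numpy as np
from PIL import Image
from pathlib import Path

mask_dir = Path("expanded_masks")   # placeholder: masks produced by the notebook
out_dir = Path("flipped_masks")     # placeholder: where the flipped masks go
out_dir.mkdir(exist_ok=True)

for path in mask_dir.glob("*.png"):
    mask = np.array(Image.open(path).convert("L"))
    flipped = 255 - mask            # swap 255 <-> 0
    Image.fromarray(flipped.astype(np.uint8)).save(out_dir / path.name)
```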
After inpainting, readers may need to resize the inpaintings back to the sizes of the original images, because the inputs and outputs of MAT are squares. If readers reshaped the expanded masks to squares (instead of padding them) before feeding them into MAT, readers need to reshape the MAT outputs back to the original sizes. In contrast, if readers chose to pad, they can instead crop the MAT outputs. We only provide a notebook that reshapes square outputs back to the sizes of the original images.
restore_inpaint_size.ipynb
Note that readers need to modify image_dir, inpaint_dir, and output_dir in the notebook above: image_dir is the path to the YouRefIt images (the shapes of the original images are read from here); inpaint_dir is the path to the MAT outputs; output_dir is the path where the notebook stores the inpainted images after reshaping them to the sizes of the original images.
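The notebook is the reference for this step; the snippet below is only a minimal sketch of the same idea using the three directories described above. It assumes each MAT output shares the file stem of the corresponding original image; the inpaint_dir and output_dir values are placeholders.

```python
from pathlib import Path
from PIL import Image

image_dir = Path("yourefit/images")     # original YouRefIt images (their sizes are read here)
inpaint_dir = Path("MAT_outputs")       # placeholder: square images produced by MAT
output_dir = Path("restored_inpaint")   # placeholder: resized inpaintings go here
output_dir.mkdir(parents=True, exist_ok=True)

for inpaint_path in sorted(inpaint_dir.iterdir()):
    if inpaint_path.suffix.lower() not in {".png", ".jpg", ".jpeg"}:
        continue
    matches = list(image_dir.glob(inpaint_path.stem + ".*"))
    if not matches:
        continue  # no original image with the same stem
    original_size = Image.open(matches[0]).size
    Image.open(inpaint_path).resize(original_size).save(output_dir / inpaint_path.name)
```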
Finally, after obtaining the inpaintings, change INPAINT_DIR in magic_numbers.py to the path of the inpainted images that were reshaped to the sizes of the original images. Note that INPAINT_DIR is a relative path (relative to Project_NAME/yourefit; please refer to the project structure section).
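For example, if the resized inpaintings were placed in yourefit/inpaint_Place_using_expanded_masks (the directory shown in the project structure), the corresponding line in magic_numbers.py would look roughly like this; the exact value depends on where readers actually stored their inpaintings.

```python
# In magic_numbers.py (path is relative to Project_NAME/yourefit/):
INPAINT_DIR = "inpaint_Place_using_expanded_masks"
```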
Use the unmodified magic_numbers.py and run:
python main_ref.py --num_workers=1 --dataset_config configs/yourefit.json --batch_size 1 --ema --text_encoder_lr 1e-4 --lr 5e-5 --output-dir 'output_dir/eval' --resume pretrained/best_etf.pth --eval
Before running, start from the unmodified magic_numbers.py and set:
REPLACE_ARM_WITH_EYE_TO_FINGERTIP = False
python main_ref.py --num_workers=1 --dataset_config configs/yourefit.json --batch_size 1 --ema --text_encoder_lr 1e-4 --lr 5e-5 --output-dir 'output_dir/eval' --resume pretrained/best_arm.pth --eval
Before running, start from the unmodified magic_numbers.py and set:
RESERVE_QUERIES_FOR_ARMS = False
NUM_RESERVED_QUERIES_FOR_ARMS = 0
python main_ref.py --num_workers=1 --dataset_config configs/yourefit.json --batch_size 1 --ema --text_encoder_lr 1e-4 --lr 5e-5 --output-dir 'output_dir/eval' --resume pretrained/best_np.pth --eval --pose False
(Requires generated inpaintings; see the optional "generate inpaintings" section under "reproduction".)
Before running, start from the unmodified magic_numbers.py and set:
RESERVE_QUERIES_FOR_ARMS = False
NUM_RESERVED_QUERIES_FOR_ARMS = 0
REPLACE_IMAGES_WITH_INPAINT = True
python main_ref.py --num_workers=1 --dataset_config configs/yourefit.json --batch_size 1 --ema --text_encoder_lr 1e-4 --lr 5e-5 --output-dir 'output_dir/eval' --resume pretrained/best_ip.pth --eval --pose False
Use the unmodified magic_numbers.py and run:
python -m torch.distributed.launch --nproc_per_node=8 --master_port 64331 --use_env main_ref.py --num_workers 8 --dataset_config configs/yourefit.json --batch_size 7 --ema --text_encoder_lr 1e-4 --lr 5e-5 --output-dir 'output_dir/debug_etf' --load pretrained/20_query_model.pth
Before running, start from the unmodified magic_numbers.py and set:
REPLACE_ARM_WITH_EYE_TO_FINGERTIP = False
python -m torch.distributed.launch --nproc_per_node=8 --master_port 64332 --use_env main_ref.py --num_workers 8 --dataset_config configs/yourefit.json --batch_size 7 --ema --text_encoder_lr 1e-4 --lr 5e-5 --output-dir 'output_dir/debug_arm' --load pretrained/20_query_model.pth
Before running, start from the unmodified magic_numbers.py and set:
RESERVE_QUERIES_FOR_ARMS = False
NUM_RESERVED_QUERIES_FOR_ARMS = 0
python -m torch.distributed.launch --nproc_per_node=8 --master_port 64333 --use_env main_ref.py --num_workers 8 --dataset_config configs/yourefit.json --batch_size 7 --ema --text_encoder_lr 1e-4 --lr 5e-5 --output-dir 'output_dir/debug_np' --load pretrained/20_query_model.pth --pose False
(Requires generated inpaintings; see the optional "generate inpaintings" section under "reproduction".)
Before running, start from the unmodified magic_numbers.py and set:
RESERVE_QUERIES_FOR_ARMS = False
NUM_RESERVED_QUERIES_FOR_ARMS = 0
REPLACE_IMAGES_WITH_INPAINT = True
python -m torch.distributed.launch --nproc_per_node=8 --master_port 64334 --use_env main_ref.py --num_workers 8 --dataset_config configs/yourefit.json --batch_size 7 --ema --text_encoder_lr 1e-4 --lr 5e-5 --output-dir 'output_dir/debug_ip' --load pretrained/20_query_model.pth --pose False
We provide Jupyter notebooks to visualize the predictions stored in CSV files. To obtain these CSV files, set SAVE_EVALUATION_PREDICTIONS = True and run any of the evaluation commands provided in the evaluation section above. (A sketch for inspecting the raw CSV files follows the notebook list below.)
cleaned_visualize_predictions_eye_to_fingertip.ipynb
cleaned_visualize_predictions_elbow_joint_to_wrist.ipynb
cleaned_visualize_predictions_no-pose.ipynb
cleaned_visualize_predictions_inpaint.ipynb
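For a quick look at the raw CSVs outside of the notebooks, something like the following works (a minimal sketch; the column layout is whatever the export code writes, so print the header rather than assuming column names):

```python
import pandas as pd

# Peek at one of the saved prediction files (see the "predictions" folder
# in the project structure above).
df = pd.read_csv("predictions/eye-to-fingertip.csv")
print(df.columns.tolist())
print(df.head())
```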
We annotated eyes, fingertips, elbows, and wrists. The annotations are under the yourefit folder of this repo. Eye and fingertip locations are stored in CSV files; elbow and wrist locations are stored in a JSON file.
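A minimal sketch for inspecting these annotation files (the CSV columns and the JSON layout are whatever the annotation files define, so print them rather than assuming a schema; the JSON file name under yourefit/arm/ is not listed above, so the sketch simply takes whatever JSON file it finds):

```python
import json
from pathlib import Path

import pandas as pd

# Eye and fingertip annotations (CSV files listed in the project structure).
etf = pd.read_csv("yourefit/eye_to_fingertip/eye_to_fingertip_annotations_train.csv")
print(etf.columns.tolist())
print(etf.head())

# Elbow and wrist annotations (a JSON file under yourefit/arm/).
for path in Path("yourefit/arm").glob("*.json"):
    with open(path) as f:
        arm = json.load(f)
    print(path.name, "->", type(arm).__name__)
    break
```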