VQASynth 🎹

Open In Colab

Try VQASynth on your image in the HF Space


Spatial reasoning is fundamental to interacting with and navigating physical environments for embodied AI applications like robotics. However, data samples suitable for learning these capabilities are rare in AI pretraining datasets. Don't be limited by what your model can do out-of-the-box: with tools for scene understanding, you can curate any image dataset from the Huggingface Hub into a Spatial VQA dataset.

VLMs trained using VQASynth 🎹

  • estimate 3D distances between objects in an image
  • describe distances colloquially, convert between common units of measurement
  • answer queries about the orientation and spatial relationships between objects
  • base responses on consistent references like floors and surfaces

Description

By fusing semantic and metric data into templated VQA chat, Vision Language Models can be instruction-tuned with low-rank adapters to enhance their baseline spatial reasoning capabilities. VQASynth 🎹 provides an open-source reproduction of SpatialVLM, which describes a 3D scene reconstruction pipeline and prompt templates for enhancing the spatial reasoning abilities of VLMs, including:

  • Semantic filtering with CLIP to normalize the image distribution and attributes
  • Metric Depth Estimation with ZoeDepth to lift the 2D image to a 3D point cloud
  • Object-level captioning with FlexCap for precise 2D region proposal
  • Plane-fitting with RANSAC for consistent 3D reference coordinates
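The plane-fitting step above can be sketched with a minimal NumPy RANSAC loop (a toy illustration, not the pipeline's actual implementation): sample three points, fit a candidate plane, and keep the model with the most inliers.

```python
import numpy as np

def fit_plane_ransac(points, n_iters=200, threshold=0.01, seed=0):
    """RANSAC plane fit: returns ((normal, d), inlier_mask) for normal.x + d = 0."""
    rng = np.random.default_rng(seed)
    best_model, best_inliers = None, np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        a, b, c = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(b - a, c - a)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:
            continue  # degenerate (collinear) sample
        normal = normal / norm
        d = -normal @ a
        inliers = np.abs(points @ normal + d) < threshold
        if inliers.sum() > best_inliers.sum():
            best_model, best_inliers = (normal, d), inliers
    return best_model, best_inliers

# Synthetic scene: a flat floor at z = 0 plus clutter floating above it
rng = np.random.default_rng(1)
floor = np.column_stack([rng.random((200, 2)), np.zeros(200)])
clutter = rng.random((20, 3)) + [0.0, 0.0, 1.0]
(normal, d), inliers = fit_plane_ransac(np.concatenate([floor, clutter]))
# The recovered plane is the floor: normal ~ (0, 0, +/-1), d ~ 0
```

In the real pipeline the input points come from the depth-lifted point cloud, and the fitted floor plane supplies the consistent reference frame for spatial relations.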

Initial VQASynth 🎹 pipelines prompted LLaVA for JSON-formatted, detailed object-level captions, or used RAM for tags. Accordingly, we evaluated caption/tag-based region proposal with publicly available models like CLIPSeg and groundingDINO.
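Embedding-based tag filtering of this kind can be sketched as a nearest-tag lookup in a shared embedding space. The helper below is a toy stand-in for what CLIP-style embeddings enable, not VQASynth's actual `TagFilter`; the vectors and tag names are made up.

```python
import numpy as np

def filter_by_tags(img_embs, tag_embs, tags, include):
    """Keep indices of images whose best-matching tag (cosine similarity) is in `include`."""
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    tag = tag_embs / np.linalg.norm(tag_embs, axis=1, keepdims=True)
    best = (img @ tag.T).argmax(axis=1)  # nearest tag per image
    return [i for i, b in enumerate(best) if tags[b] in include]

tags = ["indoor", "outdoor"]
tag_embs = np.array([[1.0, 0.0], [0.0, 1.0]])              # toy tag embeddings
img_embs = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])  # toy image embeddings
kept = filter_by_tags(img_embs, tag_embs, tags, include={"indoor"})  # → [0, 2]
```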

(Pipeline diagram: VQASynth-diagram.png)

What's New 👀 in VQASynth 🎹

🪶 Faster & lighter using Florence-2 for detailed image captions and region proposal grounded on text captions.

📏 Improves metric depth estimation speed & accuracy by replacing ZoeDepth with DepthPro.

🎓 SAM2 replaces SAM in the localization refinement stage.
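The depth-lifting step is standard pinhole back-projection from a metric depth map. A minimal sketch follows; the intrinsics fx, fy, cx, cy are illustrative values and in practice come from the camera or are estimated.

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy):
    """Back-project a metric depth map (H, W) to an (H*W, 3) point cloud in camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

depth = np.full((4, 4), 2.0)  # toy depth map: a flat surface 2 m away
pts = depth_to_pointcloud(depth, fx=100.0, fy=100.0, cx=2.0, cy=2.0)
# The principal-point pixel (u=2, v=2) back-projects to (0, 0, 2)
```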

Environment

Before running the demo scripts, ensure you have Docker and Docker Compose installed.

Run a Pipeline on Your Images

Use Docker Compose to transform image datasets from Huggingface Hub into VQA datasets describing spatial relations between objects. You can process different datasets after updating the config.yaml.

Then run the spatial VQA pipeline locally with Docker:

# Authenticate to push to hub
huggingface-cli login

# Run the pipeline
cd /path/to/VQASynth
bash run.sh

You can run the Colab notebook with free-tier CPU or GPU acceleration, or customize your own pipeline:

from vqasynth.datasets import Dataloader
from vqasynth.embeddings import EmbeddingGenerator, TagFilter

# cache_dir, dataset_name, include_tags, exclude_tags, and images are
# user-supplied: a local cache path, a Huggingface dataset id, comma-separated
# tag strings, and the dataset's image column, respectively.
dataloader = Dataloader(cache_dir)
dataset = dataloader.load_dataset(dataset_name)
embedding_generator = EmbeddingGenerator()
tag_filter = TagFilter()

include_tags = include_tags.strip().split(",")
exclude_tags = exclude_tags.strip().split(",")

# Extract embeddings
dataset = dataset.map(lambda example: embedding_generator.apply_transform(example, images))

# Extract tags
dataset = dataset.map(lambda example: tag_filter.apply_transform(example, include_tags + exclude_tags))

# Keep only examples whose tags match the include/exclude lists
dataset_filtered = dataset.filter(
    lambda example: tag_filter.filter_by_tag(
        example['tag'], include_tags, exclude_tags
    )
)

The resulting Huggingface dataset is saved in the cache directory; push it to the hub with:

dataloader.push_to_hub(final_dataset, target_repo_name)

Datasets from VQASynth 🎹

Here are some examples:

sample_1
Q: Does the red forklift in warehouse appear on the left side of the brown cardboard boxes stacked?
A: Incorrect, the red forklift in warehouse is not on the left side of the brown cardboard boxes stacked.

sample_2
Q: How close is the man in red hat walking from the wooden pallet with boxes?
A: The man in red hat walking is 60.13 centimeters from the wooden pallet with boxes.

sample_3
Q: Does the man in blue shirt working have a greater height compared to the wooden pallet with boxes on floor?
A: Indeed, the man in blue shirt working is taller compared to the wooden pallet with boxes on floor.
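Answers like these are produced by templating measurements taken over the reconstructed objects' 3D points. A minimal sketch of distance-based QA templating follows; `spatial_qa` is a hypothetical helper for illustration, not part of the VQASynth API.

```python
import numpy as np

def spatial_qa(name_a, pts_a, name_b, pts_b):
    """Template a distance question/answer from two objects' 3D points (meters)."""
    dist_cm = 100.0 * np.linalg.norm(pts_a.mean(axis=0) - pts_b.mean(axis=0))
    question = f"How close is {name_a} from {name_b}?"
    answer = f"{name_a} is {dist_cm:.2f} centimeters from {name_b}."
    return question, answer

# Toy object point clouds whose centroids are 0.6 m apart
q, a = spatial_qa("the chair", np.array([[0.0, 0.0, 1.0]]),
                  "the table", np.array([[0.6, 0.0, 1.0]]))
# a == "the chair is 60.00 centimeters from the table."
```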

Models tuned on VQASynth 🎹

Try SpaceLLaVA in Discord


Notebooks

We've hosted some notebooks visualizing and experimenting with the techniques included in this repo.

Notebook | Description | Launch
Generate Spatial VQA Dataset | Augment an HF Image Dataset with Spatial VQA | Open In Colab
Spatial Reasoning with Point Clouds | Visualize point clouds and evaluate spatial relationships | Open In Colab

References

This project was inspired by or utilizes concepts discussed in the following research paper(s):

@article{chen2024spatialvlm,
  title = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
  author = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
  journal = {arXiv preprint arXiv:2401.12168},
  year = {2024},
  url = {https://arxiv.org/abs/2401.12168},
}