Molmo is a repository for training and using Ai2's state-of-the-art multimodal open language models.
Here is a video demo of Molmo's capabilities. Try Molmo using our public demo showcasing the Molmo-7B-D model.
This codebase is based on the OLMo codebase, with the addition of vision encoding and the integration of generative evaluations.
- [2024/12/05] 🔥 Molmo: code for modeling, training and evaluation has been released. You can find the detailed technical report here.
- [2024/11/27] 🔥 PixMo, our new collection of datasets for pre-training and fine-tuning VLMs, has been released. PixMo consists of:
- PixMo-Cap (pre-training, fine-tuning): highly detailed dense caption dataset (roughly 200 words on average)
- PixMo-AskModelAnything (fine-tuning): instruction-tuning data containing human-authored image-question-answer triplets
- PixMo-CapQA (fine-tuning): synthetic instruction-tuning data, using an LLM to build QA pairs from dense captions of images
- PixMo-Points (fine-tuning): images paired with referring expressions and annotated points, supporting grounding and counting
- PixMo-Point-Explanations (fine-tuning): instruction-tuning data with explanations containing in-line points referring to parts of the image
- PixMo-Docs (fine-tuning): synthetic image-question-answer triplets about various kinds of computer-generated charts, tables, diagrams and documents. Code available here.
- PixMo-Clocks (fine-tuning): virtual watch faces and time annotations
- PixMo-Count (fine-tuning): diverse images with counting QA pairs
All datasets were constructed without the use of VLMs.
- [2024/09/24] 🔥 Molmo, a new family of open VLMs, has been released. The Molmo family consists of:
- MolmoE-1B: a mixture-of-experts model with 1B active and 7B total parameters
- Molmo-7B-O: our most open 7B model
- Molmo-7B-D: our best 7B and demo model
- Molmo-72B: our best 72B model
We recommend using Python 3.10. First, install PyTorch according to the instructions for your operating system.
To install dependencies, run:
git clone https://github.com/allenai/molmo.git
cd molmo
pip install -e .[all]
For training and evaluating MolmoE-1B, please install megablocks by running pip install git+https://github.com/Muennighoff/megablocks.git@olmoe.
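After installing, an optional sanity check (not part of the original instructions) is to confirm that PyTorch can see your GPUs before launching any training or evaluation:

python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"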
The core models in the Molmo family released so far are:
Model | Vision Encoder | LLM | 11-benchmark avg |
---|---|---|---|
MolmoE-1B-0924 | OpenAI CLIP ViT-L/14@336 | OLMoE-1B-7B-0924 | 68.6 |
Molmo-7B-O-0924 | OpenAI CLIP ViT-L/14@336 | OLMo-7B-1024-preview | 74.6 |
Molmo-7B-D-0924 | OpenAI CLIP ViT-L/14@336 | Qwen2-7B | 77.3 |
Molmo-72B-0924 | OpenAI CLIP ViT-L/14@336 | Qwen2-72B | 81.2 |
W&B logs: pre-training, fine-tuning
Molmo uses huggingface datasets for most data, so most data will be stored in the default huggingface cache. See here for how to set it. Some additional data is stored separately in the path set by MOLMO_DATA_DIR.
For example, if you want to store the data in /data/molmo you could set:
export MOLMO_DATA_DIR=/data/molmo
export HF_HOME=/data/molmo/huggingface
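As a quick sanity check (a hypothetical snippet, not part of the repo), you can confirm which locations the huggingface datasets library will use once these variables are exported:

```python
import os
import datasets

# MOLMO_DATA_DIR is read by this repo's data code; HF_HOME controls the huggingface cache.
print("MOLMO_DATA_DIR:", os.environ.get("MOLMO_DATA_DIR", "<unset>"))
print("HF datasets cache:", datasets.config.HF_DATASETS_CACHE)
```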
Data can then be downloaded with:
python3 scripts/download_data.py all --n_proc 12
Downloading the PixMo datasets requires downloading images from URLs. The download script does this automatically, but it takes time; downloading everything from scratch can take up to a day. Using more processes makes it faster, but also increases the risk of getting rate-limited.
Downloading can be resumed if it is canceled or an error occurs mid-download.
Some datasets (InfoQa and Scene-Text) require manually downloading the files. The download scripts will throw an error if those files are not found.
Downloading the Android Control dataset requires additional dependencies, since the original tfrecords need to be parsed.
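For reference, the following is a minimal sketch of the kind of resumable, multi-process image fetching the download script performs. It is illustrative only; the function names, retry policy, and placeholder paths are assumptions, not the repo's actual code:

```python
import os
import time
import urllib.request
from multiprocessing import Pool

def fetch(url_and_path):
    """Download one image, skipping files that already exist so an interrupted job can resume."""
    url, path = url_and_path
    if os.path.exists(path):
        return path
    for attempt in range(5):  # simple retry with exponential backoff to ride out rate limits
        try:
            urllib.request.urlretrieve(url, path)
            return path
        except Exception:
            time.sleep(2 ** attempt)
    return None  # give up on this URL; re-running the job will retry it

if __name__ == "__main__":
    pairs = [("https://example.com/img.jpg", "/data/molmo/images/img.jpg")]  # placeholder list
    with Pool(12) as pool:  # more processes = faster, but a higher risk of rate limiting
        pool.map(fetch, pairs)
```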
To download a specific dataset, pass in the dataset name:
python3 scripts/download_data.py ChartQa --n_proc 12
Once downloaded, datasets can be visualized using the scripts/dataset_visualize.py script:
python3 scripts/dataset_visualize.py chart_qa /path/to/viz/dir
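The PixMo source datasets can also be inspected directly from the Hugging Face Hub, independently of this repo's download and visualization scripts. A minimal sketch (the dataset ID below is the public allenai/pixmo-cap release; adjust it to the dataset you care about):

```python
from datasets import load_dataset

# Stream a few PixMo-Cap records straight from the Hub without downloading everything.
ds = load_dataset("allenai/pixmo-cap", split="train", streaming=True)
for i, example in enumerate(ds):
    print(example)  # each record pairs an image URL with its dense caption
    if i >= 2:
        break
```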
We release model weights both after pre-training and after fine-tuning in a format compatible with this codebase. The fine-tuned weights match the ones in the Hugging Face repos but have a slightly different format; the config files likewise differ slightly in format but remain backwards-compatible with this repo.
Model | Pretrained | Fine-Tuned |
---|---|---|
MolmoE-1B-0924 | pretrained | fine-tuned |
Molmo-7B-O-0924 | pretrained | fine-tuned |
Molmo-7B-D-0924 | pretrained | fine-tuned |
Molmo-72B-0924 | pretrained | fine-tuned |
To use them, download the tar file and extract it. Each folder contains the needed config file and model weights. For example:
wget https://storage.googleapis.com/oe-training-public/Molmo-0924/Molmo-7B-D-0924.tar
tar -xf Molmo-7B-D-0924.tar
Evaluation is done with the launch_scripts/eval_downstream.py script.
FSDP can be used to evaluate large models, or for high-resolution processing.
Note that the vLLM version of Molmo will be significantly faster for inference, but most of
our numbers were reported using the results of this local evaluation.
To evaluate on a single task, pass the task name, or task_name:split:
torchrun --nproc-per-node 8 launch_scripts/eval_downstream.py Molmo-7B-D-0924 text_vqa --save_to_checkpoint_dir
For most tasks, we evaluate with high resolution:
torchrun --nproc-per-node 8 launch_scripts/eval_downstream.py Molmo-7B-D-0924 text_vqa --save_to_checkpoint_dir --high_res --fsdp --device_batch_size=2
The --fsdp flag enables FSDP, which is needed to avoid OOMs when using high resolution.
To evaluate on our default eval set (including the 11 tasks in the paper):
torchrun --nproc-per-node 8 launch_scripts/eval_downstream.py Molmo-7B-D-0924 low-res --save_to_checkpoint_dir
torchrun --nproc-per-node 8 launch_scripts/eval_downstream.py Molmo-7B-D-0924 high-res --save_to_checkpoint_dir --high_res --fsdp --device_batch_size=2
To get test numbers, use low-res-test and high-res-test. Some test numbers will require re-formatting the prediction files and then submitting them to the test servers.
To evaluate the 72B model with this codebase, you will need to run on multiple nodes and might need to set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.
These scripts will save the metrics and predictions in the save directory. Future calls to the eval script will re-use cached metrics if they exist; to overwrite these cached metrics, use the --overwrite flag.
Evaluation of the HF models is also supported via open-compass/VLMEvalKit. Check PR#648 for the supported prompts and evaluation settings needed to reproduce results from the paper. However, a few datasets (e.g., PixMo-Count) are not supported.
Training end-to-end requires downloading the pre-trained models used to initialize Molmo.
This can be done with the scripts/convert_hf_to_molmo.py script.
For example, to load the Qwen2 LLM and OpenAI CLIP model, run:
python3 scripts/convert_hf_to_molmo.py qwen2_7b
python3 scripts/convert_hf_to_molmo.py openai
The model will be downloaded from huggingface, converted into a compatible format, and then saved into the MOLMO_DATA_DIR directory.
The main training script is scripts/train.py. To train a model, you can either construct a config file to pass to it, or call one of the higher-level helper scripts in launch_scripts, which will construct a low-level config from some higher-level settings and then invoke the train script for you.
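For example, the low-level entry point can be invoked directly with a config file. This assumes the OLMo-style invocation, where the config path is passed as the first argument, and the path below is a placeholder:

torchrun --nproc-per-node=8 scripts/train.py /path/to/config.yaml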
To start a debugging run:
torchrun --nproc-per-node=1 launch_scripts/train_captioner.py debug --save_folder=/path/to/save/folder
To train with the Qwen2 LLM and the CLIP vision encoder:
WANDB_API_KEY=key torchrun --nproc-per-node=8 launch_scripts/train_captioner.py qwen2_7b --wandb.name=run_name --wandb.entity=entity --wandb.project=project --save_folder=/path/to/save/folder
You can use other vision encoders, including SigLIP, MetaCLIP and DINOv2, with the option --vision_backbone=model_name.
To run without wandb, use:
torchrun --nproc-per-node=8 launch_scripts/train_captioner.py qwen2_7b --wandb=null --save_folder=/path/to/save/folder
Multitask training can be done with launch_scripts/train_multitask_model.py, for example:
WANDB_API_KEY=key torchrun --nproc-per-node=8 launch_scripts/train_multitask_model.py 3.2-synthetic /path/to/checkpoint --wandb.name=run_name --wandb.entity=entity --wandb.project=project --save_folder=/path/to/save/folder
Here 3.2-synthetic refers to the training mixture to use, and /path/to/checkpoint points to a model checkpoint to start from, typically a dense captioning model.
To launch a debug run:
torchrun --nproc-per-node=1 launch_scripts/train_multitask_model.py debug debug --save_folder=dbg --save_overwrite
There are minor differences between the published Molmo models that we trained and what this repo will produce.
- Image URLs might fail to download, which will cause the amount of data to shrink slightly.
- PixMo-Clocks is not used by default; it requires a more complex download script that we are still considering how to port.
Executing the torchrun commands on each node with the appropriate args allows multi-node training or evaluation.
We recommend ensuring the data is already downloaded and then setting the environment variable HF_DATASETS_OFFLINE=1 so that the nodes don't flood HF with requests as they all initialize and potentially get rate-limited.
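For example, a two-node multitask training run might look like the following, executed once per node. This is a sketch: the rendezvous address, port, and node ranks are placeholders you must adapt to your cluster.

On node 0:

HF_DATASETS_OFFLINE=1 torchrun --nnodes=2 --node_rank=0 --master_addr=<node0-ip> --master_port=29500 --nproc-per-node=8 launch_scripts/train_multitask_model.py 3.2-synthetic /path/to/checkpoint --save_folder=/path/to/save/folder

On node 1 (identical except for the rank):

HF_DATASETS_OFFLINE=1 torchrun --nnodes=2 --node_rank=1 --master_addr=<node0-ip> --master_port=29500 --nproc-per-node=8 launch_scripts/train_multitask_model.py 3.2-synthetic /path/to/checkpoint --save_folder=/path/to/save/folder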
@article{molmo2024,
title={Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models},
author={Matt Deitke and Christopher Clark and Sangho Lee and Rohun Tripathi and Yue Yang and Jae Sung Park and Mohammadreza Salehi and Niklas Muennighoff and Kyle Lo and Luca Soldaini and Jiasen Lu and Taira Anderson and Erin Bransom and Kiana Ehsani and Huong Ngo and YenSung Chen and Ajay Patel and Mark Yatskar and Chris Callison-Burch and Andrew Head and Rose Hendrix and Favyen Bastani and Eli VanderBilt and Nathan Lambert and Yvonne Chou and Arnavi Chheda and Jenna Sparks and Sam Skjonsberg and Michael Schmitz and Aaron Sarnat and Byron Bischoff and Pete Walsh and Chris Newell and Piper Wolters and Tanmay Gupta and Kuo-Hao Zeng and Jon Borchardt and Dirk Groeneveld and Jen Dumas and Crystal Nam and Sophie Lebrecht and Caitlin Wittlif and Carissa Schoenick and Oscar Michel and Ranjay Krishna and Luca Weihs and Noah A. Smith and Hannaneh Hajishirzi and Ross Girshick and Ali Farhadi and Aniruddha Kembhavi},
journal={arXiv preprint arXiv:2409.17146},
year={2024}
}