Molmo Logo

Molmo: Multimodal Open Language Model

GitHub License · Blog Post · Paper · Model Checkpoints · PixMo (Datasets)

Molmo is a repository for training and using Ai2's state-of-the-art multimodal open language models.

Here is a video demo of Molmo's capabilities. You can also try Molmo in our public demo, which showcases the Molmo-7B-D model.

This codebase is based on the OLMo codebase, with the addition of vision encoding and generative evaluation.

Release Notes

  • [2024/12/05] 🔥 Molmo: code for modeling, training, and evaluation has been released. You can find the detailed technical report here.

  • [2024/11/27] 🔥 PixMo, our new collection of datasets for pre-training and fine-tuning VLMs, has been released. PixMo consists of:

    • PixMo-Cap (pre-training, fine-tuning): highly detailed dense caption dataset (roughly 200 words on average)
    • PixMo-AskModelAnything (fine-tuning): instruction-tuning data containing human-authored image-question-answer triplets
    • PixMo-CapQA (fine-tuning): synthetic instruction-tuning data, using an LLM to build QA pairs from dense captions of images
    • PixMo-Points (fine-tuning): images paired with referring expressions and annotated points, supporting grounding and counting
    • PixMo-Point-Explanations (fine-tuning): instruction-tuning data with explanations containing in-line points referring to parts of the image
    • PixMo-Docs (fine-tuning): synthetic image-question-answer triplets about various kinds of computer-generated charts, tables, diagrams and documents. Code available here.
    • PixMo-Clocks (fine-tuning): virtual watch faces and time annotations
    • PixMo-Count (fine-tuning): diverse images with counting QA pairs

    All datasets were constructed without the use of VLMs.

PixMo and Molmo

Datasets in PixMo (left) and the capabilities they enable in Molmo (right).

  • [2024/09/24] 🔥 Molmo, a new family of open VLMs, has been released. The core models in the family are listed under Huggingface Models and Logs below.

Installation

We recommend using Python 3.10. First, install PyTorch according to the instructions specific to your operating system.
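
For example, on a Linux machine with CUDA 12.1, an install along these lines is typical (the CUDA version and index URL are illustrative; use the official PyTorch selector for your setup):

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121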

To install dependencies, run:

git clone https://github.com/allenai/molmo.git
cd molmo
pip install -e .[all]

For training and evaluating MolmoE-1B, also install megablocks:

pip install git+https://github.com/Muennighoff/megablocks.git@olmoe

Huggingface Models and Logs

The core models in the Molmo family released so far are:

Model | Vision Encoder | LLM | 11-benchmark avg
MolmoE-1B-0924 | OpenAI CLIP ViT-L/14@336 | OLMoE-1B-7B-0924 | 68.6
Molmo-7B-O-0924 | OpenAI CLIP ViT-L/14@336 | OLMo-7B-1024-preview | 74.6
Molmo-7B-D-0924 | OpenAI CLIP ViT-L/14@336 | Qwen2-7B | 77.3
Molmo-72B-0924 | OpenAI CLIP ViT-L/14@336 | Qwen2-72B | 81.2

W&B logs: pre-training, fine-tuning

Data Downloading and Setup

Molmo uses huggingface datasets for most data, so most of it will be stored in the default huggingface cache. See here for how to set it. Some additional data is stored separately in the path set by MOLMO_DATA_DIR.

For example, if you want to store the data in /data/molmo, you could set:

export MOLMO_DATA_DIR=/data/molmo
export HF_HOME=/data/molmo/huggingface

Data can then be downloaded with:

python3 scripts/download.py all --n_proc 12

Downloading the PixMo datasets requires downloading images from URLs. The download script will do this automatically, but it will take some time; downloading everything from scratch can take up to a day. Using more processes makes it faster, but also increases the risk of getting rate-limited.

Downloading can be resumed if canceled or an error occurs mid-download.

Some datasets (InfoQa and Scene-Text) require manually downloading the files. The download scripts will throw an error if those files are not found.

Downloading the Android Control dataset requires additional dependencies, since the original tfrecords need to be parsed.
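
As a rough sketch, the tfrecord parsing typically relies on TensorFlow, so an extra install along these lines may be needed (this is an assumption; the exact dependency list may differ):

pip install tensorflow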

To download a specific dataset, pass its name to the download script:

python3 scripts/download_data.py ChartQa --n_proc 12

Visualizing Data

Once downloaded, datasets can be visualized using the scripts/dataset_visualize.py script:

python3 scripts/dataset_visualize.py chart_qa /path/to/viz/dir

Trained Models

We release model weights both after pre-training and after fine-tuning in a format compatible with this codebase. The fine-tuned weights match those in the Hugging Face repos but are stored in a slightly different format, and the accompanying config files likewise differ slightly in format while remaining compatible with this repo.

Model | Pretrained | Fine-Tuned
MolmoE-1B-0924 | pretrained | fine-tuned
Molmo-7B-O-0924 | pretrained | fine-tuned
Molmo-7B-D-0924 | pretrained | fine-tuned
Molmo-72B-0924 | pretrained | fine-tuned

To use them, download and untar the files. Each folder contains the needed config file and model weights. For example:

wget https://storage.googleapis.com/oe-training-public/Molmo-0924/Molmo-7B-D-0924.tar
tar -xf Molmo-7B-D-0924.tar 

Evaluation

Evaluation is done with the launch_scripts/eval_downstream.py script. FSDP can be used to evaluate large models or to run high-resolution processing. Note that the vLLM version of Molmo is significantly faster for inference, but most of our numbers were reported using this local evaluation.
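
As an illustrative sketch only (not how the reported numbers were produced), the Hugging Face checkpoints can typically be served with vLLM; the exact flags depend on your vLLM version:

vllm serve allenai/Molmo-7B-D-0924 --trust-remote-code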

To evaluate on a single task, pass the task name (or task_name:split):

torchrun --nproc-per-node 8 launch_scripts/eval_downstream.py Molmo-7B-D-0924 text_vqa --save_to_checkpoint_dir

For most tasks, we evaluate with high resolution:

torchrun --nproc-per-node 8 launch_scripts/eval_downstream.py Molmo-7B-D-0924 text_vqa --save_to_checkpoint_dir --high_res --fsdp --device_batch_size=2

The --fsdp flag enables FSDP, which is needed to avoid OOMs when evaluating at high resolution.

To evaluate on our default eval set (including the 11 tasks in the paper):

torchrun --nproc-per-node 8 launch_scripts/eval_downstream.py Molmo-7B-D-0924 low-res --save_to_checkpoint_dir
torchrun --nproc-per-node 8 launch_scripts/eval_downstream.py Molmo-7B-D-0924 high-res --save_to_checkpoint_dir --high_res --fsdp --device_batch_size=2

To get test numbers, use low-res-test and high-res-test. Some test numbers require re-formatting the prediction files and then submitting them to test servers.
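
For example, test-set predictions for the low-res tasks can be generated with the same command pattern:

torchrun --nproc-per-node 8 launch_scripts/eval_downstream.py Molmo-7B-D-0924 low-res-test --save_to_checkpoint_dir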

To evaluate the 72B model with this codebase, you will need to run on multiple nodes and might need to set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.

These scripts will save the metrics and predictions in the save directory. Future calls to the eval script will re-use cached metrics if they exist; to overwrite these cached metrics, use the --overwrite flag.
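
For example, to force re-computation of a previously cached result:

torchrun --nproc-per-node 8 launch_scripts/eval_downstream.py Molmo-7B-D-0924 text_vqa --save_to_checkpoint_dir --overwrite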

Evaluation with VLMEvalkit

Evaluation of the HF models is also supported via open-compass/VLMEvalkit. Check PR#648 for the supported prompts and evaluation settings needed to reproduce results from the paper. However, a few datasets (e.g., PixMo-Count) are not supported.

Pretrained Models for Initialization

Training end-to-end requires downloading the pre-trained models used to initialize Molmo. This can be done with the scripts/convert_hf_to_molmo.py script.

For example, to load the Qwen2 LLM and OpenAI CLIP model, run:

python3 scripts/convert_hf_to_molmo.py qwen2_7b
python3 scripts/convert_hf_to_molmo.py openai

The models will be downloaded from huggingface, converted into a compatible format, and then saved into the MOLMO_DATA_DIR directory.

Pre-Training

The main training script is scripts/train.py. To train a model, you can either construct a config file to pass to it, or call one of the higher-level helper scripts in launch_scripts, which will construct a low-level config from some higher-level settings and then invoke the train script for you.
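
For the config-file route, the invocation would look roughly like the following (the config path is a placeholder, and the exact argument handling is inherited from the OLMo codebase, so check scripts/train.py before relying on this):

torchrun --nproc-per-node=8 scripts/train.py /path/to/config.yaml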

To start a debugging run:

torchrun --nproc-per-node=1 launch_scripts/train_captioner.py debug --save_folder=/path/to/save/folder

To train with the Qwen2 LLM and the CLIP vision encoder:

WANDB_API_KEY=key torchrun --nproc-per-node=8 launch_scripts/train_captioner.py qwen2_7b --wandb.name=run_name --wandb.entity=entity --wandb.project=project --save_folder=/path/to/save/folder

You can use other vision encoders including SigLIP, MetaCLIP and DINOv2 with the option --vision_backbone=model_name.
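
For example, something along these lines should select a SigLIP backbone (the identifier siglip is illustrative; check launch_scripts/train_captioner.py for the exact names it accepts):

torchrun --nproc-per-node=8 launch_scripts/train_captioner.py qwen2_7b --wandb=null --vision_backbone=siglip --save_folder=/path/to/save/folder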

To run without wandb, use:

torchrun --nproc-per-node=8 launch_scripts/train_captioner.py qwen2_7b --wandb=null --save_folder=/path/to/save/folder

Multitask Training

Multitask training can be done with launch_scripts/train_multitask_model.py, for example:

WANDB_API_KEY=key torchrun --nproc-per-node=8 launch_scripts/train_multitask_model.py 3.2-synthetic /path/to/checkpoint --wandb.name=run_name --wandb.entity=entity --wandb.project=project --save_folder=/path/to/save/folder

Here 3.2-synthetic refers to the training mixture to use, and /path/to/checkpoint points to a model checkpoint to start from, typically a dense captioning model.

To launch a debug run:

torchrun --nproc-per-node=1 launch_scripts/train_multitask_model.py debug debug --save_folder=dbg --save_overwrite

Training Changes

There are minor differences between the published Molmo models that we trained and what this repo will produce:

  • Image URLs might fail to download, which will cause the amount of training data to shrink slightly.
  • PixMo-Clocks is not used by default; it requires a more complex download script that we are still considering how to port.

Multi-Node

Executing the torchrun commands on each node with the appropriate args should allow multi-node training or evaluation.

We recommend ensuring the data is downloaded in advance and then setting HF_DATASETS_OFFLINE=1, so that the nodes don't flood HF with requests as they all initialize and potentially get rate-limited.
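
As a sketch, assuming two 8-GPU nodes with node 0 acting as the rendezvous host (hostnames, ports, and the choice of model/task here are placeholders), high-res evaluation could be launched like this:

# on every node, after the data has been downloaded
export HF_DATASETS_OFFLINE=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# node 0
torchrun --nnodes=2 --node-rank=0 --master-addr=node0.example.com --master-port=29500 --nproc-per-node=8 launch_scripts/eval_downstream.py Molmo-72B-0924 high-res --save_to_checkpoint_dir --high_res --fsdp --device_batch_size=2

# node 1
torchrun --nnodes=2 --node-rank=1 --master-addr=node0.example.com --master-port=29500 --nproc-per-node=8 launch_scripts/eval_downstream.py Molmo-72B-0924 high-res --save_to_checkpoint_dir --high_res --fsdp --device_batch_size=2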

Citation

@article{molmo2024,
  title={Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models},
  author={Matt Deitke and Christopher Clark and Sangho Lee and Rohun Tripathi and Yue Yang and Jae Sung Park and Mohammadreza Salehi and Niklas Muennighoff and Kyle Lo and Luca Soldaini and Jiasen Lu and Taira Anderson and Erin Bransom and Kiana Ehsani and Huong Ngo and YenSung Chen and Ajay Patel and Mark Yatskar and Chris Callison-Burch and Andrew Head and Rose Hendrix and Favyen Bastani and Eli VanderBilt and Nathan Lambert and Yvonne Chou and Arnavi Chheda and Jenna Sparks and Sam Skjonsberg and Michael Schmitz and Aaron Sarnat and Byron Bischoff and Pete Walsh and Chris Newell and Piper Wolters and Tanmay Gupta and Kuo-Hao Zeng and Jon Borchardt and Dirk Groeneveld and Jen Dumas and Crystal Nam and Sophie Lebrecht and Caitlin Wittlif and Carissa Schoenick and Oscar Michel and Ranjay Krishna and Luca Weihs and Noah A. Smith and Hannaneh Hajishirzi and Ross Girshick and Ali Farhadi and Aniruddha Kembhavi},
  journal={arXiv preprint arXiv:2409.17146},
  year={2024}
}