A LLaVA fork enabling the Mistral-7B and OpenHermes-2.5 language models to process images
This release and the associated models were created in collaboration between the Robin team at AGI-Collective and Simon Ramstedt, with computing resources from Hessian-AI and OLCF.
As part of this first milestone and release, we study pretrained LLMs (Vicuna, Mistral and OpenHermes 2.5) and vision models (CLIP and SigLIP), and further improve capabilities by finetuning the vision encoder.
Available models
We use the following components:
- Base LLM: We explore using Vicuna, Mistral and OpenHermes-2.5
- Base Vision Model: We use the SigLIP model since it has shown stronger performance on vision benchmarks compared to CLIP
- We finetune the vision encoder, hoping that the next-token prediction loss further improves the vision capabilities of the pretrained vision encoder
Model | Base LLM | GQA | SQA Text | SQA Image |
---|---|---|---|---|
liuhaotian/llava-v1.5-7b | lmsys/vicuna-7b-v1.5 | 62 | 70.43 | 66.8 |
liuhaotian/llava-v1.5-13b | lmsys/vicuna-13b-v1.5 | 63.3 | 71.6 | |
agi-collective/vicuna-7b-clip-finetune-lora | lmsys/vicuna-7b-v1.5 | **62.04** | 70.86 | 68.72 |
agi-collective/vicuna-7b-siglip-so400m-finetune-lora | lmsys/vicuna-7b-v1.5 | 56.79 | 68.76 | 67.48 |
agi-collective/mistral-7b-siglip-so400m-finetune-lora | mistralai/Mistral-7B-v0.1 | 49.44 | 73.66 | 68.57 |
agi-collective/mistral-7b-oh-siglip-so400m-frozen-ve-finetune-lora | teknium/OpenHermes-2.5-Mistral-7B | 53.59 | 78.17 | 72.73 |
agi-collective/mistral-7b-oh-siglip-so400m-finetune-lora | teknium/OpenHermes-2.5-Mistral-7B | 54.48 | **79.56** | **74.22** |
(best 7B model results highlighted)
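To make the comparison concrete, here is a minimal sketch that reads the 7B rows of the table above and reports the best model per benchmark, along with the gain from finetuning the vision encoder on the OpenHermes-2.5 variant. The scores are copied from the table; the names `SCORES_7B`, `best_per_benchmark`, and `vision_finetune_gain` are hypothetical helpers introduced for illustration and are not part of this repository.

```python
# Illustrative only: scores are copied from the table above.
# All names here are hypothetical helpers, not part of the Robin codebase.
BENCHMARKS = ("GQA", "SQA Text", "SQA Image")

# 7B models only; each tuple is (GQA, SQA Text, SQA Image).
SCORES_7B = {
    "liuhaotian/llava-v1.5-7b": (62.0, 70.43, 66.8),
    "agi-collective/vicuna-7b-clip-finetune-lora": (62.04, 70.86, 68.72),
    "agi-collective/vicuna-7b-siglip-so400m-finetune-lora": (56.79, 68.76, 67.48),
    "agi-collective/mistral-7b-siglip-so400m-finetune-lora": (49.44, 73.66, 68.57),
    "agi-collective/mistral-7b-oh-siglip-so400m-frozen-ve-finetune-lora": (53.59, 78.17, 72.73),
    "agi-collective/mistral-7b-oh-siglip-so400m-finetune-lora": (54.48, 79.56, 74.22),
}


def best_per_benchmark(scores):
    """Return the highest-scoring model name for each benchmark."""
    return {
        bench: max(scores, key=lambda model: scores[model][i])
        for i, bench in enumerate(BENCHMARKS)
    }


def vision_finetune_gain(scores, frozen, finetuned):
    """Per-benchmark score difference (finetuned minus frozen vision encoder)."""
    return {
        bench: round(scores[finetuned][i] - scores[frozen][i], 2)
        for i, bench in enumerate(BENCHMARKS)
    }


if __name__ == "__main__":
    for bench, model in best_per_benchmark(SCORES_7B).items():
        print(f"{bench}: {model}")
    gain = vision_finetune_gain(
        SCORES_7B,
        frozen="agi-collective/mistral-7b-oh-siglip-so400m-frozen-ve-finetune-lora",
        finetuned="agi-collective/mistral-7b-oh-siglip-so400m-finetune-lora",
    )
    print(gain)
```

The frozen versus finetuned comparison uses the only pair of rows in the table that differ solely in whether the vision encoder was trained, so it gives a rough view of what finetuning the encoder contributes on these benchmarks.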
Authors
Daniel Z Kaplan1, Kshitij Gupta1, Simon Ramstedt1, Alexis Roger2, Edwin Fennell2, George Adamopoulos2, Quentin Anthony2, Sun Qi2, Andrew R Williams3, Prateek Humane3, Rishika Bhagwatkar3, Yuchen Lu3, Irina Rish4
1first author, 2second author, 3third author, 4PI
Citation
@misc{RobinV1,
author = {Daniel Z Kaplan and Kshitij Gupta and Simon Ramstedt and Alexis Roger and Edwin Fennell and George Adamopoulos and Quentin Anthony and Sun Qi and Andrew R Williams and Prateek Humane and Rishika Bhagwatkar and Yuchen Lu and Irina Rish},
title = {Robin - Visual Language Models},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/AGI-Collective/Robin/releases/tag/v1.0.0}},
commit = {tags/v1.0.0}
}
Acknowledgements
We would like to thank Hessian-AI for providing us with free access to 8-16 A100 GPUs for a few weeks, and Florian and Patrick at Hessian-AI for their support. We would also like to thank the Oak Ridge Leadership Computing Facility (OLCF), a DOE Office of Science User Facility. Preliminary experiments were conducted under the INCITE compute grant on the Summit supercomputer, supported under Contract DE-AC05-00OR22725. This grant was awarded to the AAI CERC lab for their Scalable Foundation Models for Transferrable Generalist AI project. This work was done in collaboration with representatives from EleutherAI. The code in this repo is based on github.com/haotian-liu/LLaVA.