OmniFusion

A simple path to multimodal language models

Abstract

We introduce OmniFusion, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding.

  1. Multimodal Instruct Data. OmniFusion is trained on a mixture of open-source synthetic instruction-following data and high-quality image-language datasets for various downstream tasks.
  2. Model Architecture. To create our own multimodal model, we combined the best and most stable practices: a strong LLM, a powerful adapter, special tokens for the new modalities, and multi-stage instruction tuning with gradual unfreezing of the LLM.
  3. Performance. Our early experiments show that OmniFusion demonstrates competitive multimodal chat abilities, performing on par with multimodal GPT-4V on unseen images/instructions in many use cases and achieving benchmark results comparable to larger models.
  4. Open Source. We provide the weights of the trained model with Mistral as the LLM backbone (see the download sketch below).
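As an illustration only, the released weights can be fetched with the `huggingface_hub` client; the repository id and file pattern below are assumptions, so check the actual model card for the published paths.

```python
# Illustrative download of the released checkpoint via huggingface_hub.
# The repo id and subfolder pattern are assumptions; use the ids from the model card.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="AIRI-Institute/OmniFusion",      # hypothetical repo id
    allow_patterns=["OmniMistral-v1_1/*"],    # hypothetical subfolder with the Mistral-based weights
)
print("Checkpoint files downloaded to:", local_dir)
```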

Changelog

  1. 10/04/2024 OmniFusion-1.1 weights are uploaded to Huggingface. Now the model can speak Russian :)
  2. 10/04/2024 Model training source code for OmniFusion-1.1 released
  3. 22/11/2023 OmniFusion weights are available on Huggingface

OmniFusion Architecture

There are two architecture options for the OmniFusion model: the first uses a single visual encoder (CLIP ViT-L/14), while the second uses two encoders (CLIP ViT-L/14 and DINO v2). OmniFusion connects the pre-trained visual encoders to a large language model through a trainable adapter.
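As a rough sketch of this layout (not the released implementation), the snippet below projects frozen CLIP ViT-L/14 patch features into the Mistral embedding space through a small trainable MLP adapter and prepends them to the text embeddings; the adapter shape and the omission of the special modality tokens are simplifications.

```python
# Illustrative sketch of the OmniFusion layout (frozen vision encoder -> trainable
# adapter -> LLM); module sizes and names are assumptions, not the authors' code.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

class VisualAdapter(nn.Module):
    """Projects visual patch features into the LLM embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096, hidden_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, features):          # (batch, patches, vision_dim)
        return self.proj(features)        # (batch, patches, llm_dim)

vision_encoder = AutoModel.from_pretrained("openai/clip-vit-large-patch14").vision_model
llm = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
adapter = VisualAdapter()

pixel_values = torch.randn(1, 3, 224, 224)          # stand-in for a preprocessed image
with torch.no_grad():
    patches = vision_encoder(pixel_values).last_hidden_state   # (1, 257, 1024)

visual_embeds = adapter(patches)                                # (1, 257, 4096)
input_ids = tokenizer("Describe the image.", return_tensors="pt").input_ids
text_embeds = llm.get_input_embeddings()(input_ids)
inputs_embeds = torch.cat([visual_embeds, text_embeds], dim=1)  # image tokens before text
logits = llm(inputs_embeds=inputs_embeds).logits
```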

We consider a two-stage instruction-tuning procedure:

  • Stage 1: Pre-training for Feature Alignment on image-caption pairs. Only the adapter is trained, using a subset of CC3M and COCO.
  • Stage 2: Fine-tuning End-to-End on multimodal dialogues. Both the adapter and the LLM are updated using a mix of instruction data consisting of two parts: Russian-language and English-language dialogues (a training-schedule sketch follows the table below). The dataset has the following structure:
    SFT Datasets

    | Task | Caption | VQA | WebQA | OCRQA | Conversation | DocVQA | Text-only SFT |
    |---|---|---|---|---|---|---|---|
    | Dataset source | ShareGPT4V | COCO, SAM-9K | WebData | TextVQA, OCRVQA | LLaVA-v1.5-665k, OCRVQA | Proprietary data (ru) | Proprietary data (ru), Alpaca (en) |
    | #Samples | 100K | 20K, 9K | 1.5K | 120K | 665K | 20K | 10K |
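Continuing the illustrative modules from the architecture sketch above, the following is a minimal (and assumed) version of this two-stage schedule: Stage 1 updates only the adapter, Stage 2 unfreezes the LLM and trains it jointly with the adapter. The optimizers and learning rates are placeholders, not the reported hyperparameters.

```python
# Assumed two-stage schedule, reusing `vision_encoder`, `adapter`, and `llm`
# from the architecture sketch above; hyperparameters are placeholders.
import torch

def set_trainable(module: torch.nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

# Stage 1: feature alignment on image-caption pairs; only the adapter is updated.
set_trainable(vision_encoder, False)
set_trainable(llm, False)
set_trainable(adapter, True)
stage1_opt = torch.optim.AdamW(adapter.parameters(), lr=1e-3)

# Stage 2: end-to-end SFT on multimodal dialogues (Russian and English);
# the LLM is unfrozen (the text describes gradual unfreezing, simplified here
# to a single step) and trained together with the adapter.
set_trainable(llm, True)
stage2_opt = torch.optim.AdamW(
    [{"params": adapter.parameters()}, {"params": llm.parameters()}],
    lr=2e-5,
)
```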

Performance

OmniFusion demonstrates performance competitive with the state of the art on a range of benchmarks as well as in subjective evaluation on downstream tasks.

Visual Dialog Performance

| Model | NDCG | MRR | Recall@1 | Recall@5 | Recall@10 |
|---|---|---|---|---|---|
| OmniFusion | 25.91 | 10.78 | 4.74 | 13.80 | 20.53 |
| LLaVA-13B | 24.74 | 8.91 | 2.98 | 10.80 | 18.02 |
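For context on the metrics above, MRR and Recall@k are derived from the rank of the ground-truth answer among the candidate answers for each dialog round (NDCG additionally uses dense relevance annotations and is omitted here). The snippet below is a generic illustration with made-up ranks, reported on a 0-100 scale to match the table; it is not the official Visual Dialog evaluation code.

```python
# Generic MRR and Recall@k from 1-based ground-truth answer ranks;
# not the official Visual Dialog evaluation script, and the ranks are made up.
def mrr(ranks):
    return 100.0 * sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_k(ranks, k):
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)

ranks = [1, 3, 12, 2, 7]   # hypothetical ranks of the correct answer per round
print(f"MRR: {mrr(ranks):.2f}, R@5: {recall_at_k(ranks, 5):.2f}")
```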

OmniFusion-1.1 scores

| Model | TextVQA | ScienceQA | POPE | GQA | OK-VQA |
|---|---|---|---|---|---|
| OmniFusion-1.1 (one encoder, Mistral) | 0.4893 | 0.6802 | 0.7818 | 0.4600 | 0.5187 |
| OmniFusion-1.1 (two encoders, Mistral) | 0.4755 | 0.6732 | 0.8153 | 0.4761 | 0.5317 |

OmniFusion-1.1 (with proprietary GigaChat LLM) results on various benchmarks:

OmniFusion-1.1 Examples

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Usage and License Notices: The data, code, and checkpoint are intended and licensed for research use only. They are also restricted to uses that follow the license agreement of CLIP.