OmniFusion

A simple path to multimodal language models

Abstract

We introduce OmniFusion, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding.

  1. Multimodal Instruct Data. OmniFusion is trained on a mixture of open-source synthetic instruction-following data and high-quality image-language datasets for various downstream tasks.
  2. Model Architecture. To create our own multimodal model, we combined the best and most stable practices: a strong LLM, a powerful adapter, special tokens for the new modalities, and multi-stage instruction tuning with gradual unfreezing of the LLM.
  3. Performance. Our early experiments show that OmniFusion demonstrates competitive multimodal chat abilities, performing on par with multimodal GPT-4V on unseen images/instructions in many use cases and achieving benchmark results comparable to larger models.
  4. Open Source. We provide the weights of the trained model with Mistral as the LLM backbone (see the download sketch below).
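As an illustration only, the released weights can be fetched with the `huggingface_hub` client; the repository id and file pattern below are assumptions, so check the actual model card for the published paths.

```python
# Illustrative download of the released checkpoint via huggingface_hub.
# The repo id and subfolder pattern are assumptions; use the ids from the model card.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="AIRI-Institute/OmniFusion",      # hypothetical repo id
    allow_patterns=["OmniMistral-v1_1/*"],    # hypothetical subfolder with the Mistral-based weights
)
print("Checkpoint files downloaded to:", local_dir)
```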

Changelog

  1. 10/04/2024 OmniFusion-1.1 weights are uploaded to Huggingface. Now the model can speak Russian :)
  2. 10/04/2024 Model training source code for OmniFusion-1.1 released
  3. 22/11/2023 OmniFusion weights are available on Huggingface

OmniFusion Architecture

There are two architecture options for the OmniFusion model: the first uses a single visual encoder (CLIP ViT-L/14), while the second uses two encoders (CLIP ViT-L/14 and DINO v2). OmniFusion connects the pre-trained visual encoders to a large language model through a trainable adapter.
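As a rough sketch of this layout (not the released implementation), the snippet below projects frozen CLIP ViT-L/14 patch features into the Mistral embedding space through a small trainable MLP adapter and prepends them to the text embeddings; the adapter shape and the omission of the special modality tokens are simplifications.

```python
# Illustrative sketch of the OmniFusion layout (frozen vision encoder -> trainable
# adapter -> LLM); module sizes and names are assumptions, not the authors' code.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

class VisualAdapter(nn.Module):
    """Projects visual patch features into the LLM embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096, hidden_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, features):          # (batch, patches, vision_dim)
        return self.proj(features)        # (batch, patches, llm_dim)

vision_encoder = AutoModel.from_pretrained("openai/clip-vit-large-patch14").vision_model
llm = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
adapter = VisualAdapter()

pixel_values = torch.randn(1, 3, 224, 224)          # stand-in for a preprocessed image
with torch.no_grad():
    patches = vision_encoder(pixel_values).last_hidden_state   # (1, 257, 1024)

visual_embeds = adapter(patches)                                # (1, 257, 4096)
input_ids = tokenizer("Describe the image.", return_tensors="pt").input_ids
text_embeds = llm.get_input_embeddings()(input_ids)
inputs_embeds = torch.cat([visual_embeds, text_embeds], dim=1)  # image tokens before text
logits = llm(inputs_embeds=inputs_embeds).logits
```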

We consider a two-stage instruction-tuning procedure:

  • Stage 1: Pre-training for Feature Alignment on image-caption pairs. Only the adapter is trained, using a subset of CC3M and COCO.
  • Stage 2: Fine-tuning End-to-End on multimodal dialogues. Both the adapter and the LLM are updated using a mix of instruction data consisting of two parts: Russian-language and English-language dialogues (a training-schedule sketch follows the table below). The dataset has the following structure:
    SFT Datasets

    | Task | Caption | VQA | WebQA | OCRQA | Conversation | DocVQA | Text-only SFT |
    |---|---|---|---|---|---|---|---|
    | Dataset source | ShareGPT4V | COCO, SAM-9K | WebData | TextVQA, OCRVQA | LLaVA-v1.5-665k, OCRVQA | Proprietary data (ru) | Proprietary data (ru), Alpaca (en) |
    | #Samples | 100K | 20K, 9K | 1.5K | 120K | 665K | 20K | 10K |
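Continuing the illustrative modules from the architecture sketch above, the following is a minimal (and assumed) version of this two-stage schedule: Stage 1 updates only the adapter, Stage 2 unfreezes the LLM and trains it jointly with the adapter. The optimizers and learning rates are placeholders, not the reported hyperparameters.

```python
# Assumed two-stage schedule, reusing `vision_encoder`, `adapter`, and `llm`
# from the architecture sketch above; hyperparameters are placeholders.
import torch

def set_trainable(module: torch.nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

# Stage 1: feature alignment on image-caption pairs; only the adapter is updated.
set_trainable(vision_encoder, False)
set_trainable(llm, False)
set_trainable(adapter, True)
stage1_opt = torch.optim.AdamW(adapter.parameters(), lr=1e-3)

# Stage 2: end-to-end SFT on multimodal dialogues (Russian and English);
# the LLM is unfrozen (the text describes gradual unfreezing, simplified here
# to a single step) and trained together with the adapter.
set_trainable(llm, True)
stage2_opt = torch.optim.AdamW(
    [{"params": adapter.parameters()}, {"params": llm.parameters()}],
    lr=2e-5,
)
```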

Performance

OmniFusion demonstrates performance competitive with the state of the art on a range of benchmarks as well as in subjective evaluation on downstream tasks.

Visual Dialog Performance

| Model | NDCG | MRR | Recall@1 | Recall@5 | Recall@10 |
|---|---|---|---|---|---|
| OmniFusion | 25.91 | 10.78 | 4.74 | 13.80 | 20.53 |
| LLaVA-13B | 24.74 | 8.91 | 2.98 | 10.80 | 18.02 |
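For context on the metrics above, MRR and Recall@k are derived from the rank of the ground-truth answer among the candidate answers for each dialog round (NDCG additionally uses dense relevance annotations and is omitted here). The snippet below is a generic illustration with made-up ranks, reported on a 0-100 scale to match the table; it is not the official Visual Dialog evaluation code.

```python
# Generic MRR and Recall@k from 1-based ground-truth answer ranks;
# not the official Visual Dialog evaluation script, and the ranks are made up.
def mrr(ranks):
    return 100.0 * sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_k(ranks, k):
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)

ranks = [1, 3, 12, 2, 7]   # hypothetical ranks of the correct answer per round
print(f"MRR: {mrr(ranks):.2f}, R@5: {recall_at_k(ranks, 5):.2f}")
```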

OmniFusion-1.1 scores

| Model | TextVQA | ScienceQA | POPE | GQA | OK-VQA |
|---|---|---|---|---|---|
| OmniFusion-1.1 (one encoder, Mistral) | 0.4893 | 0.6802 | 0.7818 | 0.4600 | 0.5187 |
| OmniFusion-1.1 (two encoders, Mistral) | 0.4755 | 0.6732 | 0.8153 | 0.4761 | 0.5317 |

OmniFusion-1.1 (with proprietary GigaChat LLM) results on various benchmarks:

OmniFusion-1.1 Examples

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Usage and License Notices: The data, code, and checkpoint are intended and licensed for research use only. They are also restricted to uses that follow the license agreement of CLIP.