We introduce OmniFusion, a multimodal model that connects pre-trained visual encoders to a large language model through a trainable adapter. There are two architecture options: the first uses one visual encoder (CLIP ViT-L/14), the second uses two encoders (CLIP ViT-L/14 and DINOv2).
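As a rough illustration of this wiring, the PyTorch sketch below projects frozen encoder features into the LLM token space with a trainable adapter. The class name `VisualAdapter`, the hidden sizes, the patch counts, and the concatenation-based fusion of the two encoders are illustrative assumptions, not the released OmniFusion code.

```python
import torch
import torch.nn as nn

class VisualAdapter(nn.Module):
    """Trainable MLP that maps frozen visual features into the LLM embedding space."""
    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vis_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (batch, num_patches, vis_dim) from a frozen encoder
        return self.proj(vis_tokens)  # (batch, num_patches, llm_dim)

# Two-encoder variant: concatenate CLIP and DINOv2 patch features along the
# channel dimension before projecting (one possible fusion; an assumption).
clip_feats = torch.randn(1, 576, 1024)   # e.g. CLIP ViT-L/14 patch features
dino_feats = torch.randn(1, 576, 1024)   # e.g. DINOv2 ViT-L patch features
adapter = VisualAdapter(vis_dim=2048, llm_dim=4096)
visual_embeds = adapter(torch.cat([clip_feats, dino_feats], dim=-1))
# visual_embeds is prepended to the text token embeddings of the LLM.
```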
We consider a two-stage instruction-tuning procedure; the instruction-tuning data mixture is summarized in the table below (a configuration sketch follows it):
Task | Caption | VQA | WebQA | OCRQA | Conversation | DocVQA | Text-only SFT |
---|---|---|---|---|---|---|---|
Dataset source | ShareGPT4V | COCO, SAM-9K | WebData | TextVQA, OCRVQA | LLaVA-v1.5-665k, OCRVQA | Proprietary data (ru) | Proprietary data (ru), Alpaca (en) |
#Samples | 100K | 20K, 9K | 1.5K | 120K | 665K | 20K | 10K |
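For concreteness, the dictionary below restates the table as a data-mixture configuration with per-task sampling weights proportional to sample counts. The structure and the `sampling_weights` helper are illustrative assumptions; only the dataset names and counts come from the table above.

```python
# Instruction-tuning mixture mirroring the table above (counts are approximate).
SFT_MIXTURE = {
    "caption":      {"sources": ["ShareGPT4V"],                     "samples": 100_000},
    "vqa":          {"sources": ["COCO", "SAM-9K"],                 "samples": 20_000 + 9_000},
    "webqa":        {"sources": ["WebData"],                        "samples": 1_500},
    "ocrqa":        {"sources": ["TextVQA", "OCRVQA"],              "samples": 120_000},
    "conversation": {"sources": ["LLaVA-v1.5-665k", "OCRVQA"],      "samples": 665_000},
    "docvqa":       {"sources": ["proprietary (ru)"],               "samples": 20_000},
    "text_sft":     {"sources": ["proprietary (ru)", "Alpaca (en)"], "samples": 10_000},
}

def sampling_weights(mixture: dict) -> dict:
    """Per-task sampling probabilities proportional to sample counts."""
    total = sum(task["samples"] for task in mixture.values())
    return {name: task["samples"] / total for name, task in mixture.items()}

print(sampling_weights(SFT_MIXTURE))
```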
OmniFusion demonstrates performance competitive with the state of the art on a range of benchmarks and in subjective evaluations on downstream tasks.
Model | NDCG | MRR | Recall@1 | Recall@5 | Recall@10 |
---|---|---|---|---|---|
OmniFusion | 25.91 | 10.78 | 4.74 | 13.80 | 20.53 |
LLaVA-13B | 24.74 | 8.91 | 2.98 | 10.80 | 18.02 |
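For reference, the sketch below computes the ranking metrics reported in this table, assuming one relevant item per query (binary relevance, so the ideal DCG is 1). This is a minimal illustration, not the official evaluation script; the table's scores are on a 0-100 scale.

```python
import math

def recall_at_k(ranks: list[int], k: int) -> float:
    """Fraction of queries whose relevant item appears in the top-k."""
    return sum(r <= k for r in ranks) / len(ranks)

def mrr(ranks: list[int]) -> float:
    """Mean reciprocal rank of the first relevant item."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def ndcg(ranks: list[int]) -> float:
    """Binary-relevance NDCG: with one relevant item, NDCG = 1 / log2(rank + 1)."""
    return sum(1.0 / math.log2(r + 1) for r in ranks) / len(ranks)

# ranks[i] is the 1-based position of the ground-truth item for query i.
ranks = [1, 3, 12, 2, 50]
print(recall_at_k(ranks, 5), mrr(ranks), ndcg(ranks))
```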
Model | TextVQA | ScienceQA | POPE | GQA | OK-VQA |
---|---|---|---|---|---|
OmniFusion-1.1 (one encoder, Mistral) | 0.4893 | 0.6802 | 0.7818 | 0.4600 | 0.5187 |
OmniFusion-1.1 (two encoders, Mistral) | 0.4755 | 0.6732 | 0.8153 | 0.4761 | 0.5317 |
Results for OmniFusion-1.1 (with the proprietary GigaChat LLM) on various benchmarks:
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreement of CLIP.