

Poster

Improved Baselines with Visual Instruction Tuning

Haotian Liu · Chunyuan Li · Yuheng Li · Yong Jae Lee

Arch 4A-E Poster #209
Highlight
[ Project Page ] [ Paper PDF ]
Fri 21 Jun 5 p.m. PDT — 6:30 p.m. PDT

Abstract:

Large multimodal models (LMMs) have recently shown encouraging progress with visual instruction tuning. In this paper, we present the first systematic study to investigate the design choices of LMMs in a controlled setting under the LLaVA framework. We show that the fully connected vision-language connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with response formatting prompts, we establish stronger baselines that achieve state-of-the-art results across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available samples and finishes full training in ~1 day on a single 8-A100 node. Furthermore, we present early explorations of open problems in LMMs, including scaling to higher-resolution inputs, compositional capabilities, and model hallucination. We hope this makes state-of-the-art LMM research more accessible. Code and models will be publicly available.
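As a rough illustration of the connector change the abstract describes (a sketch, not the authors' released implementation), the MLP projection maps vision-encoder patch features into the LLM's token-embedding space. The class name, the two-layer GELU design, and the dimensions (1024 for CLIP-ViT-L/14 features, 5120 for a 13B LLM) are assumptions used for illustration.

```python
import torch
import torch.nn as nn


class VisionLanguageConnector(nn.Module):
    """Sketch of an MLP vision-language connector: projects vision-encoder
    patch features into the LLM embedding space. Dimensions are assumed."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 5120):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from the vision encoder
        # returns:        (batch, num_patches, llm_dim) visual tokens for the LLM
        return self.proj(patch_features)


if __name__ == "__main__":
    connector = VisionLanguageConnector()
    dummy = torch.randn(2, 576, 1024)  # 576 = (336 / 14) ** 2 patches for ViT-L/14 at 336px
    print(connector(dummy).shape)      # torch.Size([2, 576, 5120])
```

Compared with a single linear layer, the extra hidden layer and nonlinearity give the connector more capacity to align visual features with the language model's embedding space while remaining cheap relative to the rest of the model.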
