

Poster

LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

Gongwei Chen · Leyang Shen · Rui Shao · Xiang Deng · Liqiang Nie

Arch 4A-E Poster #231
[ Paper PDF ] [ Poster ]
Fri 21 Jun 5 p.m. PDT — 6:30 p.m. PDT

Abstract: Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals. However, most existing MLLMs adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge. To address this issue, we devise a dual-$\textbf{L}$evel v$\textbf{I}$sual kn$\textbf{O}$wledge e$\textbf{N}$hanced Multimodal Large Language Model ($\textbf{LION}$), which empowers the MLLM by injecting visual knowledge at two levels. $\textbf{1) Progressive incorporation of fine-grained spatial-aware visual knowledge.}$ We design a vision aggregator that cooperates with region-level vision-language (VL) tasks to incorporate fine-grained spatial-aware visual knowledge into the MLLM. To alleviate the conflict between image-level and region-level VL tasks during incorporation, we devise a dedicated stage-wise instruction-tuning strategy with mixture-of-adapters. This progressive incorporation scheme promotes mutual benefit between these two kinds of VL tasks. $\textbf{2) Soft prompting of high-level semantic visual evidence.}$ We facilitate the MLLM with high-level semantic visual evidence by leveraging diverse image tags. To mitigate the potential influence of imperfectly predicted tags, we propose a soft prompting method that embeds a learnable token into the tailored text instruction. Comprehensive experiments on several multi-modal benchmarks demonstrate the superiority of our model ($\textit{e.g.}$, improvements of 5% accuracy on VSR and 3% CIDEr on TextCaps over InstructBLIP, and 5% accuracy on RefCOCOg over Kosmos-2).
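To make the soft-prompting idea concrete, the minimal PyTorch sketch below shows one plausible way to embed a learnable token alongside tag-derived text in the instruction sequence, so the model can learn how much to rely on possibly imperfect tags. All names (e.g., `SoftTagPrompt`) and the exact prompt layout are assumptions for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of soft prompting with a learnable token (not the official LION code).
import torch
import torch.nn as nn


class SoftTagPrompt(nn.Module):
    def __init__(self, embed_dim: int):
        super().__init__()
        # A single learnable token embedding, trained jointly during instruction tuning.
        self.soft_token = nn.Parameter(torch.randn(1, 1, embed_dim) * 0.02)

    def forward(self, tag_embeds: torch.Tensor, instr_embeds: torch.Tensor) -> torch.Tensor:
        # tag_embeds:   (B, T_tag, D) token embeddings of e.g. "Tags: dog, frisbee, park"
        # instr_embeds: (B, T_ins, D) token embeddings of the task instruction
        batch = tag_embeds.size(0)
        soft = self.soft_token.expand(batch, -1, -1)
        # Insert the learnable token before the tag text, then append the instruction tokens.
        return torch.cat([soft, tag_embeds, instr_embeds], dim=1)


if __name__ == "__main__":
    prompt = SoftTagPrompt(embed_dim=4096)
    tag_embeds = torch.randn(2, 8, 4096)
    instr_embeds = torch.randn(2, 32, 4096)
    print(prompt(tag_embeds, instr_embeds).shape)  # torch.Size([2, 41, 4096])
```

The resulting embedding sequence would be fed to the LLM in place of a plain text prompt; how the tag text is templated and where the learnable token sits are design choices described only at a high level in the abstract.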
