Poster
Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models
Haoning Wu · Zicheng Zhang · Erli Zhang · Chaofeng Chen · Liang Liao · Annan Wang · Kaixin Xu · Chunyi Li · Jingwen Hou · Guangtao Zhai · Xue Geng · Wenxiu Sun · Qiong Yan · Weisi Lin
Arch 4A-E Poster #131
Multi-modality large language models (MLLMs), as represented by GPT-4V, have introduced a paradigm shift for visual perception and understanding tasks, in which a wide variety of abilities can be achieved within a single foundation model. Although current MLLMs demonstrate preliminary low-level visual abilities, from identifying low-level visual attributes (e.g., clarity, brightness) to evaluating image quality, their accuracy must still improve substantially before they can meaningfully alleviate human burdens. To address this, we collect the first dataset of human natural-language feedback on low-level vision. Each feedback passage provides a comprehensive description of an image's low-level visual attributes and culminates in an overall quality assessment. The constructed Q-Pathway dataset contains 58K detailed human feedback passages on 18,973 multi-sourced images with diverse low-level appearances. To ensure that MLLMs can adeptly handle diverse queries, we further propose a GPT-assisted transformation that converts this feedback into a rich set of 200K instruction-response pairs, termed Q-Instruct. Experimental results indicate that Q-Instruct consistently elevates a range of low-level visual capabilities across multiple base models. We anticipate that our datasets can pave the way for a future in which foundation models assist humans with low-level visual tasks.
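For intuition only, the sketch below illustrates how the GPT-assisted conversion described above might be implemented: one human feedback passage is rewritten by an LLM into JSON-formatted instruction-response pairs. This is a minimal sketch under assumed interfaces, not the released Q-Instruct pipeline; `build_prompt`, `feedback_to_pairs`, and the `call_gpt` callable are hypothetical names.

```python
import json
from typing import Callable

# Hypothetical sketch of the GPT-assisted conversion step: one Q-Pathway-style
# feedback passage -> several instruction-response pairs for instruction tuning.
# `call_gpt` is a placeholder for whatever LLM endpoint is used; it is an
# assumption, not part of the Q-Instruct release.

def build_prompt(feedback: str) -> str:
    """Assemble a conversion prompt asking the LLM to rewrite the feedback."""
    return (
        "Below is a human description of an image's low-level visual attributes "
        "and its overall quality. Rewrite it into diverse instruction-response "
        "pairs (what-questions, yes/no questions, and open-ended descriptions). "
        "Return a JSON list of objects with 'instruction' and 'response' keys.\n\n"
        "Feedback: " + feedback
    )

def feedback_to_pairs(feedback: str, call_gpt: Callable[[str], str]) -> list[dict]:
    """Convert one feedback passage into a list of instruction-response pairs."""
    raw = call_gpt(build_prompt(feedback))
    pairs = json.loads(raw)
    # Keep only well-formed pairs so the downstream fine-tuning data stays clean.
    return [p for p in pairs if "instruction" in p and "response" in p]

if __name__ == "__main__":
    example = (
        "The image is slightly underexposed and the foreground is a bit blurry, "
        "but the colors look natural. Overall quality: fair."
    )
    # A dummy LLM stand-in so the sketch runs without contacting any service.
    dummy = lambda prompt: json.dumps([
        {"instruction": "How is the exposure of this image?",
         "response": "It is slightly underexposed."},
        {"instruction": "Is the foreground sharp?",
         "response": "No, the foreground is a bit blurry."},
    ])
    print(feedback_to_pairs(example, dummy))
```

The dummy callable in the `__main__` block only demonstrates the expected input/output shape of the conversion; any real run would replace it with an actual LLM call.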