

Poster

CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update

Zhi Gao · Yuntao Du · Xintong Zhang · Xiaojian Ma · Wenjuan Han · Song-Chun Zhu · Qing Li

Arch 4A-E Poster #349
[ Project Page ] [ Paper PDF ]
Thu 20 Jun 10:30 a.m. PDT — noon PDT

Abstract:

Leveraging large language models (LLMs) to integrate off-the-shelf tools (e.g., visual models and image processing functions) is a promising research direction for building powerful visual assistants that solve diverse visual tasks. However, existing methods rarely explore learning capability: they freeze their tools after deployment, which limits generalization to new environments that require specific knowledge. In this paper, we propose CLOVA, a Closed-LOop Visual Assistant that addresses this limitation through inference, reflection, and learning phases in a closed-loop framework. During inference, LLMs generate programs and execute the corresponding tools to accomplish given tasks. The reflection phase introduces a multimodal global-local reflection scheme to analyze whether, and which, tool needs to be updated based on environmental feedback. Finally, the learning phase collects training data in real time in three flexible ways and introduces a novel prompt tuning scheme to update the tools, enabling CLOVA to efficiently learn specific knowledge for new environments without human involvement. Experiments show that CLOVA outperforms tool-usage methods by 5% on visual question answering and multiple-image reasoning tasks, by 10% on knowledge tagging tasks, and by 20% on image editing tasks, highlighting the significance of learning capability for general visual assistants.
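To make the closed-loop structure in the abstract concrete, the Python sketch below illustrates how the three phases might compose. It is a minimal illustration under assumed interfaces, not the authors' implementation: every name in it (solve, plan, reflect, update, Feedback) is a hypothetical stand-in for the paper's LLM planner, global-local reflection scheme, and prompt-tuning update.

    # Minimal sketch of a closed-loop visual assistant. All names are
    # hypothetical stand-ins, not the CLOVA API; the real phases involve an
    # LLM planner, visual tools, and prompt tuning rather than these stubs.
    from dataclasses import dataclass
    from typing import Any, Callable, Dict, List, Tuple


    @dataclass
    class Feedback:
        success: bool
        faulty_tool: str = ""  # which tool the reflection phase blames


    def solve(task: str,
              plan: Callable[[str], List[Tuple[str, tuple]]],  # inference: LLM writes a program
              tools: Dict[str, Callable[..., Any]],            # e.g., VQA model, editing op
              reflect: Callable[[str, Any], Feedback],         # global-local reflection
              update: Callable[[str], None],                   # learning: update one tool
              max_rounds: int = 3) -> Any:
        """Run the inference -> reflection -> learning loop until feedback is positive."""
        result = None
        for _ in range(max_rounds):
            # Inference: execute the generated program step by step with current tools.
            for name, args in plan(task):
                result = tools[name](*args)

            # Reflection: analyze environment feedback to decide whether, and
            # which, tool needs updating.
            fb = reflect(task, result)
            if fb.success:
                break

            # Learning: update the faulty tool (in the paper, via freshly
            # collected data and prompt tuning), then retry the task.
            update(fb.faulty_tool)
        return result

The design point the loop captures is that only the tool blamed by reflection is updated between rounds, so the assistant adapts to a new environment without retraining the whole tool set.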
