

Poster

On Scaling Up a Multilingual Vision and Language Model

Xi Chen · Josip Djolonga · Piotr Padlewski · Basil Mustafa · Soravit Changpinyo · Jialin Wu · Carlos Riquelme Ruiz · Sebastian Goodman · Xiao Wang · Yi Tay · Siamak Shakeri · Mostafa Dehghani · Daniel Salz · Mario Lučić · Michael Tschannen · Arsha Nagrani · Hexiang Hu · Mandar Joshi · Bo Pang · Ceslee Montgomery · Paulina Pietrzyk · Marvin Ritter · AJ Piergiovanni · Matthias Minderer · Filip Pavetic · Austin Waters · Gang Li · Ibrahim Alabdulmohsin · Lucas Beyer · Julien Amelot · Kenton Lee · Andreas Steiner · Yang Li · Daniel Keysers · Anurag Arnab · Yuanzhong Xu · Keran Rong · Alexander Kolesnikov · Mojtaba Seyedhosseini · Anelia Angelova · Xiaohua Zhai · Neil Houlsby · Radu Soricut

Arch 4A-E Poster #462
[ Paper PDF ]
Thu 20 Jun 10:30 a.m. PDT — noon PDT

Abstract:

We explore the boundaries of scaling up a multilingual vision and language model, both in terms of the size of its components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding, few-shot (in-context) learning, object detection, video question answering, and video captioning. Our model advances the state of the art on most of the 20+ vision-and-language benchmarks considered. Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly present in the training mix.
