

Poster

CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation

Zineng Tang · Ziyi Yang · MAHMOUD KHADEMI · Yang Liu · Chenguang Zhu · Mohit Bansal

Arch 4A-E Poster #314
Highlight
[ Paper PDF ]
Fri 21 Jun 5 p.m. PDT — 6:30 p.m. PDT

Abstract:

We present CoDi-2, a Multimodal Large Language Model (MLLM) for learning in-context interleaved multimodal representations. By aligning modalities with language for both encoding and generation, CoDi-2 empowers Large Language Models (LLMs) to understand modality-interleaved instructions and in-context examples and to autoregressively generate grounded and coherent multimodal outputs in an any-to-any input-output modality paradigm. To train CoDi-2, we build a large-scale generation dataset encompassing in-context multimodal instructions across text, vision, and audio. CoDi-2 demonstrates a wide range of zero-shot and few-shot capabilities for tasks such as editing, exemplar learning, composition, and reasoning. CoDi-2 surpasses previous domain-specific models on tasks such as subject-driven image generation, vision transformation, and audio editing, showcasing a significant advance in integrating diverse multimodal tasks with sequential generation.
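To make the abstract's core idea concrete, the toy sketch below (not the authors' released code) illustrates how modality-interleaved in-context prompts could be flattened into a single language-aligned embedding sequence for an autoregressive backbone. All class and function names here (`ImageInput`, `AudioInput`, `build_interleaved_prompt`, the toy encoders) are illustrative assumptions rather than the CoDi-2 API.

```python
# Hypothetical sketch: interleaved text/image/audio segments mapped into one
# embedding sequence, as an autoregressive any-to-any model might consume.
from dataclasses import dataclass
from typing import List, Union

EMB_DIM = 8  # toy embedding width for illustration


@dataclass
class ImageInput:
    pixels: List[float]    # stand-in for raw image features


@dataclass
class AudioInput:
    waveform: List[float]  # stand-in for raw audio features


Segment = Union[str, ImageInput, AudioInput]


def encode_text(text: str) -> List[List[float]]:
    # Toy per-character "embedding"; a real system would use an LLM tokenizer.
    return [[float(ord(c) % 97) / 97.0] * EMB_DIM for c in text]


def encode_image(img: ImageInput) -> List[List[float]]:
    # Stand-in for a language-aligned image encoder.
    mean = sum(img.pixels) / max(len(img.pixels), 1)
    return [[mean] * EMB_DIM]


def encode_audio(aud: AudioInput) -> List[List[float]]:
    # Stand-in for a language-aligned audio encoder.
    mean = sum(aud.waveform) / max(len(aud.waveform), 1)
    return [[mean] * EMB_DIM]


def build_interleaved_prompt(segments: List[Segment]) -> List[List[float]]:
    """Flatten interleaved segments into one embedding sequence."""
    sequence: List[List[float]] = []
    for seg in segments:
        if isinstance(seg, str):
            sequence.extend(encode_text(seg))
        elif isinstance(seg, ImageInput):
            sequence.extend(encode_image(seg))
        elif isinstance(seg, AudioInput):
            sequence.extend(encode_audio(seg))
    return sequence


if __name__ == "__main__":
    # An in-context exemplar (instruction + before/after images) followed by
    # a new query image, all in one interleaved sequence.
    prompt: List[Segment] = [
        "Make the subject wear sunglasses: ",
        ImageInput(pixels=[0.2, 0.4, 0.6]),
        " -> ",
        ImageInput(pixels=[0.3, 0.5, 0.7]),
        " Now do the same for: ",
        ImageInput(pixels=[0.1, 0.9, 0.5]),
    ]
    embeddings = build_interleaved_prompt(prompt)
    print(f"{len(embeddings)} embedding vectors of width {EMB_DIM}")
```

In the actual system described by the abstract, the per-modality encoders would be learned and aligned with the language space, and the backbone would also generate non-text outputs; this sketch only shows the interleaved-prompt framing on the input side.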
