CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation
Format: Conference Proceeding
Language: English
Summary: We present CoDi-2, a Multimodal Large Language Model (MLLM) for learning in-context interleaved multimodal representations. By aligning modalities with language for both encoding and generation, CoDi-2 empowers Large Language Models (LLMs) to understand modality-interleaved instructions and in-context examples, and to autoregressively generate grounded and coherent multimodal outputs in an any-to-any input-output modality paradigm. To train CoDi-2, we build a large-scale generation dataset encompassing in-context multimodal instructions across text, vision, and audio. CoDi-2 demonstrates a wide range of zero-shot and few-shot capabilities for tasks such as editing, exemplar learning, composition, and reasoning. It surpasses previous domain-specific models on tasks such as subject-driven image generation, vision transformation, and audio editing, and it showcases a significant advancement in integrating diverse multimodal tasks with sequential generation.
ISSN: 2575-7075
DOI: 10.1109/CVPR52733.2024.02589
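The summary describes an any-to-any pipeline in which modality encoders project inputs into the LLM's language space, the LLM generates autoregressively over the interleaved context, and outputs are routed to modality-specific decoders. The sketch below is not the paper's implementation; every class, function, and file name here is a hypothetical stand-in, meant only to make the interleaved, in-context input-output flow concrete.

```python
# Illustrative sketch only: a minimal any-to-any loop in the spirit of the
# summary above (align modalities with language for encoding, generate
# autoregressively with an LLM, route outputs to modality decoders).
# All names are hypothetical and do not reflect the CoDi-2 codebase.

from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    modality: str   # "text" | "image" | "audio"
    payload: object  # raw text, or a placeholder for pixels / waveform


def encode(segment: Segment) -> List[float]:
    """Hypothetical aligned encoder: map any modality into the LLM's
    language-feature space (here, just a toy two-dimensional vector)."""
    modality_id = {"text": 0.0, "image": 1.0, "audio": 2.0}[segment.modality]
    return [float(len(str(segment.payload))), modality_id]


def llm_step(context: List[List[float]]) -> Segment:
    """Stand-in for autoregressive generation over interleaved features.
    A real model would emit text tokens or continuous features that a
    generative decoder (e.g., a diffusion model) could consume; this stub
    simply returns a text segment."""
    return Segment("text", f"<generated from {len(context)} context features>")


def decode(segment: Segment) -> str:
    """Hypothetical modality router: text is returned directly; image or
    audio features would be handed to their respective decoders."""
    return f"[{segment.modality}] {segment.payload}"


if __name__ == "__main__":
    # Interleaved, in-context prompt: one exemplar input/output pair plus a
    # new query, mirroring the few-shot usage the summary describes.
    prompt = [
        Segment("image", "exemplar_input.png"),
        Segment("image", "exemplar_edited.png"),
        Segment("text", "Apply the same edit to the next image."),
        Segment("image", "query.png"),
    ]
    context = [encode(s) for s in prompt]
    output = llm_step(context)
    print(decode(output))
```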