VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

Yecheng Wu*, Zhuoyang Zhang*, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, Song Han, Yao Lu
Tsinghua University, MIT, NVIDIA, UC Berkeley, UC San Diego
(* indicates equal contribution)

Abstract

VILA-U is a Unified foundation model that integrates Video, Image, and Language understanding and generation. Traditional visual language models (VLMs) use separate modules for understanding and generating visual content, which can lead to misalignment and increased complexity. In contrast, VILA-U employs a single autoregressive next-token prediction framework for both tasks, eliminating the need for additional components such as diffusion models. This approach not only simplifies the model but also achieves near state-of-the-art performance in visual language understanding and generation. The success of VILA-U is attributed to two main factors: (1) the unified vision tower aligns discrete visual tokens with textual inputs during pretraining, which enhances visual perception, and (2) autoregressive image generation can reach quality comparable to diffusion models when trained on a high-quality dataset. Together, these allow VILA-U to perform comparably to more complex models while using a fully token-based autoregressive framework.

Method

Multi-token in, Multi-token out Training and Inference

Visual inputs are tokenized into discrete tokens and concatenated with textual tokens to form a multi-modal token sequence. All tokens participate in the next-token prediction process, enabling a unified training objective. During inference, output tokens are decoded into multi-modal content by either the text detokenizer or the vision tower decoder.
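
A minimal PyTorch-style sketch of this unified next-token objective is given below. The names llm, text_tokenizer, and vision_tower.encode_to_ids are illustrative placeholders under assumed interfaces, not the actual VILA-U API.

# Sketch of the unified next-token-prediction objective over a multi-modal
# token sequence. All module names are hypothetical placeholders.
import torch
import torch.nn.functional as F

def unified_next_token_loss(llm, text_tokenizer, vision_tower, image, text):
    # Tokenize both modalities into discrete token ids.
    text_ids = text_tokenizer(text)                 # (T_text,) token ids
    vision_ids = vision_tower.encode_to_ids(image)  # (T_img,) discrete visual codes

    # Concatenate into one multi-modal sequence (ordering is illustrative).
    tokens = torch.cat([text_ids, vision_ids], dim=0).unsqueeze(0)  # (1, T)

    # Standard autoregressive shift: predict token t+1 from tokens up to t.
    logits = llm(tokens[:, :-1])                    # (1, T-1, vocab_size)
    targets = tokens[:, 1:]                         # (1, T-1)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))

At inference time the same sequence is extended token by token, and the generated ids are routed to the text detokenizer or the vision tower decoder depending on modality.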

Unified Foundation Vision Tower

Given input images, the features extracted by the vision encoder are discretized with residual quantization. The discrete vision features are then fed into the vision decoder to reconstruct images and, in parallel, used for text-image alignment. During this process, a reconstruction loss and a contrastive loss are computed to update the vision tower, enabling it to produce discrete visual features that are aligned with text.
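
A rough sketch of this combined objective is shown below: residual-quantized features are decoded back to pixels for a reconstruction loss and pooled for a CLIP-style contrastive loss against text embeddings. The modules (encoder, rq_quantizer, decoder, text_encoder), the loss weights, and the temperature are assumptions for illustration, not VILA-U's exact implementation.

# Sketch of the unified vision-tower training objective.
# All module names and hyperparameters are hypothetical placeholders.
import torch
import torch.nn.functional as F

def vision_tower_loss(encoder, rq_quantizer, decoder, text_encoder,
                      images, texts, alpha=1.0, beta=1.0):
    feats = encoder(images)                          # (B, N, D) continuous features
    quant, codes, vq_loss = rq_quantizer(feats)      # residual quantization

    # Reconstruction branch: decode discrete features back to pixels.
    recon = decoder(quant)                           # (B, C, H, W)
    recon_loss = F.mse_loss(recon, images)

    # Alignment branch: contrastive loss against text features.
    img_emb = F.normalize(quant.mean(dim=1), dim=-1) # (B, D) pooled image embedding
    txt_emb = F.normalize(text_encoder(texts), dim=-1)
    logits = img_emb @ txt_emb.t() / 0.07            # temperature is illustrative
    labels = torch.arange(images.size(0), device=images.device)
    contrastive = (F.cross_entropy(logits, labels) +
                   F.cross_entropy(logits.t(), labels)) / 2

    return alpha * recon_loss + beta * contrastive + vq_loss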

Results

Image-based Visual Language Benchmarks
Video-based Visual Language Benchmarks
FID on MJHQ30K

Demo

Image Understanding
Video Understanding
Image and Video Generation

Video

Citation

@article{wu2024vila,
 title={Vila-u: a unified foundation model integrating visual understanding and generation},
 author={Wu, Yecheng and Zhang, Zhuoyang and Chen, Junyu and Tang, Haotian and Li, Dacheng and Fang, Yunhao and Zhu, Ligeng and Xie, Enze and Yin, Hongxu and Yi, Li and others},
 journal={arXiv preprint arXiv:2409.04429},
 year={2024}
}

Acknowledgment

Team Members