Efficient AI Computing,
Transforming the Future.

Projects


DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

ICLR 2025

By selectively applying full attention to critical attention heads and using "Streaming Attention" on others, DuoAttention significantly reduces both pre-filling and decoding memory usage and latency for long-context LLMs, while maintaining their long-context capabilities.
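The core idea, restricting some heads to a few initial "sink" tokens plus a recent window while others keep full causal attention, can be sketched as a per-head mask. This is an illustrative sketch only; the parameter names (`sink`, `window`) and values are assumptions, not DuoAttention's actual API.

```python
import numpy as np

def attention_mask(seq_len, head_is_retrieval, sink=4, window=256):
    """Causal attention mask for one head.

    Retrieval heads attend to all past tokens (full causal attention);
    streaming heads attend only to the first `sink` tokens and a recent
    window, which caps their KV-cache size. Illustrative sketch only.
    """
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    if head_is_retrieval:
        return causal
    idx = np.arange(seq_len)
    sink_cols = idx[None, :] < sink                  # always-kept initial tokens
    recent = (idx[:, None] - idx[None, :]) < window  # sliding recent window
    return causal & (sink_cols | recent)
```

Because a streaming head's mask only ever references `sink + window` keys per query, its KV cache stays constant-size regardless of context length.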

Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models

ICLR 2025

A new family of high spatial-compression autoencoders for accelerating high-resolution diffusion models.
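The payoff of higher spatial compression is a smaller latent grid, and therefore far fewer tokens for the diffusion model to process. A minimal sketch of that arithmetic, where the compression factor `f=32` and channel count `c=32` are illustrative assumptions rather than DC-AE's actual configuration:

```python
def latent_shape(h, w, f=32, c=32):
    """Latent spatial size for an autoencoder with spatial compression
    factor f and c latent channels (f and c are assumed example values)."""
    assert h % f == 0 and w % f == 0, "resolution must be divisible by f"
    return (h // f, w // f, c)

def diffusion_tokens(h, w, f):
    """Number of spatial positions the diffusion model must attend over."""
    return (h // f) * (w // f)
```

For a 1024x1024 image, moving from a conventional f=8 autoencoder to f=32 shrinks the latent from 128x128 to 32x32, a 16x reduction in token count.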

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

ICLR 2025

LongVILA is a full-stack solution for long-context visual language models (VLMs), incorporating novel training strategies, datasets, and the Multi-Modal Sequence Parallelism (MM-SP) system to efficiently handle long video understanding, achieving significant scalability, accuracy, and speed improvements on multi-modal benchmarks.
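The basic idea behind sequence parallelism is to shard one very long token sequence into contiguous chunks, one per device, so no single GPU holds the whole context. A minimal sketch of that sharding step; MM-SP's actual partitioning and load balancing are more involved, and this helper is purely illustrative:

```python
def shard_sequence(tokens, world_size):
    """Split a token sequence into `world_size` contiguous, near-equal
    chunks, one per device. Earlier ranks absorb the remainder so chunk
    sizes differ by at most one. Illustrative sketch of sequence
    parallelism, not the MM-SP implementation."""
    n = len(tokens)
    base, rem = divmod(n, world_size)
    shards, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < rem else 0)
        shards.append(tokens[start:start + size])
        start += size
    return shards
```

Each device then runs attention over its shard, exchanging keys/values (or attention statistics) with peers as needed, which is what lets context length scale with the number of GPUs.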

VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

ICLR 2025

VILA-U is a Unified foundation model that integrates Video, Image, and Language understanding and generation.