Projects

DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer

ICCV 2025

Yecheng Wu, Junyu Chen, Zhuoyang Zhang, Enze Xie, Jincheng Yu, Junsong Chen, Jinyi Hu, Yao Lu, Song Han, Han Cai

We introduce DC-AR, a novel masked autoregressive (AR) text-to-image generation framework that delivers superior image generation quality with exceptional computational efficiency. Because of tokenizer limitations, prior masked AR models have lagged behind diffusion models in quality or efficiency. We overcome this limitation by introducing DC-HT, a deep compression hybrid tokenizer for AR models that achieves a 32x spatial compression ratio while maintaining high reconstruction fidelity and cross-resolution generalization ability. Building upon DC-HT, we extend MaskGIT and create a new hybrid masked autoregressive image generation framework that first produces the structural elements through discrete tokens and then applies refinements via residual tokens. DC-AR achieves state-of-the-art results with a gFID of 5.49 on MJHQ-30K and an overall score of 0.69 on GenEval, while offering 1.5-7.9x higher throughput and 2.0-3.5x lower latency compared to prior leading diffusion and masked autoregressive models.
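To make the 32x spatial compression ratio concrete, the sketch below counts latent tokens for an image. The 512x512 resolution and the 16x comparison ratio are illustrative assumptions, not figures from the abstract, and `token_grid` is a hypothetical helper.

```python
def token_grid(height, width, compression):
    """Return (rows, cols, total) latent tokens for a spatial compression ratio."""
    if height % compression or width % compression:
        raise ValueError("sketch assumes both sides are divisible by the ratio")
    rows, cols = height // compression, width // compression
    return rows, cols, rows * cols


# Assumed example resolution; the 16x ratio is only a point of comparison.
dc_ht = token_grid(512, 512, 32)       # (16, 16, 256)
comparison = token_grid(512, 512, 16)  # (32, 32, 1024)
```

Doubling the per-side compression ratio cuts the token count by 4x, which is why a higher-ratio tokenizer can shorten the sequence an AR model has to generate.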

DC-AR is a high-efficiency masked AR framework for text-to-image generation, built on DC-HT, a hybrid tokenizer with 32x spatial compression. It generates image structure with discrete tokens and refines it via residual tokens, achieving 1.5-7.9x higher throughput and 2.0-3.5x lower latency than prior leading diffusion and masked AR models.

XAttention: Block Sparse Attention with Antidiagonal Scoring

ICML 2025

Ruyi Xu*, Guangxuan Xiao*, Haofeng Huang, Junxian Guo, Song Han

Long-Context Transformer Models (LCTMs) are vital for real-world applications but suffer high computational costs due to attention's quadratic complexity. Block-sparse attention mitigates this by focusing computation on critical regions, yet existing methods struggle to balance accuracy and efficiency because measuring block importance is costly. In this paper, we introduce XAttention, a plug-and-play framework that dramatically accelerates long-context inference in Transformer models using sparse attention. XAttention's key innovation is the insight that the sum of antidiagonal values (i.e., from the lower-left to upper-right) in the attention matrix provides a powerful proxy for block importance. This allows for precise identification and pruning of non-essential blocks, resulting in high sparsity and dramatically accelerated inference. Across comprehensive evaluations on demanding long-context benchmarks (RULER and LongBench for language, VideoMME for video understanding, and VBench for video generation), XAttention achieves accuracy comparable to full attention while delivering substantial computational gains. We demonstrate up to 13.5x acceleration in attention computation. These results underscore XAttention's ability to unlock the practical potential of block-sparse attention, paving the way for scalable and efficient deployment of LCTMs in real-world applications.
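The antidiagonal idea can be illustrated with a minimal sketch. It sums only the main antidiagonal of each tile and keeps a fixed number of blocks per block row; both choices, along with the function names, are assumptions made for illustration, and the paper's actual scoring and selection rule may differ.

```python
def antidiagonal_block_scores(scores, block):
    """Sum the main antidiagonal of every block x block tile of a square matrix."""
    n = len(scores)
    assert n % block == 0, "sketch assumes n is a multiple of the block size"
    nb = n // block
    out = [[0.0] * nb for _ in range(nb)]
    for bi in range(nb):
        for bj in range(nb):
            r0, c0 = bi * block, bj * block
            # Walk from the tile's lower-left corner to its upper-right corner.
            out[bi][bj] = sum(
                scores[r0 + block - 1 - k][c0 + k] for k in range(block)
            )
    return out


def select_blocks(block_scores, keep):
    """For each block row, return the column indices of the `keep` top tiles."""
    selected = []
    for row in block_scores:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        selected.append(sorted(order[:keep]))
    return selected


# Toy 4x4 score matrix split into 2x2 tiles (values are made up).
toy = [
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.0],
    [0.1, 0.1, 0.5, 0.3],
    [0.0, 0.1, 0.2, 0.7],
]
block_scores = antidiagonal_block_scores(toy, block=2)
kept = select_blocks(block_scores, keep=1)  # kept == [[0], [1]]
```

Each tile is scored from `block` entries rather than `block * block`, which is what makes the importance estimate cheap relative to computing the full tile.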

A plug-and-play method that uses antidiagonal sums to efficiently identify important parts of the attention matrix, achieving up to 13.5x speedup on long-context tasks with comparable accuracy to full attention.

Radial Attention: O(n log n) Sparse Attention with Energy Decay for Long Video Generation

ArXiv 2025

Xingyang Li*, Muyang Li*, Tianle Cai, Haocheng Xi, Shuo Yang, Yujun Lin, Lvmin Zhang, Songlin Yang, Jinbo Hu, Kelly Peng, Maneesh Agrawala, Ion Stoica, Kurt Keutzer, Song Han

Recent advances in diffusion models have enabled high-quality video generation, but the additional temporal dimension significantly increases computational costs, making training and inference on long videos prohibitively expensive. In this paper, we identify a phenomenon we term Spatiotemporal Energy Decay in video diffusion models: post-softmax attention scores diminish as the spatial and temporal distance between tokens increases, akin to the physical decay of signals or waves over space and time in nature. Motivated by this, we propose Radial Attention, a scalable sparse attention mechanism with O(n log n) complexity that translates energy decay into exponentially decaying compute density, which is significantly more efficient than standard O(n^2) dense attention and more expressive than linear attention. Specifically, Radial Attention employs a simple, static attention mask where each token attends to spatially nearby tokens, with the attention window size shrinking with temporal distance. Moreover, it allows pre-trained video diffusion models to extend their generation length with efficient LoRA-based fine-tuning. Extensive experiments show that Radial Attention maintains video quality across Wan2.1-14B, HunyuanVideo, and Mochi 1, achieving up to a 1.9× speedup over the original dense attention. With minimal tuning, it enables video generation up to 4× longer while reducing training costs by up to 4.4× compared to direct fine-tuning and accelerating inference by up to 3.7× compared to dense attention inference.
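A static mask of this kind can be sketched as follows. The specific decay rule (the spatial window halves each time the temporal distance doubles), the 1D layout of positions within a frame, and the name `radial_mask` are assumptions for illustration; the paper's mask may be constructed differently.

```python
def radial_mask(num_frames, tokens_per_frame, base_window):
    """Build a static boolean attention mask over (frame, position) tokens.

    Assumption for this sketch: the spatial window halves each time the
    temporal distance doubles. A window of 0 keeps only the same position.
    """
    n = num_frames * tokens_per_frame
    mask = [[False] * n for _ in range(n)]
    for qi in range(n):
        q_frame, q_pos = divmod(qi, tokens_per_frame)
        for ki in range(n):
            k_frame, k_pos = divmod(ki, tokens_per_frame)
            dt = abs(q_frame - k_frame)
            # dt = 0 -> full window, dt = 1 -> half, dt in 2..3 -> quarter, ...
            window = base_window >> dt.bit_length()
            mask[qi][ki] = abs(q_pos - k_pos) <= window
    return mask


mask = radial_mask(num_frames=4, tokens_per_frame=8, base_window=4)
# Fraction of query-key pairs that are actually computed.
density = sum(map(sum, mask)) / (len(mask) ** 2)
```

Because the mask depends only on token positions, it can be built once and reused across layers and denoising steps, unlike dynamic sparsity patterns that must be recomputed per input.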

An O(n log n) Sparse Attention Mask for Long Video Generation

LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

MLSys 2025

Shang Yang*¹, Junxian Guo*¹³, Haotian Tang¹, Qinghao Hu¹, Guangxuan Xiao¹, Jiaming Tang¹, Yujun Lin¹, Zhijian Liu², Yao Lu², Song Han¹²

Large language models (LLMs) have shown remarkable potential in processing long sequences, yet efficiently serving these long-context models remains challenging due to the quadratic computational complexity of attention in the prefilling stage and the large memory footprint of the KV cache in the decoding stage. To address these issues, we introduce LServe, an efficient system that accelerates long-sequence LLM serving via unified sparse attention. This method unifies different hardware-friendly, structured sparsity patterns for both prefilling and decoding attention into a single framework, where computations on less important tokens are skipped block-wise. LServe demonstrates that static and dynamic sparsity are compatible in long-context LLM attention, so combining the two yields multiplicative speedups. Specifically, we convert half of the attention heads to nearly free streaming heads in both the prefilling and decoding stages. Additionally, we find that only a constant number of KV pages is required to preserve long-context capabilities, irrespective of context length. We then design a hierarchical KV page selection policy that dynamically prunes KV pages based on query-centric similarity. For Llama-3-8B, LServe accelerates LLM prefilling by an average of 2.4x and decoding by up to 3.3x over state-of-the-art serving systems while maintaining long-context accuracy.
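Query-centric page selection under a fixed page budget can be sketched as below. The per-channel min/max page summary and the upper-bound score are one plausible heuristic assumed here for illustration; the abstract does not specify LServe's scoring function, and this sketch is flat rather than hierarchical. All function names are hypothetical.

```python
def page_summaries(keys, page_size):
    """Split keys (a list of vectors) into pages; keep per-channel min and max."""
    pages = []
    for start in range(0, len(keys), page_size):
        chunk = keys[start:start + page_size]
        mins = [min(col) for col in zip(*chunk)]
        maxs = [max(col) for col in zip(*chunk)]
        pages.append((mins, maxs))
    return pages


def page_upper_bound(query, mins, maxs):
    """Upper bound on q . k for any key whose channels lie within [mins, maxs]."""
    return sum(max(q * lo, q * hi) for q, lo, hi in zip(query, mins, maxs))


def select_pages(query, pages, budget):
    """Return the indices of the `budget` pages with the highest upper bound."""
    ranked = sorted(
        range(len(pages)),
        key=lambda i: page_upper_bound(query, *pages[i]),
        reverse=True,
    )
    return sorted(ranked[:budget])


# Toy 2-channel keys, two keys per page (values are made up).
keys = [[1.0, 0.0], [0.5, 0.0], [0.0, 1.0], [0.0, 2.0], [-1.0, 0.0], [-2.0, 0.0]]
pages = page_summaries(keys, page_size=2)
chosen = select_pages([1.0, 0.0], pages, budget=2)  # chosen == [0, 1]
```

With `budget` held constant, the attention cost per decoding step stays fixed as the context grows, and only the cheap summary scan scales with the number of pages.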

LServe accelerates long-sequence LLM serving with unified sparse attention for both prefilling and decoding, achieving up to a 3.3× decoding speedup over state-of-the-art serving systems while maintaining long-context accuracy.

Efficient AI Computing, Transforming the Future.
