XAttention: Block Sparse Attention with Antidiagonal Scoring

Ruyi Xu*, Guangxuan Xiao*, Haofeng Huang, Junxian Guo, Song Han
Massachusetts Institute of Technology, Tsinghua University, SJTU, NVIDIA
(* indicates equal contribution)

Abstract

Long-Context Transformer Models (LCTMs) are vital for real-world applications but suffer high computational costs due to attention's quadratic complexity. Block-sparse attention mitigates this by focusing computation on critical regions, yet existing methods struggle to balance accuracy and efficiency due to costly block importance measurements. In this paper, we introduce XAttention, a plug-and-play framework that dramatically accelerates long-context inference in Transformer models using sparse attention. XAttention's key innovation is the insight that the sum of antidiagonal values (i.e., from the lower-left to upper-right) in the attention matrix provides a powerful proxy for block importance. This allows for precise identification and pruning of non-essential blocks, resulting in high sparsity and dramatically accelerated inference. Across comprehensive evaluations on demanding long-context benchmarks—including RULER and LongBench for language, VideoMME for video understanding, and VBench for video generation—XAttention achieves accuracy comparable to full attention while delivering substantial computational gains. We demonstrate up to 13.5x acceleration in attention computation. These results underscore XAttention's ability to unlock the practical potential of block sparse attention, paving the way for scalable and efficient deployment of LCTMs in real-world applications.

1. Highlights

  • Method for detecting important sparse-attention blocks  
  • Up to 13.5× acceleration in pre-fill attention computation  
  • Applicable to text understanding, video generation, and video understanding

2. Method

Core Idea: Antidiagonal Scoring for Importance Prediction

Within each block of size B, we select elements along the antidiagonal using a stride S to predict the importance of the whole block. This design satisfies the following two objectives:

Information Preservation: Ensure that information from all tokens is considered, as each token contributes to at least one antidiagonal sum.

Pattern Detection: The antidiagonal intersects every possible vertical and slash pattern within a block.
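
The PyTorch sketch below illustrates one plausible reading of this scoring and of threshold-based block selection. The helper names antidiagonal_block_scores and select_blocks, the (i + j) % S antidiagonal pattern, and the default block size of 128 are our assumptions; this is not the released kernel.

import torch


def antidiagonal_block_scores(q_blk, k, block_size=128, stride=16):
    # Score each key block for one query block by summing attention logits
    # over strided antidiagonal entries (i, j) with (i + j) % stride == 0,
    # so every query row contributes and every vertical / slash pattern
    # inside a block is intersected.
    B, d = q_blk.shape
    n_blocks = k.shape[0] // block_size
    logits = (q_blk @ k[: n_blocks * block_size].T) / d ** 0.5  # (B, n_blocks * block_size)
    logits = logits.view(B, n_blocks, block_size)

    i = torch.arange(B).view(B, 1)
    j = torch.arange(block_size).view(1, block_size)
    mask = ((i + j) % stride == 0).to(logits.dtype)             # (B, block_size)

    return (logits * mask.unsqueeze(1)).sum(dim=(0, 2))         # one score per key block


def select_blocks(block_scores, tau=0.9):
    # Keep the smallest set of key blocks whose softmax-normalized
    # importance mass reaches tau; the remaining blocks are pruned.
    probs = torch.softmax(block_scores, dim=-1)
    order = torch.argsort(probs, descending=True)
    cum = torch.cumsum(probs[order], dim=0)
    n_keep = int((cum < tau).sum().item()) + 1
    return order[:n_keep]

The retained block indices would then drive a block-sparse attention kernel that skips the pruned blocks entirely.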

3. Results

We evaluate XAttention across diverse long-context benchmarks and models, demonstrating consistently lower pre-fill latency than existing sparse-attention methods, without compromising accuracy.

3.1 Text Understanding

  • RULER: We compare the accuracy of XAttention against strong baselines on the Llama-3.1-8B-Instruct model across various sequence lengths on RULER.
  • LongBench: Maintaining the same configuration used for the RULER evaluation, we evaluate XAttention alongside MInference and FlexPrefill.

3.2 Video Understanding

We apply a stride of S = 16 and a threshold of τ = 0.9 on the QwenVL-2-7B model.
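
As a concrete illustration, these settings would plug into the hypothetical sketch from Section 2 as follows (assumed helper names and an example block size of 128; not the released code):

import torch

torch.manual_seed(0)
q_block = torch.randn(128, 64)   # one query block: 128 tokens, head dim 64 (example sizes)
keys = torch.randn(8 * 128, 64)  # 8 key blocks of 128 tokens each

scores = antidiagonal_block_scores(q_block, keys, block_size=128, stride=16)
kept = select_blocks(scores, tau=0.9)  # indices of key blocks retained for this query block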

3.3 Video Generation

  • We evaluate XAttention’s performance in the video generation domain using the HunyuanVideo model on prompts from VBench.
  • XAttention achieves high fidelity relative to videos generated with full attention, with differences that are difficult for the human eye to discern.

3.4 Efficiency Benchmarks

  • We compare the speedup of attention methods across context lengths, relative to FlashInfer's implementation of FlashAttention.
  • XAttention consistently outperforms other sparse attention methods, achieving up to 13.5× speedup at 256K tokens.
  • XAttention reduces block-pattern-search overhead, achieving a 24.9× speedup over MInference and a 5.9× speedup over FlexPrefill in the block-selection stage.

Citation

@inproceedings{xu2025xattention,
 title     = {XAttention: Block Sparse Attention with Antidiagonal Scoring},
 author    = {Xu, Ruyi and Xiao, Guangxuan and Huang, Haofeng and Guo, Junxian and Han, Song},
 booktitle = {Proceedings of the 42nd International Conference on Machine Learning (ICML)},
 year      = {2025}
}

Acknowledgment

We thank MIT-IBM Watson AI Lab, MIT and Amazon Science Hub, MIT AI Hardware Program, National Science Foundation, Hyundai, and Samsung for supporting this research. We thank NVIDIA for donating the DGX server.
