LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

Shang Yang*¹, Junxian Guo*¹³, Haotian Tang¹, Qinghao Hu¹, Guangxuan Xiao¹, Jiaming Tang¹, Yujun Lin¹, Zhijian Liu², Yao Lu², Song Han¹²
¹ MIT, ² NVIDIA, ³ SJTU
(* indicates equal contribution)

Abstract

Large language models (LLMs) have shown remarkable potential in processing long sequences, yet efficiently serving these long-context models remains challenging due to the quadratic computational complexity of attention in the prefilling stage and the large memory footprint of the KV cache in the decoding stage. To address these issues, we introduce LServe, an efficient system that accelerates long-sequence LLM serving via unified sparse attention. This method unifies different hardware-friendly, structured sparsity patterns for both prefilling and decoding attention into a single framework, where computations on less important tokens are skipped block-wise. LServe demonstrates the compatibility of static and dynamic sparsity in long-context LLM attention. This design enables multiplicative speedups by combining these optimizations. Specifically, we convert half of the attention heads to nearly free streaming heads in both the prefilling and decoding stages. Additionally, we find that only a constant number of KV pages is required to preserve long-context capabilities, irrespective of context length. We then design a hierarchical KV page selection policy that dynamically prunes KV pages based on query-centric similarity. For Llama-3-8B, LServe accelerates LLM prefilling by an average of 2.4x and decoding by up to 3.3x over the state-of-the-art serving systems, maintaining long-context accuracy.
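
To make the two sparsity patterns described above concrete, here is a minimal NumPy sketch of decoding-time attention with both head types. It is an illustration under our own assumptions, not the LServe implementation (which runs as fused GPU kernels); the function and parameter names (streaming_head, retrieval_head, select_pages, sink, window, page_size, num_pages) are ours.

import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, K, V):
    # q: (d,); K, V: (t, d). Standard scaled dot-product attention,
    # restricted to whatever tokens the caller passes in.
    scores = K @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

def streaming_head(q, K, V, sink=4, window=256):
    # Static sparsity: attend only to a few "sink" tokens at the start
    # plus a local window of recent tokens, so the per-step cost is
    # constant regardless of context length.
    idx = np.unique(np.r_[0:min(sink, len(K)), max(len(K) - window, 0):len(K)])
    return attend(q, K[idx], V[idx])

def select_pages(q, K, page_size=16, num_pages=64):
    # Dynamic sparsity: score every KV page with a query-centric bound
    # computed from the per-page elementwise min/max of its keys, then
    # keep only a constant number of the highest-scoring pages.
    n = (len(K) // page_size) * page_size
    if n == 0:
        return np.arange(len(K))
    pages = K[:n].reshape(-1, page_size, K.shape[-1])
    kmax, kmin = pages.max(axis=1), pages.min(axis=1)
    score = np.maximum(q * kmax, q * kmin).sum(axis=-1)
    keep = np.argsort(score)[-num_pages:]
    token_idx = [p * page_size + np.arange(page_size) for p in keep]
    token_idx.append(np.arange(n, len(K)))  # never drop the unpaged tail
    return np.sort(np.concatenate(token_idx))

def retrieval_head(q, K, V, page_size=16, num_pages=64):
    idx = select_pages(q, K, page_size, num_pages)
    return attend(q, K[idx], V[idx])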

Highlights

  • Efficient LLM inference engine for long sequences
  • Unified Sparse Attention framework for both the prefilling and decoding stages
  • Hardware-aware hierarchical KV paging system
  • Exploits temporal locality of queries to reduce page-selection overhead (see the sketch after this list)
  • Up to 3.3x speedup over state-of-the-art serving frameworks
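
The sketch below illustrates how the last two highlights fit together, under our own assumptions: logical pages group several physical KV pages and carry per-channel key statistics, and the selected page set is reused for a few consecutive decoding steps to amortize the selection cost. The class and parameter names (HierarchicalPageSelector, pages_per_logical, reuse_interval) are hypothetical and not part of the LServe API.

import numpy as np

class HierarchicalPageSelector:
    def __init__(self, keys, physical_page=16, pages_per_logical=4,
                 num_logical=16, reuse_interval=4):
        logical = physical_page * pages_per_logical
        n = (len(keys) // logical) * logical
        chunks = keys[:n].reshape(-1, logical, keys.shape[-1])
        # Per-logical-page key statistics used for query-centric scoring.
        self.kmax, self.kmin = chunks.max(axis=1), chunks.min(axis=1)
        self.num_logical = num_logical
        self.reuse_interval = reuse_interval
        self._step = 0
        self._cached = None

    def __call__(self, q):
        # Re-run page selection only every `reuse_interval` decoding steps;
        # in between, reuse the previous selection, since consecutive
        # queries tend to pick similar pages (temporal locality).
        if self._cached is None or self._step % self.reuse_interval == 0:
            score = np.maximum(q * self.kmax, q * self.kmin).sum(axis=-1)
            self._cached = np.sort(np.argsort(score)[-self.num_logical:])
        self._step += 1
        return self._cached  # indices of logical pages to attend to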

Results

We evaluate LServe across diverse long-context benchmarks and models, demonstrating consistently superior throughput over existing LLM serving frameworks for long-sequence inference, without compromising accuracy.

Accuracy Evaluation

We evaluate LServe on a wide range of long-context benchmarks, including LongBench and Needle-In-A-Haystack. Representative results on Llama-3-8B are shown below.

Efficiency Benchmarks

Compared with state-of-the-art serving systems, LServe delivers significant and consistent efficiency improvements across GPU platforms and model architectures. On Llama-3-8B and Minitron-4B, LServe achieves a 1.5× average speedup over vLLM. For the MHA-based model Llama-2-7B, LServe runs more than 2.0× faster than the baselines on average.

Benchmarking setting: we evaluated the decoding throughput at different sequence lengths for each model; the measured throughputs were then normalized to 1 in the figure below. Benchmarks were conducted on NVIDIA A100 (80 GB) and L40S (48 GB) GPUs.
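
The normalization is simply a per-sequence-length division by a reference system's throughput. A sketch of that arithmetic (not the measurement harness; which system serves as the reference in the figure is an assumption on our part):

def normalize_throughput(tokens_per_sec, reference):
    # tokens_per_sec: {system name -> measured decoding throughput} at one
    # sequence length; the reference system is scaled to 1 and every other
    # system becomes a relative factor.
    base = tokens_per_sec[reference]
    return {system: tps / base for system, tps in tokens_per_sec.items()}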

Citation

@article{yang2025lserve,
 title={LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention},
 author={Yang*, Shang and Guo*, Junxian and Tang, Haotian and Hu, Qinghao and Xiao, Guangxuan and Tang, Jiaming and Lin, Yujun and Liu, Zhijian and Lu, Yao and Han, Song},
 year={2025}
}
