QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

Yujun Lin*¹, Haotian Tang*¹, Shang Yang*¹, Zhekai Zhang¹, Guangxuan Xiao¹, Chuang Gan³⁴, Song Han¹²
¹ MIT, ² NVIDIA, ³ UMass Amherst, ⁴ MIT-IBM Watson AI Lab
(* indicates equal contribution)

Abstract

Quantization can accelerate large language model (LLM) inference. Going beyond INT8 quantization, the research community is actively exploring even lower precision, such as INT4. Nonetheless, state-of-the-art INT4 quantization techniques only accelerate low-batch, edge LLM inference, failing to deliver performance gains in large-batch, cloud-based LLM serving. We uncover a critical issue: existing INT4 quantization methods suffer from significant runtime overhead (20-90%) when dequantizing either weights or partial sums on GPUs. To address this challenge, we introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weights, 8-bit activations, and a 4-bit KV cache. QoQ stands for quattuor-octo-quattuor, which represents 4-8-4 in Latin. QoQ is implemented in the QServe inference library, which achieves measured speedups. The key insight driving QServe is that the efficiency of LLM serving on GPUs is critically influenced by operations on low-throughput CUDA cores. Building upon this insight, the QoQ algorithm introduces progressive quantization, which keeps dequantization overhead low in W4A8 GEMM. Additionally, we develop SmoothAttention to effectively mitigate the accuracy degradation incurred by 4-bit KV quantization. In the QServe system, we perform compute-aware weight reordering and take advantage of register-level parallelism to reduce dequantization latency. We also make fused attention memory-bound, harnessing the performance gain brought by KV4 quantization. As a result, QServe improves the maximum achievable serving throughput of Llama-3-8B by 1.2× on A100 and 1.4× on L40S, and of Qwen1.5-72B by 2.4× on A100 and 3.5× on L40S, compared to TensorRT-LLM. Remarkably, QServe on an L40S GPU can achieve even higher throughput than TensorRT-LLM on an A100. Thus, QServe effectively reduces the dollar cost of LLM serving by 3×.
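
To make progressive quantization concrete, below is a minimal NumPy sketch of two-level weight quantization in the spirit of QoQ: weights are first quantized per output channel to INT8 with floating-point scales, then per group to UINT4 with integer scales and zero points, so that 4-bit-to-8-bit dequantization stays integer-only. The group size, function names, and dtype choices are illustrative assumptions, not the exact layout used by the QServe kernels, which perform this dequantization in GPU registers.

import numpy as np

def progressive_quantize(w, group_size=128):
    """Two-level (progressive) weight quantization sketch; illustrative, not the QServe kernel layout."""
    # Level 1: per-output-channel symmetric INT8 quantization with a floating-point scale.
    s_channel = np.abs(w).max(axis=1, keepdims=True) / 127.0
    w_int8 = np.clip(np.round(w / s_channel), -127, 127)

    # Level 2: per-group asymmetric UINT4 quantization of the INT8 weights, with integer
    # group scales and zero points, so 4-to-8-bit dequantization needs only integer
    # subtraction and multiplication.
    rows, cols = w_int8.shape
    w_grp = w_int8.reshape(rows, cols // group_size, group_size)
    g_min = w_grp.min(axis=2, keepdims=True)
    g_max = w_grp.max(axis=2, keepdims=True)
    s_group = np.maximum(np.round((g_max - g_min) / 15.0), 1)
    zero = np.round(-g_min / s_group)
    w_uint4 = np.clip(np.round(w_grp / s_group) + zero, 0, 15)
    return w_uint4, zero, s_group, s_channel

def dequantize_to_int8(w_uint4, zero, s_group):
    # Integer-only dequantization back to (approximate) INT8 weights.
    return ((w_uint4 - zero) * s_group).reshape(w_uint4.shape[0], -1)

# Example with hypothetical shapes:
# w = np.random.randn(4096, 4096).astype(np.float32)
# w4, z, sg, sc = progressive_quantize(w)
# w8 = dequantize_to_int8(w4, z, sg)  # INT8 GEMM then runs on tensor cores; sc rescales the INT32 partial sums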

[Teaser figure]

Highlights

  • Efficient LLM inference for both low-batch and large-batch
  • Combine the best of SmoothQuant and AWQ
  • Match TensorRT-LLM's performance on A100 with more cost-effective L40S GPUs
  • 4-bit KV cache compression
  • Code open sourced

Results

We evaluate QServe W4A8KV4 quantization on a wide range of mainstream LLMs. QServe consistently outperforms existing W4A4 or W4A8 solutions in accuracy while providing state-of-the-art LLM serving efficiency.

Accuracy Evaluation

Below is the WikiText2 perplexity evaluated with a sequence length of 2048. Lower is better.
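
As a reference for how such numbers are typically produced, below is a minimal sketch of the standard sliding-chunk WikiText2 perplexity protocol at a 2048-token context, using HuggingFace transformers and datasets. The model checkpoint and chunking details are assumptions, not the exact evaluation script used for these results.

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
model.eval()

# Concatenate the WikiText-2 test split and tokenize it once.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

seq_len, nlls = 2048, []
with torch.no_grad():
    for i in range(0, ids.shape[1] - seq_len, seq_len):
        chunk = ids[:, i:i + seq_len]
        # Passing labels returns the mean next-token negative log-likelihood over the chunk.
        nlls.append(model(chunk, labels=chunk).loss)

print("perplexity:", torch.exp(torch.stack(nlls).mean()).item())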

Efficiency Benchmarks

When serving large language models on L40S and A100 GPUs, QServe achieves 1.2x-1.4x higher throughput than the leading industry solution, TensorRT-LLM, for Llama-3-8B, and 2.4x-3.5x higher throughput for Qwen1.5-72B. It also delivers higher throughput and accommodates the same batch size on L40S compared with TensorRT-LLM on A100 for six of the eight models benchmarked, effectively reducing the dollar cost of LLM serving by around 3x.

Benchmarking setting: the criterion is maximum achievable throughput on NVIDIA GPUs, with an input context length of 1024 tokens and an output generation length of 512 tokens. Paged attention is enabled for all systems that support it. In-flight batching is turned off in the efficiency benchmarks.
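
The sketch below illustrates this criterion: generated tokens per second at a fixed batch size with 1024-token prompts and 512 generated tokens, swept over batch sizes to find the maximum. The engine.generate interface here is a hypothetical placeholder, not the QServe or TensorRT-LLM API.

import time

def measure_throughput(engine, batch_size, prompt_len=1024, gen_len=512):
    # Dummy prompts of the benchmark's input context length.
    prompts = [[0] * prompt_len for _ in range(batch_size)]
    start = time.perf_counter()
    engine.generate(prompts, max_new_tokens=gen_len)  # hypothetical serving-engine call
    elapsed = time.perf_counter() - start
    return batch_size * gen_len / elapsed  # generated tokens per second

# Maximum achievable throughput: take the best over a batch-size sweep, e.g.
# best = max(measure_throughput(engine, b) for b in (16, 32, 64, 128, 256))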

[Efficiency benchmark figure]

Citation

@article{lin2024qserve,
 title={QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving},
 author={Lin*, Yujun and Tang*, Haotian and Yang*, Shang and Zhang, Zhekai and Xiao, Guangxuan and Gan, Chuang and Han, Song},
 journal={arXiv preprint arXiv:2405.04532},
 year={2024}
}

Acknowledgment

We thank MIT-IBM Watson AI Lab, MIT AI Hardware Program, Amazon, MIT Science Hub, and NSF for supporting this research. We also thank Julien Demouth, June Yang, and Dongxu Yang from NVIDIA for their helpful discussions.
