Efficient AI Computing,
Transforming the Future.

Projects

To filter projects, check the boxes for categories, topics, and techniques.

LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

MLSys 2025

LServe accelerates long-sequence LLM serving with unified sparse attention for both prefilling and decoding, achieving up to a 3.3× speedup over state-of-the-art solutions without sacrificing accuracy.
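As a rough illustration of the idea behind sparse attention, here is a toy block-sparse variant in NumPy: each query block attends only to the few key blocks that look most relevant under a cheap pooled score. The function names, block size, and selection heuristic are illustrative assumptions, not LServe's actual algorithm.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_sparse_attention(q, k, v, block=4, keep=2):
    """Toy block-sparse attention: each query block attends only to the
    `keep` key blocks with the highest mean-pooled similarity."""
    T, d = q.shape
    nb = T // block
    # Mean-pool queries and keys per block to estimate importance cheaply.
    q_pool = q.reshape(nb, block, d).mean(axis=1)   # (nb, d)
    k_pool = k.reshape(nb, block, d).mean(axis=1)   # (nb, d)
    scores = q_pool @ k_pool.T                      # (nb, nb) block scores
    out = np.zeros_like(q)
    for i in range(nb):
        top = np.argsort(scores[i])[-keep:]         # selected key blocks
        cols = np.concatenate(
            [np.arange(b * block, (b + 1) * block) for b in top])
        qi = q[i * block:(i + 1) * block]
        att = softmax(qi @ k[cols].T / np.sqrt(d))  # attend to kept keys only
        out[i * block:(i + 1) * block] = att @ v[cols]
    return out
```

With `keep` equal to the number of blocks this reduces to dense attention; with small `keep`, each query touches only a fraction of the KV cache, which is the source of the speedup.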

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

MLSys 2025

QServe accelerates large-scale LLM serving on GPUs with QoQ (W4A8KV4) quantization, boosting generation throughput by up to 3× over state-of-the-art solutions.
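To make the W4A8 notation concrete, here is a minimal NumPy sketch of a matmul with 4-bit weights (per-output-channel scales) and 8-bit activations (per-tensor scale). This simulates the numerics only; the function names are illustrative, and QServe's actual kernels and QoQ scheme are considerably more involved.

```python
import numpy as np

def quantize_sym(x, bits, axis=None):
    """Symmetric uniform quantization: returns integer codes and scale."""
    qmax = 2 ** (bits - 1) - 1
    amax = np.abs(x).max(axis=axis, keepdims=axis is not None)
    scale = np.where(amax == 0, 1.0, amax / qmax)
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def w4a8_matmul(x, w):
    """W4A8-style matmul: int4 weights, int8 activations, integer accumulate."""
    qx, sx = quantize_sym(x, bits=8)            # per-tensor activation scale
    qw, sw = quantize_sym(w, bits=4, axis=1)    # per-output-channel weight scales
    acc = qx @ qw.T                             # low-precision integer accumulation
    return acc * sx * sw.T                      # dequantize the result
```

The point of the co-design is that the integer accumulation maps onto fast GPU int arithmetic while the scales restore the floating-point range afterward.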

SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

ICLR 2025 (spotlight)

SVDQuant is a W4A4 quantization paradigm for diffusion models that absorbs outliers into a low-rank branch before 4-bit quantization.
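The title's idea can be sketched in a few lines: approximate a weight matrix as a low-rank component (kept in higher precision) plus a 4-bit-quantized residual. The low-rank branch absorbs the large singular directions, so the residual has a smaller dynamic range and quantizes with less error. This is a simplified stand-in, assuming plain SVD and per-tensor fake quantization, not SVDQuant's full method.

```python
import numpy as np

def fake_quant(w, bits=4):
    """Per-tensor symmetric fake quantization (quantize then dequantize)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax - 1, qmax) * scale

def lowrank_plus_quant(w, rank=4, bits=4):
    """Approximate W as a low-rank branch plus a 4-bit residual.
    The low-rank part absorbs outlier directions, shrinking the
    residual's dynamic range before quantization."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    lowrank = (u[:, :rank] * s[:rank]) @ vt[:rank]
    residual = w - lowrank
    return lowrank + fake_quant(residual, bits)
```

On a matrix with a strong outlier direction, this approximation is markedly more accurate than quantizing the whole matrix to 4 bits directly.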

COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training

ICLR 2025

We propose COAT, a memory-efficient FP8 training method that compresses optimizer states and activations for large language models.
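To illustrate why compressing optimizer states saves memory, here is a toy NumPy sketch that stores a state vector as per-group 8-bit codes plus one scale per group, then reconstructs it on use. This uses plain int8 as a simple stand-in for FP8 and is not COAT's actual scheme; the function names and group size are illustrative assumptions.

```python
import numpy as np

def compress_state(state, group=16, bits=8):
    """Toy per-group quantization of an optimizer state vector:
    store int8 codes plus one float scale per group of `group` values."""
    qmax = 2 ** (bits - 1) - 1
    g = state.reshape(-1, group)
    scale = np.abs(g).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # avoid divide-by-zero
    codes = np.round(g / scale).astype(np.int8)   # 1 byte per value
    return codes, scale

def decompress_state(codes, scale, shape):
    """Reconstruct the state from codes and per-group scales."""
    return (codes * scale).reshape(shape)
```

Storing one byte per value plus a small per-group scale cuts the state's footprint roughly 4× versus float32, at the cost of a small reconstruction error.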