Generative AI

Projects

LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

MLSys 2025

LServe accelerates long-sequence LLM serving with unified sparse attention for both prefilling and decoding, achieving up to 3.3× speedup over the state-of-the-art solution without sacrificing accuracy.

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

MLSys 2025

QServe accelerates large-scale LLM serving on GPUs with QoQ (W4A8KV4) quantization, boosting generation throughput by up to 3× over the state-of-the-art solution.

SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

A new W4A4 quantization paradigm for diffusion models.

COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training

ICLR 2025

We propose COAT, a memory-efficient FP8 training method for large language models.

SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

SANA is an efficient linear DiT that can generate images at resolutions up to 4096 × 4096. It is 20× smaller and 100× faster than FLUX, deployable on a laptop GPU, and achieves top-notch GenEval and DPGBench results.

Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models

ICLR 2025

A new family of high-spatial compression autoencoders for accelerating high-resolution diffusion models.

Blog Posts

DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models

February 29, 2024

In this blog post, we introduce DistriFusion, a training-free algorithm that harnesses multiple GPUs to accelerate diffusion model inference without sacrificing image quality. It can reduce SDXL latency by up to 6.1× on 8 A100s. Our work has been accepted to CVPR 2024 as a highlight. Code: https://github.com/mit-han-lab/distrifusion