Quantization

Projects

LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

MLSys 2025

LServe accelerates long-sequence LLM serving with unified sparse attention for both prefilling and decoding, achieving up to 3.3× speedup over state-of-the-art solution without sacrificing accuracy.

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

MLSys 2025

(

)

QServe accelerates large-scale LLM serving on GPUs with QoQ (W4A8KV4) quantization, boosting the generation throughputs by up to 3x over the state-of-the-art solution.

SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

A new W4A4 quantization paradigm for diffusion models.

COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training

ICLR 2025

(

)

We propose COAT, a memory efficient FP8 training method for large language models.

SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

SANA is an efficient linear DiT that can generate images up to 4096 × 4096. SANA delivers: 20x smaller & 100x faster than FLUX; Deployable on laptop GPU; Top-notch GenEval & DPGBench results.

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

MLSys 2024

(

)

Low-bit weight-only quantization for LLMs.

Blog Posts

SVDQuant Meets NVFP4: 4× Smaller and 3× Faster FLUX with 16-bit Quality on NVIDIA Blackwell GPUs

February 21, 2025

SVDQuant supports NVFP4 on NVIDIA Blackwell GPUs with 3× speedup over BF16 and better image quality than INT4. Try our interactive demo below or at https://svdquant.mit.edu/! Our code is all available at https://github.com/mit-han-lab/nunchaku.

TinyChat 2.0: Accelerating Edge AI with Efficient LLM and VLM Deployment

December 12, 2024

Explore the latest advancement in TinyChat – the 2.0 version with significant advancements in prefilling speed of Edge LLMs and VLMs. Apart from the 3-4x decoding speedups achieved with AWQ quantization, TinyChat 2.0 now delivers state-of-the-art Time-To-First-Token, which is 1.5-1.7x faster than the legacy version of TinyChat.

SVDQuant: Accurate 4-Bit Quantization Powers 12B FLUX on a 16GB 4090 Laptop with 3x Speedup

November 7, 2024

A new post-training training quantization paradigm for diffusion models, which quantize both the weights and activations of FLUX.1 to 4 bits, achieving 3.5× memory and 8.7× latency reduction on a 16GB laptop 4090 GPU. Code: https://www.github.com/mit-han-lab/nunchaku

TinyChat: Visual Language Models & Edge AI 2.0

March 3, 2024

Explore the latest advancement in TinyChat and AWQ – the integration of Visual Language Models (VLM) on the edge! The exciting advancements in VLM allows LLMs to comprehend visual inputs, enabling seamless image understanding tasks like caption generation, question answering, and more. With the latest release, TinyChat now supports leading VLMs such as VILA, which can be easily quantized with AWQ, empowering users with seamless experience for image understanding tasks.

TinyChat: Large Language Model on the Edge

September 6, 2023

Running large language models (LLMs) on the edge is of great importance. In this blog, we introduce TinyChat, an efficient and lightweight system for LLM deployment on the edge. It runs Meta's latest LLaMA-2 model at 30 tokens / second on NVIDIA Jetson Orin and can easily support different models and hardware.