Efficient AI Algorithm

Projects

DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer

ICCV 2025

DC-AR is a high-efficiency masked AR framework for text-to-image generation, leveraging DC-HT, a hybrid tokenizer enabling 32x compression. It refines images via residual tokens and achieves 1.5–7.9x higher throughput and 2–3.5x lower latency than other leading models.

XAttention: Block Sparse Attention with Antidiagonal Scoring

ICML 2025

A plug-and-play method that uses antidiagonal sums to efficiently identify important parts of the attention matrix, achieving up to 13.5x speedup on long-context tasks with comparable accuracy to full attention.
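
As a rough illustration of antidiagonal scoring (a minimal sketch, not the XAttention implementation), the code below sums strided antidiagonal entries inside each block of a score matrix and keeps the highest-scoring key blocks per query block. The function names, the `stride` parameter, the dense input matrix, and the top-k selection rule are all assumptions made for illustration.

```python
import numpy as np

def antidiagonal_block_scores(scores, block, stride):
    """Score each (block x block) tile by summing its strided antidiagonals.

    `scores` is a dense (n_q, n_k) matrix here only for clarity; a practical
    method would avoid materializing it. All names are hypothetical.
    """
    n_q, n_k = scores.shape
    assert n_q % block == 0 and n_k % block == 0
    # Rearrange into tiles with shape (row_blocks, col_blocks, block, block).
    tiles = scores.reshape(n_q // block, block, n_k // block, block)
    tiles = tiles.transpose(0, 2, 1, 3)
    # Mark every `stride`-th antidiagonal inside a tile (i + j constant).
    i, j = np.indices((block, block))
    mask = (i + j) % stride == stride - 1
    return (tiles * mask).sum(axis=(2, 3))

def select_blocks(block_scores, keep):
    """Keep the `keep` highest-scoring key blocks in each query-block row
    (top-k is an assumed, simplified selection rule)."""
    top = np.argsort(-block_scores, axis=1)[:, :keep]
    selected = np.zeros_like(block_scores, dtype=bool)
    np.put_along_axis(selected, top, True, axis=1)
    return selected

# Toy usage: a 4x4 score matrix split into 2x2 blocks.
toy = np.arange(16).reshape(4, 4)
toy_scores = antidiagonal_block_scores(toy, block=2, stride=2)
toy_mask = select_blocks(toy_scores, keep=1)
```

The resulting boolean mask marks which blocks a block-sparse attention kernel would compute; the remaining blocks would be skipped.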

Radial Attention: O(nlogn) Sparse Attention with Energy Decay for Long Video Generation

ArXiv 2025

An O(nlogn) sparse attention mask for long video generation.

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

MLSys 2025

QServe accelerates large-scale LLM serving on GPUs with QoQ (W4A8KV4) quantization, boosting generation throughput by up to 3x over the state-of-the-art solution.

SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

A new W4A4 quantization paradigm for diffusion models.

COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training

ICLR 2025

We propose COAT, a memory-efficient FP8 training method for large language models.

Blog Posts

Radial Attention: O(nlogn) Sparse Attention for Long Video Generation with 2–4× Speedups in Training and Inference

July 3, 2025

A sparse attention mechanism with O(nlogn) computational complexity for long video generation. It can speed up both training and inference by 2–4×. The code is available at https://github.com/mit-han-lab/radial-attention

SVDQuant Meets NVFP4: 4× Smaller and 3× Faster FLUX with 16-bit Quality on NVIDIA Blackwell GPUs

February 21, 2025

SVDQuant supports NVFP4 on NVIDIA Blackwell GPUs with 3× speedup over BF16 and better image quality than INT4. Try our interactive demo at https://svdquant.mit.edu/! All of our code is available at https://github.com/mit-han-lab/nunchaku.

SVDQuant: Accurate 4-Bit Quantization Powers 12B FLUX on a 16GB 4090 Laptop with 3x Speedup

November 7, 2024

A new post-training quantization paradigm for diffusion models, which quantizes both the weights and activations of FLUX.1 to 4 bits, achieving 3.5× memory and 8.7× latency reduction on a 16GB laptop 4090 GPU. Code: https://www.github.com/mit-han-lab/nunchaku
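
A back-of-the-envelope calculation suggests why 4-bit weights matter on a 16GB GPU. The sketch below assumes a 12B-parameter model (taken from the post title) and counts weight storage only. The helper name `weight_memory_gb` is hypothetical, and the estimate ignores activations, quantization scales, and any components kept in higher precision.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 12e9  # assumed from the "12B FLUX" in the title

mem_16bit = weight_memory_gb(NUM_PARAMS, 16)  # 16-bit weights
mem_4bit = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit weights

print(f"16-bit weights: {mem_16bit:.1f} GB")
print(f"4-bit weights:  {mem_4bit:.1f} GB")
print(f"ideal ratio:    {mem_16bit / mem_4bit:.1f}x")
```

Under these assumptions, 16-bit weights alone (about 24 GB) would not fit in 16 GB, while 4-bit weights (about 6 GB) would. The ideal 4× ratio is an upper bound on weight savings; the figure reported above is 3.5×.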

Patch Conv: Patch Convolution to Avoid Large GPU Memory Usage of Conv2D

March 10, 2024

In this blog, we introduce Patch Conv to reduce the memory footprint when generating high-resolution images. Patch Conv cuts memory usage by over 2.4× compared to the existing PyTorch implementation. Code: https://github.com/mit-han-lab/patch_conv

TinyChat: Visual Language Models & Edge AI 2.0

March 3, 2024

Explore the latest advancement in TinyChat and AWQ: the integration of Visual Language Models (VLMs) on the edge! Recent advances in VLMs allow LLMs to comprehend visual inputs, enabling image understanding tasks like caption generation, question answering, and more. With the latest release, TinyChat now supports leading VLMs such as VILA, which can be easily quantized with AWQ, giving users a seamless experience for image understanding tasks.

DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models

February 29, 2024

In this blog, we introduce DistriFusion, a training-free algorithm to harness multiple GPUs to accelerate diffusion model inference without sacrificing image quality. It can reduce SDXL latency by up to 6.1× on 8 A100s. Our work has been accepted by CVPR 2024 as a highlight. Code: https://github.com/mit-han-lab/distrifusion

TinyChat: Large Language Model on the Edge

September 6, 2023

Running large language models (LLMs) on the edge is of great importance. In this blog, we introduce TinyChat, an efficient and lightweight system for LLM deployment on the edge. It runs Meta's latest LLaMA-2 model at 30 tokens / second on NVIDIA Jetson Orin and can easily support different models and hardware.

On-Device Training Under 256KB Memory

November 28, 2022

In MCUNetV3, we enable on-device training under 256KB SRAM and 1MB Flash, using less than 1/1000 of the memory of PyTorch while matching the accuracy on the visual wake words application. This lets the model adapt to newly collected sensor data, so users can enjoy customized services without uploading the data to the cloud, which protects privacy.

Reducing the carbon footprint of AI using the Once-for-All network

July 3, 2020

“The aim is smaller, greener neural networks,” says Song Han, an assistant professor in the Department of Electrical Engineering and Computer Science. “Searching efficient neural network architectures has until now had a huge carbon footprint. But we reduced that footprint by orders of magnitude with these new methods.”

Auto Hardware-Aware Neural Network Specialization on ImageNet in Minutes

July 2, 2020

This tutorial shows how to use the Once-for-All (OFA) Network to get specialized ImageNet models for the target hardware in minutes, using only your laptop.

Topics

Large Language Models (LLMs)

Techniques

Pruning & Sparsity

Data Augmentation

Quantization

Neural Architecture Search (NAS)

Distillation

New Architecture

ML for Hardware & System

Efficient AI Computing,
Transforming the Future.
