Who We Are

Welcome to MIT HAN Lab! We focus on making AI faster, smarter, and more efficient. Our research covers a broad spectrum, including generative AI (e.g., LLMs and diffusion models), TinyML, system optimization and hardware design. By integrating algorithm and hardware expertise, we strive to push the frontiers of AI efficiency and performance.

Graduated PhD students: Ji Lin (OpenAI), Hanrui Wang (assistant professor @UCLA), Zhijian Liu (assistant professor @UCSD), Han Cai (NVIDIA Research), Haotian Tang (Google DeepMind), Yujun Lin (NVIDIA Research).

Highlights

Accelerating LLM and Generative AI [slides]:

LLM Quantization: AWQ, TinyChat enables on-device LLM inference with 4bit quantization (best paper award at MLSys'24), with 19 million downloads on HuggingFace. SmoothQuant is a training-free and accuracy-preserving 8-bit post-training quantization (PTQ) solution for LLMs. QServe speeds up the large scale LLM serving with W4A8KV4 quantization (4-bit weights, 8-bit activations, and 4-bit KV cache). COAT enables memory efficient FP8 training.

Long Context LLM: StreamingLLM enables LLMs to generate infinite-length texts with a fixed memory budget by preserving the "attention sinks" in the KV-cache. Quest leverages query-aware sparsity in long-context KV cache to boost inference throughput. DuoAttention reduces both LLM's decoding and pre-filling memory and latency with retrieval and streaming heads. LServe accelerates long-context LLM serving with hardware-aware unified sparse attention framework.
Efficient Visual Generation: HART is an autoregressive visual generation model capable of directly generating 1024×1024 images on a laptop. SANA enables 4K image synthesis under low computation, using deep compression auto-encoder (DC-AE) and linear diffusion transformer. SVDQuant further enables 4-bit diffusion models (W4A4) by absorbing the outliers with low-rank components.
Efficient Visual Language Models: VILA, VILA-U, LongVILA are a family of efficient visual language models for both understanding and generation. LongVILA efficiently scales to 6K frames of video.

We Work On

The incredible potential of large models in Artificial Intelligence Generated Content (AIGC), including cutting-edge technologies like Large Language Models (LLMs) and Diffusion Models, have revolutionized a wide range of applications, spanning natural language processing, content generation, creative arts, and more. However, large model size, and high memory and computational requirements present formidable challenges. We aim to tackle these hurdles head-on and make these advanced AI technologies more practical, democratizing access to these future-changing technologies for everyone.

Efficient AI Hardware & System

Efficient AI Algorithm

Efficient AI Hardware & System

Efficiency improvements in deep learning often start with refining algorithms, but these theoretical gains, like reducing FLOPs and model size, don't always easily lead to practical speed and energy savings. The demand arises for specialized hardware and software systems to bridge this gap. These specialized software and hardware systems create a fresh design dimension independent of the algorithm space. This opens up opportunities for holistic optimization by co-designing both the algorithm and the software/hardware systems.

1

2

News

May 2025
5/12/2025
LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
appears at
to appear at
MLSys 2025
.
LServe accelerates long-sequence LLM serving with unified sparse attention for both prefilling and decoding, achieving up to 3.3× speedup over state-of-the-art solution without sacrificing accuracy.
LServe Paper Code Slides Video

Apr 2025
4/26/2025
SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
appears at
to appear at
ICLR 2025
.
A new W4A4 quantization paradigm for diffusion models.
SVDQuant Paper Code Slides Video

Apr 2025
4/25/2025
COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training
appears at
to appear at
ICLR 2025
.
We propose COAT, a memory efficient FP8 training method for large language models.
COAT Paper Code Slides Video

Apr 2025
4/25/2025
SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers
appears at
to appear at
ICLR 2025
.
SANA is an efficient linear DiT that can generate images up to 4096 × 4096. SANA delivers: 20x smaller & 100x faster than FLUX; Deployable on laptop GPU; Top-notch GenEval & DPGBench results.
SANA Paper Code Slides Video

Apr 2025
4/25/2025
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
appears at
to appear at
ICLR 2025
.
By selectively applying full attention to critical attention heads and using "Streaming Attention" on others, DuoAttention significantly reduces both pre-filling and decoding memory usage and latency for long-context LLMs, while maintaining their long-context capabilities.
DuoAttention Paper Code Slides Video

Apr 2025
4/25/2025
Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models
appears at
to appear at
ICLR 2025
.
A new family of high-spatial compression autoencoders for accelerating high-resolution diffusion models.
DC-AE Paper Code Slides Video

Apr 2025
4/24/2025
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
appears at
to appear at
ICLR 2025
.
LongVILA is a full-stack solution for long VLM, which incorporates novel training strategies, dataset, and the Multi-Modal Sequence Parallelism (MM-SP) system to efficiently handle long video understanding, achieving significant scalability, accuracy, and speed improvements on multi-modal benchmarks.
LongVILA Paper Code Slides Video

Apr 2025
4/24/2025
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
appears at
to appear at
ICLR 2025
.
VILA-U is a Unified foundation model that integrates Video, Image, Language understanding and generation.
VILA-U Paper Code Slides Video

Apr 2025
4/24/2025
HART: Efficient Visual Generation with Hybrid Autoregressive Transformer
appears at
to appear at
ICLR 2025
.
HART is an autoregressive transformer that generates high resolution images with comparable quality to diffusion models, while offering 4.5-7.7x higher throughput.
HART Paper Code Slides Video

Mar 2025
3/4/2025
LEGO: Spatial Accelerator Generation and Optimization for Tensor Applications
appears at
to appear at
HPCA 2025
.
LEGO is an automatic RTL generator for AI accelerators.
LEGO Paper Code Slides Video

Our Full-Stack Projects

To choose projects, simply check the boxes of the categories, topics and techniques.

EIE: efficient inference engine on compressed deep neural network

ISCA 2016

(

)

We propose an energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing.

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

ICLR 2016

(

)

We introduce “deep compression”, a three stage pipeline: pruning, trained quantization and Huffman coding, that work together to reduce the storage requirement of neural networks by 35× to 49× without affecting their accuracy.

Learning both Weights and Connections for Efficient Neural Network

NIPS 2015

(

)

We describe a method to reduce the storage and computation required by neural networks by an order of magnitude without affecting their accuracy by learning only the important connections.

Our Impacts

We actively collaborate with industry partners on efficient AI, model compression and acceleration. Our research has influenced and landed in many industrial products: Intel OpenVino, Intel Neural Network Distiller, Intel Neural Compressor, Apple Neural Engine, NVIDIA Sparse Tensor Core, NVIDIA TensorRT LLM, AMD-Xilinx Vitis AI, Qualcomm AI Model Efficiency Toolkit (AIMET), Amazon AutoGluon, Facebook PyTorch, Microsoft NNI, SONY Neural Architecture Search Library, SONY Model Compression Toolkit, ADI MAX78000/MAX78002 Model Training and Synthesis Tool.