Who We Are

Welcome to MIT HAN Lab! We focus on making AI faster, smarter, and more efficient. Our research covers a broad spectrum, including generative AI (e.g., LLMs and diffusion models), TinyML, system optimization and hardware design. By integrating algorithm and hardware expertise, we strive to push the frontiers of AI efficiency and performance.

Graduated PhD students: Ji Lin (OpenAI), Hanrui Wang (assistant professor @UCLA), Zhijian Liu (assistant professor @UCSD), Han Cai (NVIDIA Research), Haotian Tang (Google DeepMind), Yujun Lin (NVIDIA Research).

Highlights

Accelerating LLM and Generative AI [slides]:

LLM Quantization: AWQ and TinyChat enable on-device LLM inference with 4-bit quantization (best paper award at MLSys'24), with 19 million downloads on HuggingFace. SmoothQuant is a training-free, accuracy-preserving 8-bit post-training quantization (PTQ) solution for LLMs. QServe speeds up large-scale LLM serving with W4A8KV4 quantization (4-bit weights, 8-bit activations, and 4-bit KV cache). COAT enables memory-efficient FP8 training.

Long Context LLM: StreamingLLM enables LLMs to generate infinite-length texts with a fixed memory budget by preserving the "attention sinks" in the KV cache. Quest leverages query-aware sparsity in the long-context KV cache to boost inference throughput. DuoAttention reduces the memory and latency of both LLM decoding and pre-filling with retrieval and streaming heads. LServe accelerates long-context LLM serving with a hardware-aware unified sparse attention framework.
Efficient Visual Generation: HART is an autoregressive visual generation model capable of directly generating 1024×1024 images on a laptop. SANA enables 4K image synthesis at low computational cost, using a deep compression auto-encoder (DC-AE) and a linear diffusion transformer. SVDQuant further enables 4-bit diffusion models (W4A4) by absorbing the outliers with low-rank components.
Efficient Visual Language Models: VILA, VILA-U, and LongVILA are a family of efficient visual language models for both understanding and generation. LongVILA efficiently scales to 6K frames of video.
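
The StreamingLLM item above describes keeping a fixed memory budget by preserving the first few "attention sink" entries of the KV cache. The Python sketch below illustrates only that eviction policy, under stated assumptions: the class name `SinkKVCache` and the parameters `num_sinks` and `window` are hypothetical, and the cache stores plain token positions rather than key/value tensors. It is a toy illustration, not the StreamingLLM implementation, and it ignores how a real system handles positional information.

```python
from collections import deque


class SinkKVCache:
    """Toy fixed-budget cache: the first `num_sinks` entries are kept
    forever, and the rest live in a sliding window of recent entries.
    (Hypothetical helper for illustration only.)"""

    def __init__(self, num_sinks: int, window: int):
        if num_sinks < 0 or window <= 0:
            raise ValueError("num_sinks must be >= 0 and window must be > 0")
        self.num_sinks = num_sinks
        self.sinks = []                     # never evicted
        self.recent = deque(maxlen=window)  # oldest entry dropped when full

    def append(self, entry):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(entry)
        else:
            self.recent.append(entry)

    def entries(self):
        """Entries that attention would see at the current step."""
        return self.sinks + list(self.recent)

    def __len__(self):
        return len(self.sinks) + len(self.recent)


# Stream 1000 "tokens" through a cache that holds at most 4 + 8 entries.
cache = SinkKVCache(num_sinks=4, window=8)
for position in range(1000):
    cache.append(position)

print(len(cache))       # stays bounded by num_sinks + window
print(cache.entries())  # the first 4 positions plus the 8 most recent
```

The cache size never exceeds `num_sinks + window`, no matter how many tokens are streamed, which is the fixed-budget property the text refers to.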

We Work On

The incredible potential of large models in Artificial Intelligence Generated Content (AIGC), including cutting-edge technologies like Large Language Models (LLMs) and Diffusion Models, has revolutionized a wide range of applications, spanning natural language processing, content generation, creative arts, and more. However, large model sizes and high memory and computational requirements present formidable challenges. We aim to tackle these hurdles head-on and make these advanced AI technologies more practical, democratizing access to these future-changing technologies for everyone.

Efficient AI Hardware & System

Efficiency improvements in deep learning often start with refining algorithms, but theoretical gains such as fewer FLOPs and smaller model size do not always translate into practical speedups and energy savings. Specialized hardware and software systems are needed to bridge this gap. These systems create a fresh design dimension independent of the algorithm space, which opens up opportunities for holistic optimization by co-designing the algorithm and the software/hardware systems.

News

Mar 2024
3/10/2024
A new blog post, "Patch Conv: Patch Convolution to Avoid Large GPU Memory Usage of Conv2D", is published.
In this blog, we introduce Patch Conv to reduce the memory footprint when generating high-resolution images. Patch Conv cuts memory usage by over 2.4× compared to the existing PyTorch implementation. Code: https://github.com/mit-han-lab/patch_conv

Mar 2024
3/3/2024
A new blog post, "TinyChat: Visual Language Models & Edge AI 2.0", is published.
Explore the latest advancement in TinyChat and AWQ: the integration of Visual Language Models (VLMs) on the edge! The exciting advancements in VLMs allow LLMs to comprehend visual inputs, enabling image understanding tasks like caption generation, question answering, and more. With the latest release, TinyChat now supports leading VLMs such as VILA, which can be easily quantized with AWQ, giving users a seamless experience for image understanding tasks.

Feb 2024
2/29/2024
A new blog post, "DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models", is published.
In this blog, we introduce DistriFusion, a training-free algorithm to harness multiple GPUs to accelerate diffusion model inference without sacrificing image quality. It can reduce SDXL latency by up to 6.1× on 8 A100s. Our work has been accepted by CVPR 2024 as a highlight. Code: https://github.com/mit-han-lab/distrifusion

Feb 2024
2/24/2024
AWQ has been accepted to MLSys 2024!
AWQ

Feb 2024
2/18/2024
Congrats to Hanrui Wang on Rising Star in Solid-State Circuits at ISSCC.

Feb 2024
2/13/2024
Our work StreamingLLM is covered by MIT News as a spotlight!
StreamingLLM

Feb 2024
2/13/2024
StreamingLLM is covered by MIT News, MIT Homepage: "A new way to let AI chatbots converse all day without crashing".

Feb 2024
2/1/2024
We released a new version of the quantized GEMM/GEMV kernels in TinyChat, leading to 38 tokens/second inference speed on NVIDIA Jetson Orin!
AWQ

Feb 2024
2/1/2024
We supported VILA Vision Language Models in AWQ & TinyChat! Check our latest demos with multi-image inputs!
AWQ

Jan 2024
1/31/2024
🔥 AWQ has been integrated by Google Vertex AI!
AWQ

Jan 2024
1/17/2024
"LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models" is to appear at ICLR 2024.
LongLoRA takes advantage of shifted sparse attention to greatly reduce the finetuning cost of long context LLMs.
LongLoRA

Jan 2024
1/17/2024
SmoothQuant is adopted by Microsoft ONNX Runtime.
SmoothQuant

Jan 2024
1/7/2024
StreamingLLM is integrated by HPC-AI Tech SwiftInfer to support infinite input length for LLM inference.
StreamingLLM

Jan 2024
1/2/2024
StreamingLLM is integrated into NVIDIA TensorRT-LLM!
StreamingLLM

Our Full-Stack Projects

MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning

NeurIPS 2021

In MCUNetV2, we propose a generic patch-by-patch inference scheduling, which operates only on a small spatial region of the feature map and significantly cuts down the peak memory. We further propose network redistribution to shift the receptive field and FLOPs to the later stage and reduce the computation overhead.
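
To make the idea of operating on a small spatial region concrete, here is a minimal Python sketch of a single convolution computed one strip of rows at a time. Everything in it is an assumption for illustration: the helper names `conv2d_valid` and `conv2d_by_patches`, the use of horizontal strips instead of 2D patches, and the single-layer setting. It is not the MCUNetV2 scheduler, and it only shows that per-region computation can reproduce the full result while touching a few input rows at once.

```python
def conv2d_valid(image, kernel):
    """Plain 2D cross-correlation with no padding ("valid" mode)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def conv2d_by_patches(image, kernel, patch_rows):
    """Compute the same output one strip of `patch_rows` output rows at a time.

    Each strip only needs its own input rows plus a (kh - 1)-row halo,
    so the working set is a small slice of the input rather than all of it.
    """
    kh = len(kernel)
    out_h = len(image) - kh + 1
    output = []
    for start in range(0, out_h, patch_rows):
        stop = min(start + patch_rows, out_h)
        # Output rows [start, stop) depend on input rows [start, stop + kh - 1).
        strip = image[start:stop + kh - 1]
        output.extend(conv2d_valid(strip, kernel))
    return output


# Small integer example so the comparison is exact.
image = [[(r * 7 + c * 3) % 11 for c in range(8)] for r in range(8)]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]

full = conv2d_valid(image, kernel)
patched = conv2d_by_patches(image, kernel, patch_rows=2)
print(patched == full)  # strip-wise result matches the full convolution
```

In this sketch neighbouring strips read the same halo rows. When several layers are run per region the halo grows with the receptive field, which is one plausible reading of the computation overhead that the description above says network redistribution reduces.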

PointAcc: Efficient Point Cloud Accelerator

MICRO 2021

PointAcc is a novel point cloud deep learning accelerator. It introduces a configurable sorting-based mapping unit that efficiently supports diverse operations in point cloud networks. PointAcc further exploits simplified caching and layer fusion specialized for point cloud models, effectively reducing the DRAM access.

SemAlign: Annotation-Free Camera-LiDAR Calibration with Semantic Alignment Loss

IROS 2021

Multi-sensor fusion is important in real-world robotics systems, but aligning different sensors through calibration is challenging and requires hours of human effort. To this end, we propose SemAlign, which does not require ground-truth calibration annotations and automates the process of camera-3D calibration.

Anycost GANs for Interactive Image Synthesis and Editing

CVPR 2021

Anycost GAN generates consistent outputs under various, fine-grained computation budgets.

Our Impacts

We actively collaborate with industry partners on efficient AI, model compression and acceleration. Our research has influenced and landed in many industrial products: Intel OpenVino, Intel Neural Network Distiller, Intel Neural Compressor, Apple Neural Engine, NVIDIA Sparse Tensor Core, NVIDIA TensorRT LLM, AMD-Xilinx Vitis AI, Qualcomm AI Model Efficiency Toolkit (AIMET), Amazon AutoGluon, Facebook PyTorch, Microsoft NNI, SONY Neural Architecture Search Library, SONY Model Compression Toolkit, ADI MAX78000/MAX78002 Model Training and Synthesis Tool.