Efficient AI Computing,
Transforming the Future.

Who We Are

Welcome to MIT HAN Lab! We focus on making AI faster, smarter, and more efficient. Our research covers a broad spectrum, including generative AI (e.g., LLMs and diffusion models), TinyML, system optimization, and hardware design. By integrating algorithm and hardware expertise, we strive to push the frontiers of AI efficiency and performance.

Graduated PhD students: Ji Lin (OpenAI), Hanrui Wang (assistant professor @UCLA), Zhijian Liu (assistant professor @UCSD), Han Cai (NVIDIA Research), Haotian Tang (Google DeepMind), Yujun Lin (NVIDIA Research).

Highlights

Accelerating LLM and Generative AI [slides]:

  • LLM Quantization: AWQ and TinyChat enable on-device LLM inference with 4-bit quantization (best paper award at MLSys'24), with 19 million downloads on HuggingFace. SmoothQuant is a training-free, accuracy-preserving 8-bit post-training quantization (PTQ) solution for LLMs. QServe speeds up large-scale LLM serving with W4A8KV4 quantization (4-bit weights, 8-bit activations, and 4-bit KV cache). COAT enables memory-efficient FP8 training.
  • Long Context LLM: StreamingLLM enables LLMs to generate infinite-length texts with a fixed memory budget by preserving the "attention sinks" in the KV cache. Quest leverages query-aware sparsity in the long-context KV cache to boost inference throughput. DuoAttention reduces LLM decoding and pre-filling memory and latency using retrieval heads and streaming heads. LServe accelerates long-context LLM serving with a hardware-aware unified sparse attention framework.
  • Efficient Visual Generation: HART is an autoregressive visual generation model capable of directly generating 1024×1024 images on a laptop. SANA enables 4K image synthesis under low computation, using deep compression auto-encoder (DC-AE) and linear diffusion transformer. SVDQuant further enables 4-bit diffusion models (W4A4) by absorbing the outliers with low-rank components.
  • Efficient Visual Language Models: VILA, VILA-U, and LongVILA are a family of efficient visual language models for both understanding and generation. LongVILA efficiently scales to 6K frames of video.
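To illustrate the common building block behind these quantization systems, here is a minimal pure-Python sketch of symmetric group-wise 4-bit weight quantization. This shows only the basic idea that methods like AWQ and QServe refine (AWQ's activation-aware rescaling of salient channels is not shown, and the function names are hypothetical):

```python
def quantize_int4_groupwise(weights, group_size=4):
    """Symmetric group-wise INT4 quantization: each group shares one
    floating-point scale; values are rounded to integers in [-8, 7]."""
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(v) for v in group) / 7 or 1.0  # all-zero group -> scale 1.0
        scales.append(scale)
        codes.append([max(-8, min(7, round(v / scale))) for v in group])
    return codes, scales

def dequantize_int4_groupwise(codes, scales):
    """Reconstruct approximate weights from INT4 codes and per-group scales."""
    return [c * s for group, s in zip(codes, scales) for c in group]

w = [0.12, -0.05, 0.31, 0.02, 1.50, -0.90, 0.07, 0.44]
codes, scales = quantize_int4_groupwise(w)
w_hat = dequantize_int4_groupwise(codes, scales)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))  # bounded by scale / 2 per group
```

Real deployments pack two INT4 codes per byte and fuse dequantization into the GEMM kernel; the group size (commonly 128 in practice) trades reconstruction accuracy against per-group metadata overhead.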

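The "attention sinks" idea behind StreamingLLM can be sketched as a KV-cache policy: keep the first few tokens (the sinks) plus a sliding window of the most recent tokens, so cache memory stays constant no matter how long generation runs. A toy sketch (the function name and default sizes are illustrative, not the actual implementation):

```python
def sink_cache_positions(seq_len, n_sink=4, window=1020):
    """Token positions kept in the KV cache under a StreamingLLM-style
    policy: the first n_sink 'attention sink' tokens plus the most
    recent `window` tokens. Cache size is fixed as seq_len grows."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))  # everything still fits
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))

kept = sink_cache_positions(20, n_sink=4, window=6)
```

Evicting the sink tokens instead would degrade generation quality, because the model concentrates a large share of attention mass on those early positions.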
We Work On

Large models for Artificial Intelligence Generated Content (AIGC), including cutting-edge technologies such as Large Language Models (LLMs) and Diffusion Models, have revolutionized a wide range of applications spanning natural language processing, content generation, the creative arts, and more. However, their large size and high memory and computational requirements present formidable challenges. We aim to tackle these hurdles head-on, making these advanced AI technologies practical and democratizing access to these future-changing technologies for everyone.

Efficient AI Algorithm

News

  • Feb 2024

    Congrats to Hanrui Wang on being named a Rising Star in Solid-State Circuits at ISSCC.
  • Feb 2024

    We released a new version of the quantized GEMM/GEMV kernels in TinyChat, achieving an inference speed of 38 tokens/second on NVIDIA Jetson Orin!

  • Jan 2024

    StreamingLLM has been integrated into HPC-AI Tech's SwiftInfer to support infinite input length for LLM inference.

  • Dec 2023

    StreamingLLM has been integrated by CMU, UW, and OctoAI, enabling endless and efficient LLM generation on iPhone!

  • Dec 2023

    Congrats to Ji Lin on completing and defending his PhD thesis, "Efficient Deep Learning Computing: From TinyML to Large Language Model". Ji joined OpenAI after graduation.

Our Full-Stack Projects


QuantumNAS: Noise-Adaptive Search for Robust Quantum Circuits

HPCA 2022 (oral)

Design of Variational Quantum Algorithm Program

Network Augmentation for Tiny Deep Learning

ICLR 2022

NetAug is a training technique for tiny neural networks. It embeds a tiny neural network into larger neural networks as a sub-network so that it receives more guidance during training. NetAug consistently improves the performance of tiny models, achieving up to 2.2% accuracy improvement on ImageNet.
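As a rough illustration of this weight-sharing idea (not NetAug's actual implementation), consider a linear model whose tiny version is the first-k slice of an augmented weight vector; one training step adds the tiny model's gradient to the augmented model's gradient on the shared entries, so the shared weights get extra guidance:

```python
def linear(w, x):
    """Prediction of a linear model: dot product of weights and inputs."""
    return sum(wi * xi for wi, xi in zip(w, x))

def grad_mse(w, x, y):
    """Gradient of the squared error (linear(w, x) - y)**2 w.r.t. each weight."""
    err = linear(w, x) - y
    return [2 * err * xi for xi in x]

def netaug_step(w_aug, k, x, y, lr=0.01):
    """One NetAug-style update (illustrative): the tiny model is the
    first-k slice of the augmented weights; its gradient is added to
    the augmented model's gradient on the shared entries."""
    g_aug = grad_mse(w_aug, x, y)
    g_tiny = grad_mse(w_aug[:k], x[:k], y)
    return [wi - lr * (g + (g_tiny[i] if i < k else 0.0))
            for i, (wi, g) in enumerate(zip(w_aug, g_aug))]

# One step: shared weights (indices 0-1) receive both gradient terms.
w_next = netaug_step([0.0, 0.0, 0.0, 0.0], k=2, x=[1.0, 1.0, 1.0, 1.0], y=1.0)
```

At deployment only the tiny slice is kept; the augmented weights exist solely to shape training.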

NAAS: Neural Accelerator Architecture Search

DAC 2021

As a data-driven approach, NAAS holistically composes highly matched accelerator and neural architectures together with efficient compiler mapping.
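A toy sketch of the co-search idea follows (NAAS itself uses a data-driven search rather than the exhaustive enumeration below, and the search spaces and cost model here are entirely hypothetical): accelerator and network configurations are evaluated jointly, since the best network depends on the hardware and vice versa.

```python
import itertools

def joint_search(hw_space, net_space, cost_fn):
    """Exhaustive toy co-search in the spirit of NAAS (illustrative):
    evaluate every (accelerator, network) pair and keep the cheapest."""
    best, best_cost = None, float("inf")
    for hw_vals in itertools.product(*hw_space.values()):
        hw = dict(zip(hw_space, hw_vals))
        for net_vals in itertools.product(*net_space.values()):
            net = dict(zip(net_space, net_vals))
            c = cost_fn(hw, net)
            if c < best_cost:
                best, best_cost = (hw, net), c
    return best, best_cost

# Hypothetical spaces and cost model: latency shrinks with a larger
# processing-element (PE) array, while area cost grows with it.
hw_space = {"pe_rows": [8, 16, 32], "pe_cols": [8, 16, 32]}
net_space = {"channels": [16, 32, 64]}

def toy_cost(hw, net):
    pes = hw["pe_rows"] * hw["pe_cols"]
    return net["channels"] / pes + 0.001 * pes

(best_hw, best_net), best_cost = joint_search(hw_space, net_space, toy_cost)
```

A real co-search would also score accuracy and use a learned or evolutionary sampler, since realistic joint spaces are far too large to enumerate.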

Delayed Gradient Averaging: Tolerate the Communication Latency in Federated Learning

NeurIPS 2021

We propose Delayed Gradient Averaging (DGA), which delays the averaging step to improve efficiency and allows local computation in parallel to communication.
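A minimal simulation of the idea (illustrative only; gradients are precomputed so the delayed-correction arithmetic is easy to check): each worker applies its local gradient immediately and, `delay` steps later, applies a correction that swaps its stale local gradient for the group average. With fixed gradient sequences, the final parameters match fully synchronous averaging:

```python
def dga_run(grads, delay, lr=0.1, w0=0.0):
    """Delayed Gradient Averaging (illustrative sketch): grads[k][t] is
    worker k's gradient at step t. Workers step immediately on local
    gradients; the all-reduce result for step t-delay arrives later and
    is applied as a correction, hiding communication behind compute."""
    n, T = len(grads), len(grads[0])
    w = [w0] * n
    for t in range(T):
        for k in range(n):
            w[k] -= lr * grads[k][t]              # immediate local step
        s = t - delay
        if s >= 0:                                # delayed all-reduce arrives
            avg = sum(g[s] for g in grads) / n
            for k in range(n):
                w[k] -= lr * (avg - grads[k][s])  # swap stale local grad for avg
    for s in range(max(T - delay, 0), T):         # flush pending corrections
        avg = sum(g[s] for g in grads) / n
        for k in range(n):
            w[k] -= lr * (avg - grads[k][s])
    return w

grads = [[1.0, 2.0, 3.0], [3.0, 0.0, 1.0]]        # two workers, three steps
final = dga_run(grads, delay=2)                   # both end at -0.1 * (2 + 1 + 2)
```

In real training the gradients depend on the current weights, so DGA introduces a small staleness error; the paper's point is that this error is tolerable while the latency savings are large.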

Our Impacts

We actively collaborate with industry partners on efficient AI, model compression and acceleration. Our research has influenced and landed in many industrial products: Intel OpenVino, Intel Neural Network Distiller, Intel Neural Compressor, Apple Neural Engine, NVIDIA Sparse Tensor Core, NVIDIA TensorRT LLM, AMD-Xilinx Vitis AI, Qualcomm AI Model Efficiency Toolkit (AIMET), Amazon AutoGluon, Facebook PyTorch, Microsoft NNI, SONY Neural Architecture Search Library, SONY Model Compression Toolkit, ADI MAX78000/MAX78002 Model Training and Synthesis Tool.