Song Han

Associate Professor, MIT EECS

About

Song Han is an associate professor at MIT EECS. He earned his PhD from Stanford, where he pioneered efficient AI computing techniques such as “Deep Compression” (pruning, quantization) and the “Efficient Inference Engine” (EIE), which first introduced weight sparsity to modern AI chips; the EIE paper is among the top-5 most cited in the 50-year history of ISCA (1953-2023). His innovations, including TinyML and hardware-aware neural architecture search (Once-for-All Network), have advanced the deployment of AI models on resource-constrained devices.

His recent work on LLM quantization and acceleration (SmoothQuant, AWQ, StreamingLLM) has improved the efficiency of LLM inference and has been adopted by NVIDIA TensorRT-LLM. Song received best paper awards at ICLR'16, FPGA'17, and MLSys'24, the NSF CAREER Award, MIT Technology Review's “35 Innovators Under 35,” IEEE “AI’s 10 to Watch,” and the Sloan Research Fellowship. He co-founded DeePhi (now part of AMD) and OmniML (now part of NVIDIA) and developed the open lecture series EfficientML.ai to share advances in efficient ML research.

Graduated PhD students: Ji Lin (OpenAI), Hanrui Wang (assistant professor @UCLA), Zhijian Liu (assistant professor @UCSD), Han Cai (NVIDIA Research), Haotian Tang (Google DeepMind).

Recent work: accelerating LLM and generative AI [slides]

  • LLM Quantization: AWQ and TinyChat received the Best Paper Award at MLSys'24 for on-device LLM inference with 4-bit quantization, and have over 6 million downloads on Hugging Face. [Github] SmoothQuant is a training-free, accuracy-preserving 8-bit post-training quantization (PTQ) solution for LLMs. [Github] QServe speeds up large-scale LLM serving with W4A8KV4 quantization (4-bit weights, 8-bit activations, and 4-bit KV cache). [Github] (A toy sketch of group-wise 4-bit weight quantization appears after this list.)
  • Long-Context LLM: StreamingLLM enables LLMs to generate infinite-length texts with a fixed memory budget by preserving the "attention sinks" in the KV cache. [Github] Quest leverages query-aware sparsity in the long-context KV cache to boost inference throughput. [Github] DuoAttention reduces both decoding and pre-filling memory and latency for LLMs using retrieval heads and streaming heads. [Github] (The attention-sink eviction policy is sketched after this list.)
  • Efficient Visual Generation: HART is an autoregressive visual generation model capable of directly generating 1024×1024 images on a laptop. SANA enables 4K image synthesis at low computational cost, using a deep compression autoencoder (DC-AE) and a linear diffusion transformer. SVDQuant further enables 4-bit diffusion models (W4A4) by absorbing the outliers with low-rank components. (Its low-rank-plus-residual split is also sketched after this list.)
  • Efficient Visual Language Models: VILA, VILA-U, VILA^2, and LongVILA are a family of efficient visual language models for both understanding and generation. VILA^2 augments VILA's training data with VILA itself (self-augmentation). LongVILA efficiently scales to 1K frames of video. [Github]
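
To make the 4-bit quantization concrete, here is a minimal NumPy sketch of group-wise symmetric INT4 weight quantization, the storage and dequantization arithmetic that W4 methods like AWQ build on. It omits AWQ's activation-aware scale search entirely; the function names and group size are illustrative, not taken from the released code.

    import numpy as np

    def quantize_w4_groupwise(w: np.ndarray, group_size: int = 128):
        """Symmetric INT4 quantization with one scale per group of
        `group_size` input channels (a toy sketch, not the AWQ algorithm)."""
        out_features, in_features = w.shape
        assert in_features % group_size == 0
        groups = w.reshape(out_features, in_features // group_size, group_size)
        # Map each group's max magnitude to the INT4 limit (7).
        scales = np.maximum(np.abs(groups).max(axis=-1, keepdims=True) / 7.0, 1e-8)
        q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
        return q, scales

    def dequantize_w4(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
        """Recover an approximate FP32 weight matrix from the INT4 codes."""
        return (q.astype(np.float32) * scales).reshape(q.shape[0], -1)

    # Quantize a random layer and check the reconstruction error.
    w = np.random.randn(256, 512).astype(np.float32)
    q, scales = quantize_w4_groupwise(w)
    print("mean |error|:", np.abs(w - dequantize_w4(q, scales)).mean())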
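The "attention sink" idea behind StreamingLLM can likewise be sketched as an eviction policy: when the KV cache is full, keep the first few tokens (the sinks) plus a sliding window of recent tokens, and evict the middle. The toy cache below shows only this policy; the real system also reassigns positions within the retained cache. All names are illustrative.

    class SinkKVCache:
        """Toy StreamingLLM-style cache: keep `n_sink` initial tokens plus
        the most recent `window` tokens; evict everything in between."""

        def __init__(self, n_sink: int = 4, window: int = 1020):
            self.n_sink, self.window = n_sink, window
            self.keys, self.values = [], []

        def append(self, k, v):
            self.keys.append(k)
            self.values.append(v)
            if len(self.keys) > self.n_sink + self.window:
                # Evict the oldest non-sink entry, so memory stays fixed
                # no matter how long generation runs.
                del self.keys[self.n_sink]
                del self.values[self.n_sink]

    cache = SinkKVCache(n_sink=4, window=8)
    for t in range(100):            # simulate 100 decoding steps
        cache.append(f"k{t}", f"v{t}")
    print(cache.keys)               # k0..k3 (sinks) + the 8 most recent keys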
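SVDQuant's core decomposition is also easy to sketch: split a weight matrix into a small high-precision low-rank branch that absorbs the outliers, plus a residual quantized to 4 bits. The sketch below is again a toy, using a simple per-row absmax quantizer and illustrative names, and shows only the weight side; the actual method also handles 4-bit activations.

    import numpy as np

    def quantize_int4(x: np.ndarray):
        """Per-row symmetric INT4 quantization (absmax), toy version."""
        scale = np.maximum(np.abs(x).max(axis=1, keepdims=True) / 7.0, 1e-8)
        return np.clip(np.round(x / scale), -8, 7).astype(np.int8), scale

    def svdquant_sketch(w: np.ndarray, rank: int = 16):
        """Approximate w as (FP32 low-rank branch) + (INT4 residual)."""
        u, s, vt = np.linalg.svd(w, full_matrices=False)
        # The largest singular directions, where outliers concentrate,
        # stay in high precision; only the residual is quantized.
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
        q, scale = quantize_int4(w - low_rank)
        return low_rank, q, scale

    w = np.random.randn(256, 512).astype(np.float32)
    low_rank, q, scale = svdquant_sketch(w)
    w_hat = low_rank + q.astype(np.float32) * scale
    print("reconstruction error:", np.abs(w - w_hat).mean())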

Teaching

  • Time: Tuesday/Thursday 3:35-5:00pm Eastern Time
  • Location: 34-101

Selected lectures:

  • Lecture 1: Introduction
  • Lecture 12: Transformer and LLM (I)
  • Lecture 13: Transformer and LLM (II)
  • Lecture 16: Diffusion Model

Research

The incredible potential of large models in Artificial Intelligence Generated Content (AIGC), including cutting-edge technologies like Large Language Models (LLMs) and diffusion models, has revolutionized a wide range of applications spanning natural language processing, content generation, the creative arts, and more. However, large model sizes and high memory and computational requirements present formidable challenges. We aim to tackle these hurdles head-on and make these advanced AI technologies practical, democratizing access to these future-changing technologies for everyone.
Efficiency improvements in deep learning often start with refining algorithms, but theoretical gains such as fewer FLOPs and smaller model size do not automatically translate into practical speed and energy savings. Bridging this gap demands specialized hardware and software systems, which form a fresh design dimension independent of the algorithm space. This opens up opportunities for holistic optimization by co-designing the algorithm together with the software/hardware system.
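
A quick experiment (requires NumPy and SciPy; exact numbers are machine-dependent) illustrates the gap: pruning 90% of a weight matrix cuts FLOPs by 10x, yet a generic unstructured-sparse kernel rarely comes close to a 10x wall-clock speedup over an optimized dense matmul, which is exactly why sparsity support must be co-designed into the hardware and kernels.

    import time
    import numpy as np
    import scipy.sparse as sp

    n = 4096
    w = np.random.randn(n, n).astype(np.float32)
    w[np.random.rand(n, n) < 0.9] = 0.0      # prune 90% of weights: 10x fewer FLOPs
    x = np.random.randn(n, 64).astype(np.float32)
    w_csr = sp.csr_matrix(w)                 # generic unstructured-sparse format

    def bench(f, reps=10):
        f()                                  # warm-up
        t0 = time.perf_counter()
        for _ in range(reps):
            f()
        return (time.perf_counter() - t0) / reps

    # The theoretical 10x FLOP reduction usually does not appear as a 10x
    # wall-clock speedup on commodity hardware.
    print(f"dense : {bench(lambda: w @ x) * 1e3:.2f} ms")
    print(f"sparse: {bench(lambda: w_csr @ x) * 1e3:.2f} ms")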

Industry Impact

Our efficient ML research has influenced and landed in many industry products, thanks to close collaboration with our sponsors: Intel OpenVINO, Intel Neural Compressor, Apple Neural Engine, NVIDIA Sparse Tensor Core, NVIDIA FasterTransformer, AMD-Xilinx Vitis AI, Qualcomm AI Model Efficiency Toolkit (AIMET), Amazon AutoGluon, Microsoft NNI, SONY Neural Architecture Search Library, SONY Model Compression Toolkit, ADI MAX78000/MAX78002 Model Training and Synthesis Tool, and Ford Trailer Backup Assist.

Open source projects with over 1K GitHub stars:

Honors and Awards

  • 6/13/2024: Best Paper Award, MLSys 2024 (AWQ)
  • 7/15/2023: Best Demo Award, DAC University Demo (SpAtten)
  • 6/15/2023: Top-5 cited paper in 50 years of ISCA (EIE Retrospective)
  • 5/1/2023: 2023 Sloan Research Fellowship
  • 5/1/2022: 2022 Red Dot Award
  • 5/1/2021: 2021 Samsung Global Research Outreach (GRO) Award
  • 5/1/2021: 2021 NVIDIA Academic Partnership Award
  • 5/1/2020: 2020 NVIDIA Academic Partnership Award
  • 5/1/2020: 2020 IEEE "AI's 10 to Watch: The Future of AI" Award
  • 5/1/2020: 2020 NSF CAREER Award
  • 5/1/2020: 2020 SONY Faculty Award
  • 5/1/2019: 2019 MIT Technology Review list of 35 Innovators Under 35
  • 5/1/2019: 2019 Amazon Machine Learning Research Award
  • 5/1/2019: 2019 Facebook Research Award
  • 5/1/2018: 2018 SONY Faculty Award
  • 5/1/2018: 2018 Amazon Machine Learning Research Award
  • 5/15/2017: Best Paper Award, FPGA 2017
  • 5/1/2017: 2017 SONY Faculty Award
  • 5/15/2016: Best Paper Award, ICLR 2016

Competition Awards

  • First Place, ACM Quantum Computing for Drug Discovery Contest @ ICCAD 2023 (QuantumNAS)
  • First Place (1/150), ACM/IEEE TinyML Design Contest, Memory Occupation Track @ ICCAD 2022 (HAT)
  • First Place, 6th AI Driving Olympics, nuScenes Semantic Segmentation @ ICRA 2021 (SPVNAS)
  • First Place, SemanticKITTI Leaderboard, 3D Semantic Segmentation @ ECCV 2020 (SPVNAS)
  • First Place, Low-Power Computer Vision Challenge, CPU Detection and FPGA Tracks @ CVPR 2020 (OFA)
  • First Place, Low-Power Computer Vision Workshop, DSP Track @ ICCV 2019 (OFA)
  • First Place, Low-Power Image Recognition Challenge, Classification and Detection Tracks, IEEE 2019 (OFA)
  • First Place, Visual Wake Words Challenge, TF-Lite Track @ CVPR 2019 (ProxylessNAS)

News

  • Dec 2024: A new blog post, "TinyChat 2.0: Accelerating Edge AI with Efficient LLM and VLM Deployment," is published. TinyChat 2.0 significantly improves the prefilling speed of edge LLMs and VLMs: on top of the 3-4x decoding speedups achieved with AWQ quantization, it delivers state-of-the-art time-to-first-token, 1.5-1.7x faster than the legacy version of TinyChat.
  • Dec 2024: DistriFusion is integrated into NVIDIA's TensorRT-LLM for distributed inference on high-resolution image generation.

Contact

Email: FirstnameLastname [at] mit [dot] edu

If you work on efficient LLMs, VLMs, or GenAI and are interested in joining my lab, please fill in the recruiting form. I do not reply to inquiry emails if the recruiting form is incomplete.
PhD applicants: select "ML+System" track in the MIT PhD application system.