Song Han is an associate professor at MIT EECS. He earned his PhD from Stanford, where he pioneered efficient AI computing techniques such as “Deep Compression” (pruning and quantization) and the “Efficient Inference Engine” (EIE), which first introduced weight sparsity to modern AI chips; the EIE paper is among the top-5 most cited in the 50-year history of ISCA (1953-2023). His innovations, including TinyML and hardware-aware neural architecture search (Once-for-All Network), have advanced AI model deployment on resource-constrained devices.
His recent work on LLM quantization and acceleration (SmoothQuant, AWQ, StreamingLLM) has improved the efficiency of LLM inference and has been adopted by NVIDIA TensorRT-LLM. Song received best paper awards at ICLR'16, FPGA'17, and MLSys'24, as well as the NSF CAREER Award, “35 Innovators Under 35,” IEEE “AI’s 10 to Watch,” and the Sloan Research Fellowship. He co-founded DeePhi (now part of AMD) and OmniML (now part of NVIDIA) and developed the open lecture series EfficientML.ai to share advances in efficient ML research.
Graduated PhD students: Ji Lin (OpenAI), Hanrui Wang (assistant professor @UCLA), Zhijian Liu (assistant professor @UCSD), Han Cai (NVIDIA Research), Haotian Tang (Google DeepMind).
Recent work: accelerating LLMs and generative AI [slides]
Generative AI models are dramatically larger (>1000x) than traditional predictive AI models, presenting new computational challenges. We innovate in quantization, parallelization, KV cache optimization, long-context learning, and multi-modal representation learning to minimize GenAI costs.
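To make the quantization idea concrete, here is a minimal NumPy sketch of group-wise weight-only quantization, the general pattern behind low-bit LLM weight quantization. The group size, the symmetric INT4 scheme, and the function names are illustrative assumptions, not the published SmoothQuant or AWQ algorithms.

```python
# Minimal sketch of group-wise weight-only quantization (illustrative only,
# not the published AWQ/SmoothQuant algorithms). Weights are split into groups
# along the input dimension; each group gets its own scale so that outliers in
# one group do not inflate the quantization error of the others.
import numpy as np

def quantize_groupwise(w: np.ndarray, group_size: int = 128, n_bits: int = 4):
    """Symmetric per-group quantization of a 2D weight matrix [out, in]."""
    out_dim, in_dim = w.shape
    assert in_dim % group_size == 0, "in_dim must be divisible by group_size"
    q_max = 2 ** (n_bits - 1) - 1                      # e.g. 7 for INT4
    w_groups = w.reshape(out_dim, in_dim // group_size, group_size)
    scales = np.abs(w_groups).max(axis=-1, keepdims=True) / q_max
    scales = np.maximum(scales, 1e-8)                  # avoid division by zero
    q = np.clip(np.round(w_groups / scales), -q_max - 1, q_max)
    return q.astype(np.int8), scales

def dequantize_groupwise(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover an approximate float weight matrix from the quantized groups."""
    w_groups = q.astype(np.float32) * scales
    return w_groups.reshape(q.shape[0], -1)

# Usage: quantize a random weight matrix and check the reconstruction error.
w = np.random.randn(256, 1024).astype(np.float32)
q, s = quantize_groupwise(w)
w_hat = dequantize_groupwise(q, s)
print("mean abs error:", np.abs(w - w_hat).mean())
```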
I pioneered the area of model compression, which can shrink neural networks by >10x without hurting accuracy. Through pruning, quantization, and neural architecture search, we can fit neural networks onto microcontrollers (MCUs). We also enable on-device training on MCUs with 1000x less memory.
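As an illustration of the pruning step, the sketch below applies global magnitude pruning to a set of weight matrices. It is a simplified stand-in, not the full Deep Compression or TinyML pipeline (which also involves retraining, quantization, and architecture search); the 90% sparsity target and the helper names are assumptions made for the example.

```python
# Minimal sketch of global magnitude pruning, one ingredient of model
# compression. Weights whose magnitude falls below a threshold chosen from the
# target sparsity are zeroed out; the masks would normally be reused during
# fine-tuning to keep the pruned weights at zero.
import numpy as np

def magnitude_prune(weights: dict, sparsity: float = 0.9):
    """Zero out the smallest-magnitude weights so `sparsity` of them are removed."""
    all_mags = np.concatenate([np.abs(w).ravel() for w in weights.values()])
    threshold = np.quantile(all_mags, sparsity)        # global magnitude threshold
    pruned, masks = {}, {}
    for name, w in weights.items():
        mask = np.abs(w) > threshold
        pruned[name] = w * mask
        masks[name] = mask                             # keep masks for fine-tuning
    return pruned, masks

# Usage: prune two toy layers to roughly 90% sparsity.
layers = {"fc1": np.random.randn(128, 64), "fc2": np.random.randn(64, 10)}
pruned, masks = magnitude_prune(layers, sparsity=0.9)
kept = sum(m.sum() for m in masks.values()) / sum(m.size for m in masks.values())
print(f"fraction of weights kept: {kept:.2f}")
```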
Sparsity arises in neural networks when not all neurons are connected. I designed EIE, the first hardware accelerator to exploit weight sparsity. I identify new sources of sparsity in modern AI, including sparse attention, token pruning, and point clouds, and build efficient systems and accelerators to exploit them.
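To show how weight sparsity translates into skipped work, the sketch below stores a pruned weight matrix in CSR form and computes the matrix-vector product over only the nonzeros. This is an illustrative software analogue of the computation a sparsity-aware accelerator such as EIE performs, not the EIE design itself; the CSR layout and loop structure here are assumptions for the example.

```python
# Minimal sketch of exploiting weight sparsity in inference: store only the
# nonzero weights in CSR form (values, column indices, row pointers) and skip
# the zeros during the matrix-vector product.
import numpy as np

def to_csr(w: np.ndarray):
    """Convert a dense 2D matrix to CSR form."""
    values, col_idx, row_ptr = [], [], [0]
    for row in w:
        nz = np.nonzero(row)[0]
        values.extend(row[nz])
        col_idx.extend(nz)
        row_ptr.append(len(values))
    return np.array(values), np.array(col_idx), np.array(row_ptr)

def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = W @ x, touching only the stored nonzeros."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        start, end = row_ptr[i], row_ptr[i + 1]
        y[i] = values[start:end] @ x[col_idx[start:end]]
    return y

# Usage: a ~90%-sparse weight matrix gives the same result as the dense product.
w = np.random.randn(64, 128) * (np.random.rand(64, 128) > 0.9)
x = np.random.randn(128)
vals, cols, ptrs = to_csr(w)
print(np.allclose(csr_matvec(vals, cols, ptrs, x), w @ x))
```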
EfficientML.ai open lectures: Tuesday/Thursday, 3:35-5:00pm Eastern Time.
Our efficient ML research has influenced and landed in many industry products, thanks to close collaboration with our sponsors: Intel OpenVINO, Intel Neural Compressor, Apple Neural Engine, NVIDIA Sparse Tensor Core, NVIDIA FasterTransformer, AMD-Xilinx Vitis AI, Qualcomm AI Model Efficiency Toolkit (AIMET), Amazon AutoGluon, Microsoft NNI, Sony Neural Architecture Search Library, Sony Model Compression Toolkit, ADI MAX78000/MAX78002 Model Training and Synthesis Tool, and Ford Trailer Backup Assist.
DistriFusion is integrated into NVIDIA's TensorRT-LLM for distributed inference on high-resolution image generation.
Email: FirstnameLastname [at] mit [dot] edu
If you work on efficient LLMs, VLMs, or GenAI and are interested in joining my lab, please fill out the recruiting form. I do not reply to inquiry emails if the recruiting form is incomplete.
PhD applicants: select "ML+System" track in the MIT PhD application system.