TinyML

Projects

LEGO: Spatial Accelerator Generation and Optimization for Tensor Applications

HPCA 2025

(

)

LEGO is an automatic RTL generator for AI accelerators.

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

MLSys 2024

(

)

Low-bit weight-only quantization for LLMs.

Tiny Machine Learning Projects

NeurIPS 2020/2021/2022, MICRO 2023, ICML 2023, MLSys 2024, IEEE CAS Magazine 2023

(

Feature

)

This TinyML project aims to enable efficient AI computing on the edge by innovating model compression techniques as well as high-performance system design.

PockEngine: Sparse and Efficient Fine-tuning in a Pocket

MICRO 2023

(

)

This project introduce PockEngine: a tiny, sparse and efficient engine to enable fine-tuning on various edge devices. PockEngine supports sparse backpropagation: it prunes the backward graph and sparsely updates the model with measured memory saving and latency reduction while maintaining the model quality.

FlatFormer: Flattened Window Attention for Efficient Point Cloud Transformer

CVPR 2023

(

)

We present FlatFormer, an efficient ViT architecture for large-scale point cloud analysis.

Retrospective: EIE: Efficient Inference Engine on Sparse and Compressed Neural Network

ISCA 2023

(

)

EIE proposed to accelerate pruned and compressed neural networks, exploiting weight sparsity, activation sparsity, and 4-bit weight-sharing in neural network accelerators.

Blog Posts

SVDQuant: Accurate 4-Bit Quantization Powers 12B FLUX on a 16GB 4090 Laptop with 3x Speedup

November 7, 2024

A new post-training training quantization paradigm for diffusion models, which quantize both the weights and activations of FLUX.1 to 4 bits, achieving 3.5× memory and 8.7× latency reduction on a 16GB laptop 4090 GPU. Code: https://www.github.com/mit-han-lab/nunchaku

On-Device Training Under 256KB Memory

November 28, 2022

In MCUNetV3, we enable on-device training under 256KB SRAM and 1MB Flash, using less than 1/1000 memory of PyTorch while matching the accuracy on the visual wake words application. It enables the model to adapt to newly collected sensor data and users can enjoy customized services without uploading the data to the cloud thus protecting privacy.

Reducing the carbon footprint of AI using the Once-for-All network

July 3, 2020

“The aim is smaller, greener neural networks,” says Song Han, an assistant professor in the Department of Electrical Engineering and Computer Science. “Searching efficient neural network architectures has until now had a huge carbon footprint. But we reduced that footprint by orders of magnitude with these new methods.”

Auto Hardware-Aware Neural Network Specialization on ImageNet in Minutes

July 2, 2020

This tutorial introduces how to use the Once-for-All (OFA) Network to get specialized ImageNet models for the target hardware in minutes with only your laptop.

Efficiently Understanding Videos, Point Cloud and Natural Language on NVIDIA Jetson Xavier NX

May 22, 2020

Thanks to NVIDIA’s amazing deep learning eco-system, we are able to deploy three applications on Jetson Xavier NX soon after we receive the kit, including efficient video understanding with Temporal Shift Module (TSM, ICCV’19), efficient 3D deep learning with Point-Voxel CNN (PVCNN, NeurIPS’19), and efficient machine translation with hardware-aware transformer (HAT, ACL’20).