TinyML and Efficient Deep Learning

Intelligent edge devices with rich sensors (e.g., billions of mobile phones and IoT devices) have become ubiquitous in our lives. Deploying artificial intelligence (AI) to these edge devices is attractive for many real-world applications: smart home, smart retail, smart manufacturing, autonomous driving, and more. However, state-of-the-art deep learning systems typically require a tremendous amount of resources (e.g., memory, computation, data, AI experts), both for training and inference. This hinders the application of powerful deep learning systems on edge devices.

Our TinyML project aims to improve the efficiency of deep learning by requiring less computation, fewer engineers, and less data, from algorithm to hardware, from inference to training. We embrace the era of edge AI and AIoT.

PI: Prof. Song Han.

Figure 1: We aim to improve the efficiency of deep learning AI systems, including computation efficiency, engineer efficiency, and data efficiency. Our research enables powerful artificial intelligence (AI) algorithms to run more efficiently on low-power mobile devices.
Figure 2: TinyML has a wide range of applications.

Full-Stack TinyML Projects

Efficient Deep Learning Algorithms

AutoML / Neural Architecture Search

Once for All: Train One Network and Specialize it for Efficient Deployment
The once-for-all (OFA) network supports diverse architectural settings by decoupling training and search, reducing the cost of efficient inference across many devices and resource constraints, especially on edge devices.
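As a rough illustration of the decoupling, the sketch below performs only the specialization step: after the OFA network is trained, a sub-network configuration is searched under a deployment constraint without any retraining. The search space, latency model, and accuracy predictor here are made-up placeholders, not the paper's.

```python
import random

# Simplified search space and placeholder predictors; the real OFA uses
# elastic depth/width/kernel-size/resolution and learned accuracy/latency
# predictors built on top of the once-for-all super-network.
DEPTHS, WIDTHS, KERNELS = [2, 3, 4], [3, 4, 6], [3, 5, 7]

def predict_latency(cfg):        # placeholder device-specific latency model
    return cfg["depth"] * cfg["width"] * 0.8

def predict_accuracy(cfg):       # placeholder accuracy predictor
    return 70 + cfg["depth"] + cfg["width"] + 0.1 * cfg["kernel"]

def specialize(latency_budget, trials=1000):
    """Pick the best sub-network under the budget, with no retraining."""
    best = None
    for _ in range(trials):
        cfg = {"depth": random.choice(DEPTHS),
               "width": random.choice(WIDTHS),
               "kernel": random.choice(KERNELS)}
        if predict_latency(cfg) <= latency_budget and (
                best is None or predict_accuracy(cfg) > predict_accuracy(best)):
            best = cfg
    return best

print(specialize(latency_budget=16))
```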

MCUNet: Tiny Deep Learning on IoT Devices
MCUNet is a framework that jointly designs the efficient neural architecture (TinyNAS) and the lightweight inference engine (TinyEngine), enabling ImageNet-scale inference on microcontrollers.

ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware
ProxylessNAS directly learns architectures for large-scale target tasks and target hardware platforms. It addresses the high memory consumption issue of differentiable NAS and reduces the computational cost to the same level as regular training, while still allowing a large candidate set.
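The memory saving comes from binarizing the over-parameterized network's paths so that only one candidate operation is active per forward pass. Below is a toy sketch of such a single-path mixed operation; the paper's gradient estimator for updating the architecture parameters is omitted, and the candidate set is hypothetical.

```python
import torch
import torch.nn as nn

class BinarizedMixedOp(nn.Module):
    """Toy single-path mixed operation: architecture parameters cover all
    candidates, but only one sampled path is executed per forward pass, so
    activation memory stays at the level of a single network. (The paper's
    estimator for backpropagating into `alpha` is omitted in this sketch.)"""
    def __init__(self, candidates):
        super().__init__()
        self.ops = nn.ModuleList(candidates)
        self.alpha = nn.Parameter(torch.zeros(len(candidates)))  # architecture params

    def forward(self, x):
        probs = torch.softmax(self.alpha, dim=0)
        idx = torch.multinomial(probs, 1).item()   # binarize: sample one path
        return self.ops[idx](x)

# Hypothetical cell with three candidate operations:
mixed = BinarizedMixedOp([nn.Conv2d(16, 16, 3, padding=1),
                          nn.Conv2d(16, 16, 5, padding=2),
                          nn.Identity()])
y = mixed(torch.randn(1, 16, 32, 32))
```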

Model Compression

APQ: Joint Search for Network Architecture, Pruning and Quantization Policy
APQ jointly searches the neural architecture, pruning policy, and quantization policy. To handle the larger design space, APQ transfers the knowledge from a full-precision (i.e., fp32) accuracy predictor to the quantization-aware (i.e., int8) accuracy predictor to improve the sample efficiency.

AMC: AutoML for Model Compression and Acceleration on Mobile Devices
AutoML for Model Compression (AMC) leverages reinforcement learning to provide the model compression policy. This learning-based compression policy outperforms the conventional rule-based policy by achieving a higher compression ratio, better preserving accuracy, and freeing human labor.

HAQ: Hardware-Aware Automated Quantization
The Hardware-Aware Automated Quantization (HAQ) framework leverages reinforcement learning to automatically determine the quantization policy, and takes the hardware accelerator’s feedback into the design loop.

Efficient Training

TinyTL: Reduce Memory, Not Parameters for Efficient On-Device Learning
Tiny-Transfer-Learning (TinyTL) enables memory-efficient on-device learning by freezing the weights and learning only the bias modules, which removes the need to store intermediate activations, and by introducing a lite residual module to maintain the adaptation capacity.
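As a rough sketch of the bias-only part of this idea (the lite residual modules are omitted), freezing everything except the biases is a one-liner in PyTorch; per the paper's observation, bias gradients do not require storing the layers' input activations, which is what dominates training memory.

```python
import torch
import torch.nn as nn

def make_bias_only(model: nn.Module):
    """Freeze all weights, leaving only bias terms trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith("bias")
    return [p for p in model.parameters() if p.requires_grad]

# Hypothetical usage with any pretrained backbone:
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 8, 3))
optimizer = torch.optim.SGD(make_bias_only(model), lr=0.01)
```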

Deep Gradient Compression: Reducing the Communication Bandwidth in Distributed Training
Deep Gradient Compression (DGC) reduces the communication bandwidth in large-scale distributed training via four techniques: momentum correction, local gradient clipping, momentum factor masking, and warm-up training.
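The core of the bandwidth reduction is sending only the largest gradient entries and accumulating the rest locally. Below is a minimal sketch of that sparsification step alone; the four techniques listed above are left out.

```python
import torch

def dgc_compress(grad, residual, sparsity=0.999):
    """Top-k gradient sparsification with local accumulation (simplified).

    Gradients too small to send are kept in `residual` and accumulated,
    so they are eventually transmitted once they grow large enough.
    """
    acc = (residual + grad).flatten()
    k = max(1, int(acc.numel() * (1 - sparsity)))
    _, idx = torch.topk(acc.abs(), k)          # k largest-magnitude entries
    sparse = torch.zeros_like(acc)
    sparse[idx] = acc[idx]                     # values to communicate
    new_residual = (acc - sparse).view_as(grad)
    return sparse.view_as(grad), new_residual

# Hypothetical per-step usage for one parameter tensor:
grad = torch.randn(1024)
residual = torch.zeros_like(grad)
to_send, residual = dgc_compress(grad, residual)
```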

Differentiable Augmentation for Data-Efficient GAN Training
Differentiable Augmentation (DiffAugment) improves the data efficiency of GANs by imposing various types of differentiable augmentations on both real and fake samples, which effectively stabilizes training and leads to better convergence.
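Since the augmentations are differentiable, generator gradients can flow through them. The sketch below uses two illustrative transforms (the paper's policies include color, translation, and cutout) and shows where they sit in a GAN training loop.

```python
import torch

def rand_brightness(x):
    # Differentiable brightness jitter: gradients flow through the addition
    return x + (torch.rand(x.size(0), 1, 1, 1, device=x.device) - 0.5)

def rand_cutout(x, ratio=0.5):
    # Zero out a random square patch; the masking is differentiable w.r.t. x
    n, _, h, w = x.size()
    ch, cw = int(h * ratio), int(w * ratio)
    mask = torch.ones_like(x)
    for i in range(n):
        y0 = torch.randint(0, h - ch + 1, (1,)).item()
        x0 = torch.randint(0, w - cw + 1, (1,)).item()
        mask[i, :, y0:y0 + ch, x0:x0 + cw] = 0
    return x * mask

def diff_augment(x):
    return rand_cutout(rand_brightness(x))

# In the GAN training loop, the same augmentation is applied to both
# real and generated batches before they reach the discriminator:
#   loss_real = criterion(D(diff_augment(real_images)), 1)
#   loss_fake = criterion(D(diff_augment(G(z))), 0)
```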

Efficient Deep Learning Applications

3D Point Cloud Applications

Efficient and Robust LiDAR-Based End-to-End Navigation
Fast-LiDARNet achieves real-time performance on NVIDIA Jetson AGX Xavier. Together with Hybrid Evidential Fusion, it enables efficient and robust end-to-end LiDAR-based navigation (even in the case of occasional sensor failure).

Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution
SPVNAS enhances Point-Voxel Convolution in large-scale outdoor scenes with sparse convolutions. With 3D Neural Architecture Search (3D-NAS), it efficiently and effectively searches the optimal 3D neural network architecture under a given resource constraint.

Point-Voxel CNN for Efficient 3D Deep Learning
NeurIPS’19 Spotlight
PVCNN represents 3D data as points to reduce memory consumption, while performing convolutions in voxels to reduce irregular, sparse data access and improve locality.
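A toy rendition of the point-voxel idea: per-point features pass through a cheap point-wise MLP, while a coarsely voxelized copy passes through a regular 3D convolution for neighborhood aggregation, and the two branches are fused. The nearest-neighbor voxelization below is a simplification of the paper's scheme.

```python
import torch
import torch.nn as nn

class ToyPVConv(nn.Module):
    def __init__(self, channels, resolution=16):
        super().__init__()
        self.r = resolution
        self.point_mlp = nn.Conv1d(channels, channels, 1)        # per-point branch
        self.voxel_conv = nn.Conv3d(channels, channels, 3, padding=1)

    def forward(self, feats, coords):
        # feats: (B, C, N) point features; coords: (B, N, 3), normalized to [0, 1)
        b, c, n = feats.shape
        idx = (coords * self.r).long().clamp_(0, self.r - 1)     # voxel index per point
        flat = (idx[..., 0] * self.r + idx[..., 1]) * self.r + idx[..., 2]  # (B, N)
        # Voxelize: average the features of all points falling into each voxel
        grid = feats.new_zeros(b, c, self.r ** 3)
        count = feats.new_zeros(b, 1, self.r ** 3)
        grid.scatter_add_(2, flat.unsqueeze(1).expand(-1, c, -1), feats)
        count.scatter_add_(2, flat.unsqueeze(1), torch.ones_like(feats[:, :1]))
        grid = (grid / count.clamp(min=1)).view(b, c, self.r, self.r, self.r)
        # Regular 3D convolution aggregates neighborhood information
        grid = self.voxel_conv(grid).view(b, c, -1)
        # Devoxelize: each point reads back its voxel's feature; fuse both branches
        voxel_feats = grid.gather(2, flat.unsqueeze(1).expand(-1, c, -1))
        return self.point_mlp(feats) + voxel_feats

layer = ToyPVConv(channels=32)
out = layer(torch.randn(2, 32, 1024), torch.rand(2, 1024, 3))    # (2, 32, 1024)
```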

GAN Applications

Anycost GANs for Interactive Image Synthesis and Editing
For interactive natural image editing, Anycost GAN is trained to support elastic resolutions and channels for faster image generation at versatile speeds. Running subsets of the full generator produces outputs that are perceptually similar to the full generator's, making them a good proxy for quick previews.
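The elastic-channel half of this idea can be sketched with a convolution that evaluates only a leading subset of its filters at inference time; this toy layer is not the Anycost GAN generator, just the underlying trick.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ElasticConv(nn.Module):
    """Toy elastic convolution: at inference time, run only the first
    `ratio` fraction of output channels for a faster, lower-cost preview."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, 3, 3) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_ch))

    def forward(self, x, ratio=1.0):
        k = max(1, int(self.weight.size(0) * ratio))
        return F.conv2d(x, self.weight[:k], self.bias[:k], padding=1)

# Preview with a quarter of the channels, then render at full quality:
layer = ElasticConv(3, 64)
preview = layer(torch.randn(1, 3, 32, 32), ratio=0.25)   # (1, 16, 32, 32)
full = layer(torch.randn(1, 3, 32, 32), ratio=1.0)       # (1, 64, 32, 32)
```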

GAN Compression: Efficient Architectures for Interactive Conditional GANs
GAN Compression is a general-purpose compression framework for reducing the inference time and model size of the generator in cGANs.

Differentiable Augmentation for Data-Efficient GAN Training
Differentiable Augmentation (DiffAugment) improves the data efficiency of GANs by imposing various types of differentiable augmentations on both real and fake samples, which effectively stabilizes training and leads to better convergence.

Natural Language Processing Applications

HAT: Hardware-Aware Transformer NAS
The HAT NAS framework incorporates hardware feedback into the neural architecture search loop, providing the most suitable Transformer model for the target hardware platform.

Lite Transformer with Long-Short Range Attention
Lite Transformer facilitates deploying mobile NLP applications on edge devices. The key primitive is the Long-Short Range Attention (LSRA), where one group of heads specializes in the local context modeling (by convolution) while another group specializes in the long-distance relationship modeling (by attention).
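A toy sketch of the split: half of the channels go through self-attention (global context), the other half through a depthwise convolution (local context), and the results are concatenated. The real Lite Transformer block differs in detail (e.g., its convolutional branch and feed-forward design are more specialized).

```python
import torch
import torch.nn as nn

class LSRA(nn.Module):
    """Toy Long-Short Range Attention with two specialized head groups."""
    def __init__(self, dim, num_heads=4, kernel_size=7):
        super().__init__()
        assert (dim // 2) % num_heads == 0
        self.attn = nn.MultiheadAttention(dim // 2, num_heads, batch_first=True)
        self.conv = nn.Conv1d(dim // 2, dim // 2, kernel_size,
                              padding=kernel_size // 2, groups=dim // 2)

    def forward(self, x):                                 # x: (B, T, dim)
        a, b = x.chunk(2, dim=-1)
        a, _ = self.attn(a, a, a)                         # long-range branch
        b = self.conv(b.transpose(1, 2)).transpose(1, 2)  # short-range branch
        return torch.cat([a, b], dim=-1)

block = LSRA(dim=256)
out = block(torch.randn(8, 100, 256))                     # (8, 100, 256)
```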

MicroNet for Efficient Language Modeling
MicroNet experiments with weight pruning and quantization of language models. It reduces the model size by more than 41x and won the NeurIPS 2019 MicroNet challenge for efficient language modeling.

Video Applications

TSM: Temporal Shift Module for Efficient Video Understanding
The Temporal Shift Module (TSM) achieves the performance of 3D CNNs while maintaining 2D CNNs' complexity, by shifting part of the channels along the temporal dimension and thus facilitating information exchange among neighboring frames.
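The shift itself costs zero FLOPs. A minimal sketch in the spirit of the reference implementation follows, assuming the common (N*T, C, H, W) layout and the paper's default of shifting 1/8 of the channels in each temporal direction.

```python
import torch

def temporal_shift(x, n_segment, fold_div=8):
    # x: (N*T, C, H, W) frame-level features, T = n_segment frames per clip
    nt, c, h, w = x.size()
    n = nt // n_segment
    x = x.view(n, n_segment, c, h, w)
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                  # shift 1/8 of channels backward in time
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]  # shift 1/8 forward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # remaining channels stay in place
    return out.view(nt, c, h, w)

# Hypothetical usage inside a 2D-CNN block operating on 8-frame clips:
feats = torch.randn(16, 64, 14, 14)       # batch of 2 clips x 8 frames
shifted = temporal_shift(feats, n_segment=8)
```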

Efficient Deep Learning Systems

The approaches introduced above are top-down solutions for efficient deep learning processing, where performance gains come from algorithmic optimization. However, these theoretical benefits (e.g., in FLOPs and model size) cannot be easily converted into real improvements in measured speedup and energy efficiency, so specialized software/hardware systems are required to bridge the gap. At the same time, these specialized systems open up a new design space orthogonal to the algorithm space, bringing opportunities for holistic optimization that jointly considers the algorithm and the software/hardware system.

Efficient Software Libraries

IOS: Inter-Operator Scheduler for CNN Acceleration
Beyond the intra-operator parallelism exploited by existing deep learning frameworks such as TVM, the Inter-Operator Scheduler (IOS) automatically schedules the parallel execution of multiple operators through a novel dynamic programming algorithm.
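To give the flavor of the dynamic program, the toy below partitions a set of independent operators into parallel stages to minimize total latency. The operators, latencies, and stage cost model are invented for illustration, and real IOS additionally respects inter-operator dependencies and prunes the exponential search.

```python
from functools import lru_cache

# Made-up independent operators with made-up latencies (ms)
ops = {"convA": 3.0, "convB": 2.0, "convC": 2.5}

def stage_latency(stage):
    # Hypothetical cost model: a parallel stage is faster than running its
    # ops sequentially, but slower than its slowest op alone
    lats = [ops[o] for o in stage]
    return max(lats) + 0.2 * (sum(lats) - max(lats))

@lru_cache(maxsize=None)
def best_schedule(remaining):
    """Minimum latency (and stage plan) to execute `remaining` operators."""
    if not remaining:
        return 0.0, ()
    best = (float("inf"), ())
    rem = list(remaining)
    for mask in range(1, 1 << len(rem)):          # every possible next stage
        stage = tuple(rem[i] for i in range(len(rem)) if mask >> i & 1)
        rest = tuple(o for o in rem if o not in stage)
        cost, plan = best_schedule(rest)
        cost += stage_latency(stage)
        if cost < best[0]:
            best = (cost, (stage,) + plan)
    return best

print(best_schedule(tuple(sorted(ops))))
```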

TinyEngine: Efficient Deep Learning on MCUs
TinyEngine is the lightweight inference engine co-designed with the TinyNAS architecture search in the MCUNet framework; together they enable ImageNet-scale inference on microcontrollers.

TorchSparse: Efficient Deep Learning on Point Clouds
TorchSparse is a high-performance computing library for efficient 3D sparse convolution. This library aims at accelerating sparse computation in 3D, in particular the Sparse Convolution operation.

Efficient Hardware

SpArch: Efficient Architecture for Sparse Matrix Multiplication
SpArch jointly optimizes the data locality for both input and output matrices in generalized sparse matrix-matrix multiplication (SpGEMM) via a streaming-based merger, a condensed matrix representation, and a Huffman tree scheduler.

SpAtten: Efficient Natural Language Processing
SpAtten is an efficient algorithm-architecture co-design that leverages token sparsity, head sparsity, and quantization opportunities to reduce the attention computation and memory access.
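The token-sparsity part can be mimicked in software: rank tokens by the cumulative attention they receive and keep only the most-attended fraction. This toy function illustrates the cascade token pruning idea, not the accelerator itself.

```python
import torch

def cascade_token_pruning(tokens, attn_probs, keep_ratio=0.7):
    """Rank tokens by the cumulative attention they receive and keep only
    the top fraction. tokens: (B, T, C); attn_probs: (B, H, T, T)."""
    importance = attn_probs.sum(dim=(1, 2))               # attention received per token
    k = max(1, int(tokens.size(1) * keep_ratio))
    idx = importance.topk(k, dim=1).indices.sort(dim=1).values  # preserve order
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(2)))

tokens = torch.randn(2, 10, 64)
attn = torch.softmax(torch.randn(2, 4, 10, 10), dim=-1)
pruned = cascade_token_pruning(tokens, attn)              # (2, 7, 64)
```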

AI-based Hardware Design

NAAS: Neural Accelerator Architecture Search
NAAS is a data-driven approach that holistically composes matched accelerator and neural architectures, together with an efficient compiler mapping.

GCN-RL Circuit Designer: Transferable Transistor Sizing with Graph Neural Networks and Reinforcement Learning
GCN-RL Circuit Designer learns on the circuit topology representation with graph convolutional neural networks (GCN), and leverages reinforcement learning (RL) to transfer the knowledge between different technology nodes and topologies.
