MIT HAN Lab

ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware

ICLR 2019

Han Cai, Ligeng Zhu, Song Han

Neural architecture search (NAS) has a great impact by automatically designing effective neural network architectures. However, the prohibitive computational demand of conventional NAS algorithms (e.g. 10^4 GPU hours) makes it difficult to directly search the architectures on large-scale tasks (e.g. ImageNet). Differentiable NAS can reduce the cost of GPU hours via a continuous representation of network architecture but suffers from high GPU memory consumption (which grows linearly w.r.t. candidate set size). As a result, these methods need to utilize proxy tasks, such as training on a smaller dataset, learning with only a few blocks, or training for just a few epochs. Architectures optimized on proxy tasks are not guaranteed to be optimal on the target task. In this paper, we present ProxylessNAS, which can directly learn the architectures for large-scale target tasks and target hardware platforms. We address the high memory consumption issue of differentiable NAS and reduce the computational cost (GPU hours and GPU memory) to the same level as regular training while still allowing a large candidate set. Experiments on CIFAR-10 and ImageNet demonstrate the effectiveness of directness and specialization. On CIFAR-10, our model achieves 2.08% test error with only 5.7M parameters, better than the previous state-of-the-art architecture AmoebaNet-B, while using 6× fewer parameters. On ImageNet, our model achieves 3.1% better top-1 accuracy than MobileNetV2, while being 1.2× faster with measured GPU latency. We also apply ProxylessNAS to specialize neural architectures for hardware with direct hardware metrics (e.g. latency) and provide insights for efficient CNN architecture design.
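The memory argument in the abstract can be illustrated with a toy sketch. It assumes a mixed operation that either evaluates every candidate and keeps all N outputs (memory linear in the candidate set size) or samples a single candidate by its probability and keeps one output. The abstract does not spell out the mechanism, so the function names, the toy list-based "operations", and the sampling rule below are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def softmax(logits):
    """Turn architecture parameters into path probabilities."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_all_paths(x, candidates, alphas):
    """Weighted sum over every candidate: N outputs are held at once."""
    probs = softmax(alphas)
    outputs = [op(x) for op in candidates]
    mixed = [sum(p * out[i] for p, out in zip(probs, outputs))
             for i in range(len(x))]
    return mixed, len(outputs)


def mixed_single_path(x, candidates, alphas, rng):
    """Sample one candidate by probability: a single output is held."""
    probs = softmax(alphas)
    index = rng.choices(range(len(candidates)), weights=probs, k=1)[0]
    return candidates[index](x), 1


# Hypothetical toy candidates standing in for conv / pooling / skip ops.
candidates = [
    lambda v: list(v),
    lambda v: [2.0 * a for a in v],
    lambda v: [0.0 for _ in v],
]
alphas = [0.0, 0.0, 0.0]
x = [1.0, 2.0]

full_out, full_live = mixed_all_paths(x, candidates, alphas)
path_out, path_live = mixed_single_path(x, candidates, alphas, random.Random(0))
print(full_live, path_live)
```

With three candidates the first variant holds three outputs and the second holds one, which is the gap the abstract describes as growing with the candidate set.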

ProxylessNAS is an efficient hardware-aware neural architecture search method that can search directly on large-scale datasets. It can design specialized neural network architectures for different hardware platforms. At >74.5% top-1 accuracy, ProxylessNAS runs 1.8x faster than MobileNetV2.

Deep Gradient Compression: Reducing the Communication Bandwidth in Distributed Training

ICLR 2018

Yujun Lin¹, Song Han², Huizi Mao², Yu Wang¹, William J. Dally²³

Efficient AI Hardware & System

Efficient AI Algorithm

Training

Pruning & Sparsity

Large-scale distributed training requires significant communication bandwidth for gradient exchange, which limits the scalability of multi-node training and requires expensive high-bandwidth network infrastructure. The situation gets even worse with distributed training on mobile devices (federated learning), which suffers from higher latency, lower throughput, and intermittent poor connections. In this paper, we find 99.9% of the gradient exchange in distributed SGD is redundant, and propose Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth. To preserve accuracy during compression, DGC employs four methods: momentum correction, local gradient clipping, momentum factor masking, and warm-up training. We have applied Deep Gradient Compression to image classification, speech recognition, and language modeling with multiple datasets including CIFAR-10, ImageNet, Penn Treebank, and Librispeech Corpus. In these scenarios, Deep Gradient Compression achieves a gradient compression ratio from 270x to 600x without losing accuracy, cutting the gradient size of ResNet-50 from 97MB to 0.35MB, and for DeepSpeech from 488MB to 0.74MB. Deep Gradient Compression enables large-scale distributed training on inexpensive commodity 1Gbps Ethernet and facilitates distributed training on mobile.
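To make the idea of dropping redundant gradient exchange concrete, here is a minimal sketch of magnitude-based top-k sparsification with local residual accumulation. This is an assumption about the general approach: the abstract does not describe the selection rule, and the sketch deliberately leaves out the four accuracy-preserving methods it lists (momentum correction, local gradient clipping, momentum factor masking, warm-up training). The function name and `keep_ratio` parameter are hypothetical.

```python
def sparsify_with_residual(grad, residual, keep_ratio):
    """Send only the largest-magnitude entries; keep the rest locally."""
    # Add what was withheld in earlier steps to the fresh gradient.
    accumulated = [g + r for g, r in zip(grad, residual)]
    k = max(1, int(len(accumulated) * keep_ratio))
    # Indices of the k entries with the largest accumulated magnitude.
    top = sorted(range(len(accumulated)),
                 key=lambda i: abs(accumulated[i]),
                 reverse=True)[:k]
    sparse = {i: accumulated[i] for i in top}
    # Entries that were sent are cleared; the rest wait for a later step.
    new_residual = [0.0 if i in sparse else v
                    for i, v in enumerate(accumulated)]
    return sparse, new_residual


residual = [0.0, 0.0, 0.0, 0.0]
sent_1, residual = sparsify_with_residual([0.1, -5.0, 0.2, 0.3], residual, 0.25)
sent_2, residual = sparsify_with_residual([0.1, 0.0, 0.2, 0.3], residual, 0.25)
print(sent_1, sent_2)
```

In this toy run each step transmits one of four values. Small entries are not lost: they accumulate in `residual` until they become large enough to be selected, which is why the second step sends index 3.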

Deep Gradient Compression (DGC) reduces the communication bandwidth in large-scale distributed training via four techniques: momentum correction, local gradient clipping, momentum factor masking, and warm-up training.

Learning to Design Circuits

NIPS 2019 MLSys Workshop

Hanrui Wang, Jiacheng Yang, Hae-Seung Lee, Song Han

Efficient AI Hardware & System

Efficient AI Algorithm

TinyML

ML for Hardware & System

New Architecture

Analog IC design relies on human experts to search for parameters that satisfy circuit specifications with their experience and intuitions, which is highly labor intensive, time consuming and suboptimal. Machine learning is a promising tool to automate this process. However, supervised learning is difficult for this task due to the low availability of training data: 1) Circuit simulation is slow, thus generating a large-scale dataset is time-consuming; 2) Most circuit designs are proprietary IP within individual IC companies, making it expensive to collect large-scale datasets. We propose Learning to Design Circuits (L2DC) to leverage reinforcement learning that learns to efficiently generate new circuit data and to optimize circuits. We fix the schematic, and optimize the parameters of the transistors automatically by training an RL agent with no prior knowledge about optimizing circuits. After iteratively getting observations, generating a new set of transistor parameters, getting a reward, and adjusting the model, L2DC is able to optimize circuits. We evaluate L2DC on two transimpedance amplifiers. Trained for a day, our RL agent can achieve comparable or better performance than human experts trained for a quarter. It first learns to meet hard constraints (e.g. gain, bandwidth), and then learns to optimize good-to-have targets (e.g. area, power). Compared with grid search-aided human design, L2DC can achieve higher sample efficiency with comparable performance. Under the same runtime constraint, the performance of L2DC is also better than Bayesian Optimization.
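One way an agent could end up meeting hard constraints before optimizing good-to-have targets is a two-stage reward. The sketch below is an assumption for illustration only: the abstract does not give the reward used by L2DC, and the function name, metric names, thresholds, and budgets are all hypothetical, with arbitrary units.

```python
def shaped_reward(metrics, hard_min, soft_budget):
    """Penalize unmet hard constraints first; reward soft targets after."""
    # Relative shortfall on each hard constraint (larger is better).
    shortfall = sum(
        max(0.0, (floor - metrics[name]) / floor)
        for name, floor in hard_min.items()
    )
    if shortfall > 0.0:
        return -shortfall
    # Soft targets are smaller-is-better; reward savings against a budget.
    return sum(
        1.0 - metrics[name] / budget
        for name, budget in soft_budget.items()
    )


# Hypothetical specification values, not taken from the paper.
hard_min = {"gain": 100.0, "bandwidth": 50.0}
soft_budget = {"power": 10.0, "area": 4.0}

failing = {"gain": 80.0, "bandwidth": 60.0, "power": 5.0, "area": 2.0}
passing = {"gain": 120.0, "bandwidth": 60.0, "power": 5.0, "area": 2.0}
leaner = {"gain": 120.0, "bandwidth": 60.0, "power": 2.5, "area": 2.0}

r_fail = shaped_reward(failing, hard_min, soft_budget)
r_pass = shaped_reward(passing, hard_min, soft_budget)
r_lean = shaped_reward(leaner, hard_min, soft_budget)
print(r_fail, r_pass, r_lean)
```

Under this shaping, any design that misses a hard constraint scores below zero regardless of its power or area, so improving the soft targets only pays off once gain and bandwidth are satisfied.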

NIPS 2019 MLSys Workshop (oral)

We develop a reinforcement learning framework for analog circuit design.

AMC: AutoML for Model Compression and Acceleration on Mobile Devices

ECCV 2018

Yihui He*, Ji Lin*, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han

Efficient AI Algorithm

TinyML

Pruning & Sparsity

Neural Architecture Search (NAS)

Model compression is a critical technique to efficiently deploy neural network models on mobile devices, which have limited computation resources and tight power budgets. Conventional model compression techniques rely on hand-crafted heuristics and rule-based policies that require domain experts to explore the large design space trading off among model size, speed, and accuracy, which is usually sub-optimal and time-consuming. In this paper, we propose AutoML for Model Compression (AMC), which leverages reinforcement learning to provide the model compression policy. This learning-based compression policy outperforms conventional rule-based compression policy by achieving a higher compression ratio, better preserving the accuracy, and freeing human labor. Under 4x FLOPs reduction, we achieved 2.7% better accuracy than the hand-crafted model compression policy for VGG-16 on ImageNet. We applied this automated, push-the-button compression pipeline to MobileNet and achieved 1.81x speedup of measured inference latency on an Android phone and 1.43x speedup on the Titan XP GPU, with only 0.1% loss of ImageNet Top-1 accuracy.
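To show how a compression policy maps to a FLOPs reduction figure such as the 4x quoted above, the sketch below counts multiply-accumulates for a chain of convolution layers before and after channel pruning. It assumes the policy takes the form of one channel keep ratio per layer and that each layer's input channels are the previous layer's kept output channels. The layer shapes and ratios are hypothetical and do not correspond to VGG-16 or MobileNet.

```python
def conv_macs(c_in, c_out, kernel, height, width):
    """Multiply-accumulates of one convolution on a height x width output."""
    return c_in * c_out * kernel * kernel * height * width


def chain_macs(layers, keep_ratios):
    """Total MACs for a sequential chain after pruning output channels."""
    total = 0
    kept_in = layers[0][0]  # the network input is never pruned
    for (c_in, c_out, kernel, height, width), ratio in zip(layers, keep_ratios):
        kept_out = round(c_out * ratio)
        total += conv_macs(kept_in, kept_out, kernel, height, width)
        kept_in = kept_out  # next layer sees only the kept channels
    return total


# Hypothetical layers: (c_in, c_out, kernel, out_height, out_width).
layers = [(3, 64, 3, 32, 32), (64, 128, 3, 16, 16), (128, 128, 3, 16, 16)]

dense = chain_macs(layers, [1.0, 1.0, 1.0])
pruned = chain_macs(layers, [0.5, 0.5, 1.0])
print(dense, pruned, round(dense / pruned, 2))
```

Pruning a layer's outputs also shrinks the next layer's inputs, so halving two layers here cuts total cost by more than half. A search over `keep_ratios` under a FLOPs budget is the kind of design space the abstract says is tedious to explore by hand.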

AutoML for Model Compression (AMC) leverages reinforcement learning to provide the model compression policy. This learning-based compression policy outperforms conventional rule-based compression policy by achieving a higher compression ratio, better preserving the accuracy, and freeing human labor.

Efficient AI Computing,
Transforming the Future.

Latest Blog Posts

SVDQuant Meets NVFP4: 4× Smaller and 3× Faster FLUX with 16-bit Quality on NVIDIA Blackwell GPUs

RTX 5090 Workstation Configuration Journey

TinyChat 2.0: Accelerating Edge AI with Efficient LLM and VLM Deployment

SVDQuant: Accurate 4-Bit Quantization Powers 12B FLUX on a 16GB 4090 Laptop with 3x Speedup

Block Sparse Attention

Patch Conv: Patch Convolution to Avoid Large GPU Memory Usage of Conv2D
