Projects

APQ: Joint Search for Nerwork Architecture, Pruning and Quantization Policy

CVPR 2020

Tianzhe Wang, Kuan Wang, Han Cai, Ji Lin, Zhijian Liu, Hanrui Wang, Yujun Lin, Song Han

We present APQ for efficient deep learning inference on resource-constrained hardware. Unlike previous methods that separately search the neural architecture, pruning policy, and quantization policy, we optimize them in a joint manner. To deal with the larger design space it brings, a promising approach is to train a quantization-aware accuracy predictor to quickly get the accuracy of the quantized model and feed it to the search engine to select the best fit. However, training this quantization-aware accuracy predictor requires collecting a large number of quantized <model, accuracy> pairs, which involves quantization-aware finetuning and thus is highly time-consuming. To tackle this challenge, we propose to transfer the knowledge from a full-precision (i.e., fp32) accuracy predictor to the quantization-aware (i.e., int8) accuracy predictor, which greatly improves the sample efficiency. Besides, collecting the dataset for the fp32 accuracy predictor only requires to evaluate neural networks without any training cost by sampling from a pretrained once-for-all network, which is highly efficient. Extensive experiments on ImageNet demonstrate the benefits of our joint optimization approach. With the same accuracy, APQ reduces the latency/energy by 2x/1.3x over MobileNetV2+HAQ. Compared to the separate optimization approach (ProxylessNAS+AMC+HAQ), APQ achieves 2.3% higher ImageNet accuracy while reducing orders of magnitude GPU hours and CO2 emission, pushing the frontier for green AI that is environmental-friendly. The code and video are publicly available.

More Close

APQ: Joint Search for Nerwork Architecture, Pruning and Quantization Policy

CVPR 2020

(

)

APQ is an efficient AutoML framework for joint optimization of neural architecture, pruning, and quantization.

GAN Compression: Efficient Architectures for Interactive Conditional GANs

CVPR 2020 & TPAMI

Muyang Li¹, Ji Lin², Yaoyao Ding³, Zhijian Liu², Jun-Yan Zhu¹ and Song Han²

Conditional Generative Adversarial Networks (cGANs) have enabled controllable image synthesis for many computer vision and graphics applications. However, recent cGANs are 1-2 orders of magnitude more computationally-intensive than modern recognition CNNs. For example, GauGAN consumes 281G MACs per image, compared to 0.44G MACs for MobileNet-v3, making it difficult for interactive deployment. In this work, we propose a general-purpose compression framework for reducing the inference time and model size of the generator in cGANs. Directly applying existing CNNs compression methods yields poor performance due to the difficulty of GAN training and the differences in generator architectures. We address these challenges in two ways. First, to stabilize the GAN training, we transfer knowledge of multiple intermediate representations of the original model to its compressed model, and unify unpaired and paired learning. Second, instead of reusing existing CNN designs, our method automatically finds efficient architectures via neural architecture search (NAS). To accelerate the search process, we decouple the model training and architecture search via weight sharing. Experiments demonstrate the effectiveness of our method across different supervision settings (paired and unpaired), model architectures, and learning methods (e.g., pix2pix, GauGAN, CycleGAN). Without losing image quality, we reduce the computation of CycleGAN by more than 20× and GauGAN by 9×, paving the way for interactive image synthesis.

More Close

GAN Compression: Efficient Architectures for Interactive Conditional GANs

CVPR 2020 & TPAMI

(

)

A general-purpose compression framework for reducing the inference time and model size of the generator in conditional GANs.

SpArch: Efficient Architecture for Sparse Matrix Multiplication

HPCA 2020

Zhekai Zhang*, Hanrui Wang*, Song Han, William J. Dally

Generalized Sparse Matrix-Matrix Multiplication (SpGEMM) is a ubiquitous task in various engineering and scientific applications. However, inner-product-based SpGEMM introduces redundant input fetches for mismatched nonzero operands, while outer-product-based approach suffers from poor output locality due to numerous partial product matrices. Inefficiency in reuse of either inputs or outputs data leads to extensive and expensive DRAM access. To address this problem, we propose an efficient sparse matrix multiplication accelerator architecture, SpArch, which jointly optimizes the data locality for both input and output matrices. We first design a highly parallelized streaming-based merger to pipeline the multiply and merge stage of partial matrices so that partial matrices are merged on chip immediately after produced. We then propose a condensed matrix representation that reduces the number of partial matrices by three orders of magnitude and thus reduces DRAM access by 5.4x. We further develop a Huffman tree scheduler to improve the scalability of merger for larger sparse matrix, which reduces the memory accesses by another 1.8x. We also resolve the increased input matrix read induced by the new representation using a row prefetcher with near-optimal buffer replacement policy, further reducing the memory accesses by 1.5x. Evaluated on 20 benchmarks, SpArch reduces the total memory access by 2.8x over previous state-of-the-art. On average, SpArch achieves 4x, 19x, 18x, 17x, 1285x speedup and 6x, 164x, 435x, 307x, 62x energy savings over OuterSPACE, MKL, cuSPARSE, CUSP and ARM Armadillo respectively.

More Close

SpArch: Efficient Architecture for Sparse Matrix Multiplication

HPCA 2020

(

)

Hardware Accelerator for Sparse Matrix-Matrix Multiplication (SpGEMM)

Lite Transformer with Long-Short Range Attention

ICLR 2020

Zhanghao Wu*, Zhijian Liu*, Ji Lin, Yujun Lin, Song Han

Transformer has become ubiquitous in natural language processing (e.g., machine translation, question answering); however, it requires enormous amount of computations to achieve high performance, which makes it not suitable for mobile applications that are tightly constrained by the hardware resources and battery. In this paper, we present an efficient mobile NLP architecture, Lite Transformer to facilitate deploying mobile NLP applications on edge devices. The key primitive is the Long-Short Range Attention (LSRA), where one group of heads specializes in the local context modeling (by convolution) while another group specializes in the long-distance relationship modeling (by attention). Such specialization brings consistent improvement over the vanilla transformer on three well-established language tasks: machine translation, abstractive summarization, and language modeling. Under constrained resources (500M/100M MACs), Lite Transformer outperforms transformer on WMT'14 English-French by 1.2/1.7 BLEU, respectively. Lite Transformer reduces the computation of transformer base model by 2.5x with 0.3 BLEU score degradation. Combining with pruning and quantization, we further compressed the model size of Lite Transformer by 18.2x. For language modeling, Lite Transformer achieves 1.8 lower perplexity than the transformer at around 500M MACs. Notably, Lite Transformer outperforms the AutoML-based Evolved Transformer by 0.5 higher BLEU for the mobile NLP setting without the costly architecture search that requires more than 250 GPU years. Code has been made available at https://github.com/mit-han-lab/lite-transformer.

More Close

Lite Transformer with Long-Short Range Attention

ICLR 2020

(

)

Lite Transformer is an efficient mobile NLP architecture. The key primitive is the Long-Short Range Attention (LSRA), where one group of heads specializes in the local context modeling (by convolution) while another group specializes in the long-distance relationship modeling (by attention).

Efficient AI Computing,Transforming the Future.

APQ: Joint Search for Nerwork Architecture, Pruning and Quantization Policy

APQ: Joint Search for Nerwork Architecture, Pruning and Quantization Policy

GAN Compression: Efficient Architectures for Interactive Conditional GANs

GAN Compression: Efficient Architectures for Interactive Conditional GANs

SpArch: Efficient Architecture for Sparse Matrix Multiplication

SpArch: Efficient Architecture for Sparse Matrix Multiplication

Lite Transformer with Long-Short Range Attention

Lite Transformer with Long-Short Range Attention

Efficient AI Computing,
Transforming the Future.