Projects

Generalized Sparse Matrix-Matrix Multiplication (SpGEMM) is a ubiquitous task in engineering and scientific applications. However, inner-product-based SpGEMM introduces redundant input fetches for mismatched nonzero operands, while the outer-product-based approach suffers from poor output locality due to its numerous partial product matrices. Inefficient reuse of either input or output data leads to extensive and expensive DRAM access. To address this problem, we propose SpArch, an efficient sparse matrix multiplication accelerator architecture that jointly optimizes data locality for both input and output matrices. We first design a highly parallelized streaming-based merger that pipelines the multiply and merge stages, so that partial matrices are merged on chip immediately after they are produced. We then propose a condensed matrix representation that reduces the number of partial matrices by three orders of magnitude and thus reduces DRAM access by 5.4x. We further develop a Huffman tree scheduler to improve the scalability of the merger for larger sparse matrices, which reduces memory accesses by another 1.8x. We also resolve the increased input matrix reads induced by the new representation using a row prefetcher with a near-optimal buffer replacement policy, further reducing memory accesses by 1.5x. Evaluated on 20 benchmarks, SpArch reduces total memory access by 2.8x over the previous state of the art. On average, SpArch achieves 4x, 19x, 18x, 17x, and 1285x speedups and 6x, 164x, 435x, 307x, and 62x energy savings over OuterSPACE, MKL, cuSPARSE, CUSP, and ARM Armadillo, respectively.
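As a rough software illustration of the outer-product formulation described above, the sketch below multiplies two small sparse matrices by forming one partial product matrix per shared index k and then merging them. The dictionary-based storage and the function name `outer_product_spgemm` are assumptions made for this example; they do not reflect SpArch's hardware data layout or its condensed representation.

```python
from collections import defaultdict


def outer_product_spgemm(a_cols, b_rows):
    """Multiply sparse A and B via outer products (illustrative only).

    a_cols[k] maps row index i -> A[i, k] (column k of A).
    b_rows[k] maps column index j -> B[k, j] (row k of B).
    Returns (C as {(i, j): value}, number of nonempty partial matrices).
    """
    partials = []
    for k in a_cols.keys() & b_rows.keys():
        # Column k of A times row k of B yields one partial product matrix.
        partial = {
            (i, j): a * b
            for i, a in a_cols[k].items()
            for j, b in b_rows[k].items()
        }
        if partial:
            partials.append(partial)

    # Merge stage: sum entries that share an output coordinate.
    c = defaultdict(float)
    for partial in partials:
        for coord, value in partial.items():
            c[coord] += value
    return dict(c), len(partials)


# A = [[1, 0], [2, 3]], B = [[0, 4], [5, 0]]
a_cols = {0: {0: 1, 1: 2}, 1: {1: 3}}
b_rows = {0: {1: 4}, 1: {0: 5}}
c, num_partials = outer_product_spgemm(a_cols, b_rows)
```

Every input nonzero is read once, but the number of partial matrices grows with the number of shared indices, which is the output-locality problem the abstract refers to.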

SpArch: Efficient Architecture for Sparse Matrix Multiplication

HPCA 2020


Hardware Accelerator for Sparse Matrix-Matrix Multiplication (SpGEMM)
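The intuition behind the Huffman tree scheduler mentioned in the SpArch abstract can be sketched with a simplified cost model. The example below assumes a two-way merger and assumes that a merged matrix has as many nonzeros as its inputs combined (no overlapping coordinates); both are simplifications for illustration, and the function names are hypothetical. Under that model, merging the smallest partial matrices first keeps large intermediate results from being rewritten repeatedly.

```python
import heapq


def huffman_merge_cost(sizes):
    """Total elements written when always merging the two smallest matrices."""
    heap = list(sizes)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        # Assumed cost model: merged size is the sum of the two input sizes.
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total


def sequential_merge_cost(sizes):
    """Same cost model, but merging strictly in the given order."""
    total = 0
    acc = 0
    for index, size in enumerate(sizes):
        acc += size
        if index > 0:
            total += acc  # each merge rewrites the whole accumulated result
    return total
```

For partial matrix sizes `[100, 1, 1, 1]`, the smallest-first order writes 108 elements in total, while merging in the given order writes 306.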

Lite Transformer with Long-Short Range Attention

ICLR 2020


Lite Transformer is an efficient mobile NLP architecture. Its key primitive is Long-Short Range Attention (LSRA), in which one group of heads specializes in local context modeling (by convolution) while another group specializes in long-distance relationship modeling (by attention).
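A minimal sketch of the two-branch idea follows, under several assumptions: the channels are split evenly in half, the attention branch is a single unprojected scaled dot-product self-attention, and the convolution branch uses one fixed kernel shared by all channels. The actual model uses learned projections and learned convolution weights; the names `attention_branch`, `conv_branch`, and `lsra` are hypothetical.

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_branch(x):
    """Global branch: every position attends to every other position."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, x)) for c in range(d)])
    return out


def conv_branch(x, kernel):
    """Local branch: per-channel 1D convolution with zero padding."""
    n, d, half = len(x), len(x[0]), len(kernel) // 2
    out = []
    for t in range(n):
        row = []
        for c in range(d):
            acc = 0.0
            for offset, w in enumerate(kernel):
                s = t + offset - half
                if 0 <= s < n:  # positions outside the sequence count as zero
                    acc += w * x[s][c]
            row.append(acc)
        out.append(row)
    return out


def lsra(x, kernel=(0.25, 0.5, 0.25)):
    """Split channels in two, run one branch per half, then concatenate."""
    split = len(x[0]) // 2
    left = [row[:split] for row in x]
    right = [row[split:] for row in x]
    global_out = attention_branch(left)
    local_out = conv_branch(right, kernel)
    return [g + l for g, l in zip(global_out, local_out)]
```

Calling `lsra(x)` on a list of token vectors returns vectors of the same width, with the first half of each carrying global context and the second half carrying a local neighborhood summary.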

Once-for-All: Train One Network and Specialize it for Efficient Deployment

ICLR 2020


OFA is an efficient AutoML technique that decouples model training from architecture search: train only once, then specialize for many hardware platforms, from CPU/GPU to hardware accelerators. OFA achieves a new SOTA 80.0% ImageNet top-1 accuracy under the mobile setting (<600M FLOPs).

HAT: Hardware-Aware Transformers for Efficient Natural Language Processing

ACL 2020


The HAT NAS framework leverages hardware feedback in the neural architecture search loop, providing the most suitable model for the target hardware platform. Results on different hardware platforms and datasets show that HAT-searched models have better accuracy-efficiency trade-offs.

Efficient AI Computing, Transforming the Future.
