SpArch: Efficient Architecture for Sparse Matrix Multiplication

Zhekai Zhang*, Hanrui Wang*, Song Han, William J. Dally
MIT, NVIDIA/Stanford
(* indicates equal contribution)

Abstract

Generalized Sparse Matrix-Matrix Multiplication (SpGEMM) is a ubiquitous task in many engineering and scientific applications. However, inner-product-based SpGEMM introduces redundant input fetches for mismatched nonzero operands, while the outer-product-based approach suffers from poor output locality due to the numerous partial product matrices it creates. Inefficient reuse of either input or output data leads to extensive and expensive DRAM accesses. To address this problem, we propose SpArch, an efficient sparse matrix multiplication accelerator architecture that jointly optimizes the data locality of both the input and output matrices. We first design a highly parallelized streaming-based merger that pipelines the multiply and merge stages, so partial matrices are merged on chip immediately after they are produced. We then propose a condensed matrix representation that reduces the number of partial matrices by three orders of magnitude and thus reduces DRAM access by 5.4x. We further develop a Huffman tree scheduler to improve the scalability of the merger for larger sparse matrices, which reduces memory accesses by another 1.8x. We also resolve the increased input matrix reads induced by the new representation using a row prefetcher with a near-optimal buffer replacement policy, further reducing memory accesses by 1.5x. Evaluated on 20 benchmarks, SpArch reduces total memory access by 2.8x over the previous state-of-the-art. On average, SpArch achieves 4x, 19x, 18x, 17x, and 1285x speedup and 6x, 164x, 435x, 307x, and 62x energy savings over OuterSPACE, MKL, cuSPARSE, CUSP, and ARM Armadillo, respectively.
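
To make the outer-product formulation concrete, here is a minimal software sketch (plain Python with dictionary-based sparse matrices, not SpArch's hardware pipeline): each nonzero column of the first matrix multiplied with the matching row of the second yields one partial product matrix, and the result is obtained by merging all of them, which is exactly the stage SpArch pipelines and keeps on chip.

# Minimal outer-product SpGEMM sketch; dict-of-dicts is a stand-in sparse format.
def outer_product_spgemm(A_cols, B_rows):
    """A_cols[k] = {i: A[i,k]}, B_rows[k] = {j: B[k,j]} for nonzero k."""
    partial_matrices = []
    for k in A_cols:                       # multiply stage: one partial matrix per k
        if k not in B_rows:
            continue
        partial = {}
        for i, a in A_cols[k].items():
            for j, b in B_rows[k].items():
                partial[(i, j)] = a * b
        partial_matrices.append(partial)

    result = {}                            # merge stage: sum all partial matrices
    for partial in partial_matrices:
        for idx, v in partial.items():
            result[idx] = result.get(idx, 0.0) + v
    return result

# Example: A = [[1,0],[0,2]], B = [[0,3],[4,0]]
A_cols = {0: {0: 1.0}, 1: {1: 2.0}}
B_rows = {0: {1: 3.0}, 1: {0: 4.0}}
print(outer_product_spgemm(A_cols, B_rows))   # {(0, 1): 3.0, (1, 0): 8.0}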

Methodology

SpArch jointly optimizes input and output data reuse in outer-product-based sparse matrix multiplication.

Compared to the prior state-of-the-art, OuterSPACE, SpArch reduces DRAM access by 2.8x.
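
The condensed representation can be illustrated with a rough software sketch (the function name and data layout below are illustrative assumptions, not SpArch's exact format): the nonzeros of each row of the left matrix are packed leftward, so condensed column c gathers the c-th nonzero of every row along with its original column index, and only one partial matrix is produced per condensed column instead of per original column.

# Rough sketch of condensing the left matrix into a few dense-ish columns.
def condense(A_rows):
    """A_rows[i] = sorted list of (original_col, value) for row i."""
    condensed = []                      # condensed[c] = list of (row, original_col, value)
    for i, nonzeros in A_rows.items():
        for c, (col, val) in enumerate(nonzeros):
            while len(condensed) <= c:
                condensed.append([])
            condensed[c].append((i, col, val))
    return condensed

# A 3x4 matrix with 5 nonzeros spread over 4 columns condenses to 2 columns,
# so only 2 partial matrices are produced instead of 4.
A_rows = {0: [(0, 1.0), (3, 2.0)], 1: [(1, 4.0)], 2: [(2, 5.0), (3, 6.0)]}
for c, entries in enumerate(condense(A_rows)):
    print("condensed column", c, entries)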

Architecture

SpArch uses a multiway merger to merge the partial matrices in a streaming fashion, together with a row prefetcher that hides DRAM latency and improves input data reuse.
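
As a software analogy for the streaming multiway merger (the heap-based merge below stands in for the hardware merge tree, and the stream layout is an assumption, not SpArch's on-chip format): each partial matrix arrives as a coordinate-sorted stream, and a k-way merge accumulates entries that share an output coordinate, so merged outputs can be written in order without spilling whole partial matrices to DRAM. The Huffman tree scheduler additionally chooses which partial matrices to merge first, favoring the smaller ones.

import heapq

# Software analogy of a k-way streaming merge of coordinate-sorted partial matrices.
def multiway_merge(partial_streams):
    """partial_streams: iterables of ((row, col), value), each sorted by (row, col)."""
    merged = heapq.merge(*partial_streams, key=lambda e: e[0])
    last_key, acc = None, 0.0
    for key, val in merged:
        if key == last_key:
            acc += val                  # same output coordinate: accumulate on the fly
        else:
            if last_key is not None:
                yield last_key, acc     # coordinate finished: emit and move on
            last_key, acc = key, val
    if last_key is not None:
        yield last_key, acc

streams = [
    [((0, 1), 3.0), ((1, 0), 8.0)],
    [((0, 1), 1.0), ((2, 2), 5.0)],
]
print(list(multiway_merge(streams)))    # [((0, 1), 4.0), ((1, 0), 8.0), ((2, 2), 5.0)]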

Results on SuiteSparse and SNAP datasets

SpArch achieves two to three orders of magnitude speedup and several hundred times energy savings over general-purpose hardware platforms.

Video

Citation

@inproceedings{zhang2020sparch,
  title={SpArch: Efficient Architecture for Sparse Matrix Multiplication},
  author={Zhang, Zhekai and Wang, Hanrui and Han, Song and Dally, William J.},
  booktitle={2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)},
  pages={261--274},
  year={2020},
  organization={IEEE}
}

Acknowledgment

Part of this work was supported by NSF HDR Award #1934700 and the DARPA SDH program. We thank the MIT Data Science and AI Lab (DSAIL) for supporting this research.

Team Members