TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs

Haotian Tang*¹, Shang Yang*¹², Zhijian Liu¹, Ke Hong², Zhongming Yu³, Xiuyu Li⁴, Guohao Dai⁵, Yu Wang², Song Han¹
Massachusetts Institute of Technology¹, Tsinghua University², UCSD³, UC Berkeley⁴, Shanghai Jiao Tong University⁵
(* indicates equal contribution)

Abstract

Point cloud computation has become an important workload in AR/VR and autonomous driving (ADAS). It involves sparse and irregular computation patterns that require specialized high-performance kernels. Existing GPU libraries offer two dataflow types for sparse point cloud convolution. The gather-GEMM-scatter dataflow is easy to implement but not optimal in performance, while dataflows that overlap computation with memory access (e.g., implicit GEMM) are highly performant but carry very high engineering costs. In this work, we introduce TorchSparse++, a new GPU library that achieves the best of both worlds. We create a highly efficient Sparse Kernel Generator that generates performant sparse point cloud convolution kernels at less than one-tenth of the engineering cost of the current state-of-the-art system. On top of this, we design the Sparse Autotuner, which extends the design space of existing point cloud libraries and searches for the best dataflow configurations for training and inference workloads. Consequently, TorchSparse++ achieves 2.9x, 3.3x, 2.2x, and 1.7x measured end-to-end speedup on an NVIDIA A100 GPU over the state-of-the-art MinkowskiEngine, SpConv 1.2, TorchSparse, and SpConv v2 in inference, and is 1.2-1.3x faster than SpConv v2 in mixed-precision training.

Project Resources

TorchSparse (MLSys 2022): [Paper] [Slides] [Video] [Code]

TorchSparse++ (MICRO 2023): [Paper] [Slides] [Poster] [Code] [Video]

Project Overview

Sparse convolution plays a crucial role in a variety of cutting-edge applications, including augmented/virtual reality (AR/VR), autonomous driving, and recommendation systems. 

In contrast to dense convolution, sparse convolution skips both computation and storage for zero activations. This allows it to efficiently handle input scenes with much larger spatial resolutions, but it also introduces system optimization challenges.

Both matrix multiplication and data movement account for a significant share of sparse convolution's runtime. In TorchSparse++, we improve the efficiency of sparse convolution on GPUs through pipelining and autotuning.
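For concreteness, the simpler of the two dataflows, gather-GEMM-scatter, can be written in a few lines of PyTorch. The sketch below is illustrative only; the kernel-map layout and all names are our assumptions, not TorchSparse++ internals. For each kernel offset it gathers the features of matched input points, multiplies them by that offset's weight matrix, and scatter-accumulates the partial sums into the output.

import torch

def sparse_conv_gather_gemm_scatter(feats, weights, kernel_map, n_out):
    # feats:      (N_in, C_in) features of the nonzero input points
    # weights:    (K, C_in, C_out), one weight matrix per kernel offset
    # kernel_map: list of (in_idx, out_idx) LongTensor pairs, one per offset;
    #             in_idx[i] contributes to out_idx[i] under that offset
    out = feats.new_zeros(n_out, weights.shape[-1])
    for k, (in_idx, out_idx) in enumerate(kernel_map):
        if in_idx.numel() == 0:
            continue                          # this offset matched no pairs
        gathered = feats[in_idx]              # gather:  (M_k, C_in)
        partial = gathered @ weights[k]       # GEMM:    (M_k, C_out)
        out.index_add_(0, out_idx, partial)   # scatter-accumulate
    return out

# Example with made-up shapes: 5 input points, 3 offsets, C_in=8, C_out=16.
feats = torch.randn(5, 8)
weights = torch.randn(3, 8, 16)
kernel_map = [
    (torch.tensor([0, 2, 4]), torch.tensor([0, 1, 2])),
    (torch.tensor([1, 3]), torch.tensor([0, 2])),
    (torch.tensor([], dtype=torch.long), torch.tensor([], dtype=torch.long)),
]
out = sparse_conv_gather_gemm_scatter(feats, weights, kernel_map, n_out=3)

Each of the three stages launches separate kernels and round-trips through global memory, which is precisely the inefficiency that pipelined dataflows such as implicit GEMM eliminate.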


Sparse Kernel Generator

Our key insight is that pipelined sparse convolution kernels deviate from conventional dense GEMM kernels by just a single pointer. This observation lets us generate performant sparse convolution kernels with established tensor compilers at minimal engineering overhead.
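To make the single-pointer observation concrete, consider the heavily simplified row loops below (hypothetical code, not generator output): the dense GEMM reads input row i directly, while the sparse variant reads row row_map[i] through the kernel map. Everything else, tiling, accumulation, and the epilogue, can be shared, which is why a dense GEMM template from an existing tensor compiler can be repurposed to emit sparse convolution kernels.

import torch

def dense_gemm_rows(A, B):
    # Dense GEMM: output row i reads input row i directly.
    out = A.new_empty(A.shape[0], B.shape[1])
    for i in range(A.shape[0]):
        out[i] = A[i] @ B              # direct row load
    return out

def sparse_gemm_rows(A, B, row_map):
    # Sparse-conv-style GEMM: identical loop, except the input row
    # address is indirected through the kernel map (the "one pointer").
    out = A.new_empty(row_map.shape[0], B.shape[1])
    for i in range(row_map.shape[0]):
        out[i] = A[row_map[i]] @ B     # indirect row load via the map
    return out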


Sparse Autotuner

Our study reveals a trade-off between computation redundancy and computation regularity in the design of pipelined sparse convolution kernels. Devices with lower peak computational throughput favor dataflows with less redundant computation, so on such GPUs it pays to partition the computation into more segments. High-end devices, in contrast, are bottlenecked by operations that execute exclusively on the slower CUDA cores, so a dataflow configuration that maximizes computation regularity is the better choice for them.
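As a toy illustration of this trade-off (the numbers and the grouping scheme below are our own, not from the paper): a grouped dataflow pads every per-offset GEMM segment in a group up to the largest segment in that group. Fewer groups yield larger, more regular GEMMs but more padding (redundant computation); more groups eliminate padding at the cost of many small, irregular GEMMs.

# Toy model of the redundancy/regularity trade-off (illustrative only).
# Each kernel offset k matched m_k input/output pairs; a grouped dataflow
# pads every segment in a group up to the group's largest segment.
matched = [4096, 3500, 3300, 900, 850, 120, 100, 90, 64]  # made-up m_k values

def padded_workload(matched, num_groups):
    # FLOP proxy: sum over groups of (max segment length) * (group size).
    sorted_m = sorted(matched, reverse=True)
    per_group = len(sorted_m) // num_groups + (len(sorted_m) % num_groups > 0)
    total = 0
    for g in range(0, len(sorted_m), per_group):
        seg = sorted_m[g:g + per_group]
        total += max(seg) * len(seg)   # every segment padded to the max
    return total

exact = sum(matched)                   # one group per offset: no redundancy
for n in (1, 3, len(matched)):
    w = padded_workload(matched, n)
    print(f"{n} group(s): {w} rows computed, {w / exact:.2f}x redundancy")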

We further design a group-based autotuner to encourage execution configurations that minimize mapping overhead. 
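A minimal sketch of the idea, with entirely hypothetical names and API: layers are bucketed by workload signature, each candidate dataflow configuration is benchmarked once per bucket rather than once per layer, and the winning configuration is cached and reused.

import time
import torch

def _sync():
    # Make GPU timing meaningful; no-op on CPU-only machines.
    if torch.cuda.is_available():
        torch.cuda.synchronize()

def autotune_by_group(layers, candidate_configs, run_layer, group_key):
    # Group-based autotuning sketch (hypothetical API, not TorchSparse++'s).
    #   layers:            iterable of layer descriptors
    #   candidate_configs: dataflow configs to try (split counts, tile sizes)
    #   run_layer:         callable(layer, config) executing one layer once
    #   group_key:         callable(layer) mapping a layer to a workload group
    best = {}
    for layer in layers:
        group = group_key(layer)
        if group in best:
            continue  # reuse the decision for every layer in the group
        best_cfg, best_time = None, float("inf")
        for cfg in candidate_configs:
            run_layer(layer, cfg)  # warm-up run (JIT, caches, kernel maps)
            _sync()
            start = time.perf_counter()
            run_layer(layer, cfg)
            _sync()
            elapsed = time.perf_counter() - start
            if elapsed < best_time:
                best_cfg, best_time = cfg, elapsed
        best[group] = best_cfg
    return best

Tuning per group rather than per layer keeps the search cheap and encourages neighboring layers to share an execution configuration, which is what keeps the mapping overhead low.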

Results

TorchSparse++ achieves up to 1.7x inference speedup and 1.3x training speedup over SpConv v2, the previous state-of-the-art system for the sparse convolution operator. It also compares favorably with PointAcc, the point cloud accelerator ASIC our group presented at MICRO 2021.


Citation

@inproceedings{tangandyang2023torchsparse,
  title={TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs},
  author={Tang*, Haotian and Yang*, Shang and Liu, Zhijian and Hong, Ke and Yu, Zhongming and Li, Xiuyu and Dai, Guohao and Wang, Yu and Han, Song},
  booktitle={IEEE/ACM International Symposium on Microarchitecture (MICRO)},
  year={2023}
}
