Projects

SpAtten-Chip: A Fully-Integrated Energy-Scalable Transformer Accelerator Supporting Adaptive Model Configuration and Word Elimination for Language Understanding on Edge Devices

ISLPED 2023

Zexi Ji, Hanrui Wang, Miaorong Wang, Win-San Khwa, Meng-Fan Chang, Song Han, Anantha P Chandrakasan

Efficient natural language processing on the edge is needed to interpret voice commands, which have become a standard way to interact with devices around us. Due to the tight power and compute constraints of edge devices, it is important to adapt the computation to the hardware conditions. We present a Transformer accelerator with a variable-depth adder tree to support different model dimensions, a SuperTransformer model from which Sub Transformers of various sizes can be sampled, enabling adaptive model configuration, and a dedicated word elimination unit to prune redundant tokens. We achieve up to 6.9× scalability in network latency and energy between the largest and smallest Sub Transformers under the same operating conditions. Word elimination can reduce network energy by 16%, with a 14.5% drop in F1 score. At 0.68V and 80MHz, processing a 32-length input with our custom 2-layer Transformer model for intent detection and slot filling takes 0.61ms and 1.6μJ.

We design and tape out the SpAtten architecture as a digital chip in TSMC 28nm technology.

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

ICML 2023

Guangxuan Xiao*¹, Ji Lin*¹, Mickael Seznec², Hao Wu², Julien Demouth², Song Han¹

Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, existing methods cannot maintain accuracy and hardware efficiency at the same time. We propose SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs. Based on the fact that weights are easy to quantize while activations are not, SmoothQuant smooths the activation outliers by migrating the quantization difficulty offline from activations to weights with a mathematically equivalent transformation. SmoothQuant enables INT8 quantization of both weights and activations for all the matrix multiplications in LLMs, including the OPT, BLOOM, GLM, MT-NLG, and LLaMA families. We demonstrate up to 1.56x speedup and 2x memory reduction for LLMs with negligible loss in accuracy. SmoothQuant enables serving a 530B LLM within a single node. Our work offers a turn-key solution that reduces hardware costs and democratizes LLMs.

We propose SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs.
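
As a rough illustration of the smoothing idea described in the abstract, the sketch below rescales each input channel so that activation outliers shrink while the weights absorb the difference, leaving the matrix product unchanged. The per-channel scale follows the form given in the SmoothQuant paper, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha). The helper names, the toy matrices, and alpha = 0.5 are assumptions chosen for illustration, and the INT8 quantization step itself is not shown.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def smooth(x, w, alpha=0.5, eps=1e-5):
    """Rescale activations x (tokens x channels) and weights w (channels x outputs).

    Returns (x_smooth, w_smooth, scales) with x_smooth @ w_smooth equal to
    x @ w up to floating-point rounding. alpha is the migration strength
    (0.5 here is an illustrative choice).
    """
    scales = []
    for j in range(len(w)):
        # Largest activation magnitude of input channel j over all tokens.
        act_max = max(max(abs(row[j]) for row in x), eps)
        # Largest weight magnitude of input channel j over all outputs.
        w_max = max(max(abs(v) for v in w[j]), eps)
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    # Divide activations by the scale and multiply weights by it: X W is unchanged.
    x_smooth = [[v / s for v, s in zip(row, scales)] for row in x]
    w_smooth = [[v * s for v in row] for row, s in zip(w, scales)]
    return x_smooth, w_smooth, scales


# Toy example: channel 1 of the activations carries an outlier.
x = [[1.0, 64.0], [-2.0, 16.0]]
w = [[0.5, -1.0], [0.25, 0.5]]
x_s, w_s, s = smooth(x, w)
print("original product:", matmul(x, w))
print("smoothed product:", matmul(x_s, w_s))
print("max |activation| before:", max(abs(v) for row in x for v in row))
print("max |activation| after: ", max(abs(v) for row in x_s for v in row))
```

Because the scaling is applied once to the stored weights, it adds no work at inference time in this sketch; only the activation division would need to be folded into a preceding operation.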

FlatFormer: Flattened Window Attention for Efficient Point Cloud Transformer

CVPR 2023

Zhijian Liu*, Xinyu Yang*, Haotian Tang, Shang Yang, Song Han

Transformer, as an alternative to CNN, has been proven effective in many modalities (e.g., texts and images). For 3D point cloud transformers, existing efforts focus primarily on pushing their accuracy to the state-of-the-art level. However, their latency lags behind sparse convolution-based models (3x slower), hindering their usage in resource-constrained, latency-sensitive applications (such as autonomous driving). This inefficiency comes from point clouds' sparse and irregular nature, whereas transformers are designed for dense, regular workloads. This paper presents FlatFormer to close this latency gap by trading spatial proximity for better computational regularity. We first flatten the point cloud with window-based sorting and partition points into groups of equal sizes rather than windows of equal shapes. This effectively avoids expensive structuring and padding overheads. We then apply self-attention within groups to extract local features, alternate sorting axis to gather features from different directions, and shift windows to exchange features across groups. FlatFormer delivers state-of-the-art accuracy on Waymo Open Dataset with 4.6x speedup over (transformer-based) SST and 1.4x speedup over (sparse convolutional) CenterPoint. This is the first point cloud transformer that achieves real-time performance on edge GPUs and is faster than sparse convolutional methods while achieving on-par or even superior accuracy on large-scale benchmarks.

We present FlatFormer, an efficient ViT architecture for large-scale point cloud analysis.
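
The flattening step described in the abstract (sort points window by window, then cut the sorted sequence into groups of equal size) can be sketched as follows. This is a minimal 2D illustration, not the authors' implementation: the function name, the exact sort key, and the handling of a shorter final group are assumptions.

```python
def flatten_and_group(points, window, group_size, axis="x"):
    """Sort 2D points window by window, then split them into equal-size groups.

    points     : list of (x, y) tuples
    window     : side length of the square sorting window
    group_size : number of points per group
    axis       : "x" or "y", the major sorting axis (alternated between layers)
    """
    def key(p):
        x, y = p
        wx, wy = int(x // window), int(y // window)  # window coordinates
        # Window coordinates come first, then position inside the window.
        return (wx, wy, x, y) if axis == "x" else (wy, wx, y, x)

    ordered = sorted(points, key=key)
    # Groups hold the same number of points regardless of how many points
    # fall in each window; only the last group may be shorter.
    return [ordered[i:i + group_size] for i in range(0, len(ordered), group_size)]


pts = [(0.5, 0.5), (3.2, 0.1), (0.7, 2.9), (2.5, 2.5), (0.1, 0.2), (3.9, 3.9)]
for axis in ("x", "y"):
    print(axis, flatten_and_group(pts, window=2.0, group_size=2, axis=axis))
```

Self-attention would then run inside each group. Switching `axis` between layers changes which neighbours end up grouped together, which is one way to read the abstract's "alternate sorting axis".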

Retrospective: EIE: Efficient Inference Engine on Sparse and Compressed Neural Network

ISCA 2023

Song Han¹³, Xingyu Liu⁴, Huizi Mao³, Jing Pu⁵, Ardavan Pedram²⁶, Mark A. Horowitz², William J. Dally²³

EIE proposed to accelerate pruned and compressed neural networks, exploiting weight sparsity, activation sparsity, and 4-bit weight-sharing in neural network accelerators. Since its publication at ISCA'16, it has opened a new design space for accelerating pruned and sparse neural networks and spawned many algorithm-hardware co-designs for model compression and acceleration, both in academia and in commercial AI chips. In retrospect, we review the background of this project, summarize the pros and cons, and discuss new opportunities where pruning, sparsity, and low precision can accelerate emerging deep learning workloads.

EIE proposed to accelerate pruned and compressed neural networks, exploiting weight sparsity, activation sparsity, and 4-bit weight-sharing in neural network accelerators.
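
The three ideas named above can be shown together in a small software sketch of a sparse matrix-vector product. It only illustrates the arithmetic and says nothing about the hardware design: the column-wise storage layout, the function name, and the 16-entry codebook values are assumptions made for this example.

```python
def sparse_matvec(columns, codebook, activations, n_rows):
    """Compute y = W @ a for a pruned, weight-shared matrix W.

    columns     : columns[j] is a list of (row, code) pairs for the non-zero
                  weights in column j
    codebook    : 16 shared weight values, indexed by a 4-bit code
    activations : input vector a
    n_rows      : number of output rows
    """
    y = [0.0] * n_rows
    for j, a in enumerate(activations):
        if a == 0:
            continue  # activation sparsity: a zero input skips its whole column
        for row, code in columns[j]:  # weight sparsity: only non-zeros are stored
            y[row] += codebook[code] * a  # weight sharing: code selects a shared value
    return y


# Hypothetical 16-entry codebook (4-bit codes) and a 3x3 pruned matrix.
codebook = [i * 0.25 - 2.0 for i in range(16)]
columns = [
    [(0, 10), (2, 12)],  # column 0: W[0][0] = 0.5, W[2][0] = 1.0
    [(1, 2)],            # column 1: never read below, since its activation is zero
    [(0, 4), (1, 15)],   # column 2: W[0][2] = -1.0, W[1][2] = 1.75
]
print(sparse_matvec(columns, codebook, [2.0, 0.0, -1.0], n_rows=3))
```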

Efficient AI Computing, Transforming the Future.
