Projects

PockEngine: Sparse and Efficient Fine-tuning in a Pocket

MICRO 2023

Ligeng Zhu, Lanxiang Hu, Ji Lin, Wei-Chen Wang, Wei-Ming Chen, Chuang Gan, Song Han

On-device learning and efficient fine-tuning enable continuous and privacy-preserving customization (e.g., locally fine-tuning large language models on personalized data). However, existing training frameworks are designed for cloud servers with powerful accelerators (e.g., GPUs, TPUs) and lack the optimizations for learning on the edge, which faces challenges of resource limitations and edge hardware diversity. In this paper, we introduce PockEngine: a tiny, sparse and efficient engine to enable fine-tuning on various edge devices. PockEngine supports sparse backpropagation: it prunes the backward graph and sparsely updates the model with measured memory saving and latency reduction while maintaining the model quality. Secondly, PockEngine is compilation first: the entire training graph (including forward, backward and optimization steps) is derived at compile-time, which reduces the runtime overhead and brings opportunities for graph transformations. PockEngine also integrates a rich set of training graph optimizations, thus can further accelerate the training cost, including operator reordering and backend switching. PockEngine supports diverse applications, frontends and hardware backends: it flexibly compiles and tunes models defined in PyTorch/TensorFlow/Jax and deploys binaries to mobile CPU/GPU/DSPs. We evaluated PockEngine on both vision models and large language models. PockEngine achieves up to 15x speedup over off-the-shelf TensorFlow (Raspberry Pi), 5.6x memory saving back-propagation (Jetson Orin). Remarkably, PockEngine enables fine-tuning LLaMA2-7B on NVIDIA Jetson Orin at 550 tokens/s, 7.9x faster than PyTorch.

More Close

PockEngine: Sparse and Efficient Fine-tuning in a Pocket

MICRO 2023

(

)

This project introduce PockEngine: a tiny, sparse and efficient engine to enable fine-tuning on various edge devices. PockEngine supports sparse backpropagation: it prunes the backward graph and sparsely updates the model with measured memory saving and latency reduction while maintaining the model quality.

EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction

ICCV 2023

Han Cai, Junyan Li, Muyan Hu, Chuang Gan, Song Han

High-resolution dense prediction enables many appealing real-world applications, such as computational photography, autonomous driving, etc. However, the vast computational cost makes deploying state-of-the-art high-resolution dense prediction models on hardware devices difficult. This work presents EfficientViT, a new family of high-resolution vision models with novel lightweight multi-scale attention. Unlike prior high-resolution dense prediction models that rely on heavy self-attention, hardware-inefficient large-kernel convolution, or complicated topology structure to obtain good performances, our lightweight multi-scale attention achieves a global receptive field and multi-scale learning (two critical features for high-resolution dense prediction) with only lightweight and hardware-efficient operations. As such, EfficientViT delivers remarkable performance gains over previous state-of-the-art high-resolution dense prediction models with significant speedup on diverse hardware platforms, including mobile CPU, edge GPU, and cloud GPU. Without performance loss on Cityscapes, our EfficientViT provides up to 8.8x and 3.8x GPU latency reduction over SegFormer and SegNeXt, respectively. For super-resolution, EfficientViT provides up to 6.4x speedup over Restormer while providing 0.11dB gain in PSNR.

More Close

EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction

ICCV 2023

(

)

EfficientViT is a new family of vision models for high-resolution dense prediction. It achieves global receptive field and multi-scale learning with only hardware-efficient operations. EfficientViT delivers remarkable performance gains over previous models with speedup on diverse hardware platforms, including mobile CPU, edge GPU, and cloud GPU.

SpAtten-Chip: A Fully-Integrated Energy-Scalable Transformer Accelerator Supporting Adaptive Model Configuration and Word Elimination for Language Understanding on Edge Devices

ISLPED 2023

Zexi Ji, Hanrui Wang, Miaorong Wang, Win-San Khwa, Meng-Fan Chang, Song Han, Anantha P Chandrakasan

Efficient natural language processing on the edge is needed to interpret voice commands, which have become a standard way to interact with devices around us. Due to the tight power and compute constraints of edge devices, it is important to adapt the computation to the hardware conditions. We present a Transformer accelerator with a variable-depth adder tree to support different model dimensions, a SuperTransformer model from which Sub Transformers of various sizes can be sampled enabling adaptive model configuration, and a dedicated word elimination unit to prune redundant tokens. We achieve up to 6.9× scalability in network latency and energy between the largest and smallest Sub Transformers, under the same operating conditions. Word elimination can reduce network energy by 16%, with a 14.5% drop in F1 score. At 0.68V and 80MHz, processing a 32-length input with our custom 2-layer Transformer model for intent detection and slot filling takes 0.61ms and 1.6μJ.

More Close

SpAtten-Chip: A Fully-Integrated Energy-Scalable Transformer Accelerator Supporting Adaptive Model Configuration and Word Elimination for Language Understanding on Edge Devices

ISLPED 2023

(

)

We design and tape-out the SpAtten architecture in TSMC 28nm technology digital chip.

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

ICML 2023

Guangxuan Xiao*¹, Ji Lin*¹, Mickael Seznec², Hao Wu², Julien Demouth², Song Han¹

Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, existing methods cannot maintain accuracy and hardware efficiency at the same time. We propose SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs. Based on the fact that weights are easy to quantize while activations are not, SmoothQuant smooths the activation outliers by offline migrating the quantization difficulty from activations to weights with a mathematically equivalent transformation. SmoothQuant enables an INT8 quantization of both weights and activations for all the matrix multiplications in LLMs, including OPT, BLOOM, GLM, MT-NLG, and LLaMA families. We demonstrate up to 1.56x speedup and 2x memory reduction for LLMs with negligible loss in accuracy. SmoothQuant enables serving 530B LLM within a single node. Our work offers a turn-key solution that reduces hardware costs and democratizes LLMs.

More Close

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

ICML 2023

(

)

We propose SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs.

Efficient AI Computing,Transforming the Future.

PockEngine: Sparse and Efficient Fine-tuning in a Pocket

PockEngine: Sparse and Efficient Fine-tuning in a Pocket

EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction

EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction

SpAtten-Chip: A Fully-Integrated Energy-Scalable Transformer Accelerator Supporting Adaptive Model Configuration and Word Elimination for Language Understanding on Edge Devices

SpAtten-Chip: A Fully-Integrated Energy-Scalable Transformer Accelerator Supporting Adaptive Model Configuration and Word Elimination for Language Understanding on Edge Devices

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Efficient AI Computing,
Transforming the Future.