PockEngine: Sparse and Efficient Fine-tuning in a Pocket

Ligeng Zhu, Lanxiang Hu, Ji Lin, Wei-Chen Wang, Wei-Ming Chen, Chuang Gan, Song Han
MIT, UCSD, MIT-IBM Watson AI Lab, NVIDIA

Abstract

On-device learning and efficient fine-tuning enable continuous and privacy-preserving customization (e.g., locally fine-tuning large language models on personalized data). However, existing training frameworks are designed for cloud servers with powerful accelerators (e.g., GPUs, TPUs) and lack optimizations for learning on the edge, which faces the challenges of limited resources and diverse edge hardware. In this paper, we introduce PockEngine: a tiny, sparse, and efficient engine that enables fine-tuning on various edge devices. First, PockEngine supports sparse backpropagation: it prunes the backward graph and sparsely updates the model, with measured memory savings and latency reduction, while maintaining model quality. Second, PockEngine is compilation-first: the entire training graph (including the forward, backward, and optimization steps) is derived at compile time, which reduces runtime overhead and opens opportunities for graph transformations. PockEngine also integrates a rich set of training graph optimizations, including operator reordering and backend switching, to further reduce training cost. PockEngine supports diverse applications, frontends, and hardware backends: it flexibly compiles and tunes models defined in PyTorch/TensorFlow/Jax and deploys binaries to mobile CPUs/GPUs/DSPs. We evaluated PockEngine on both vision models and large language models. PockEngine achieves up to a 15x speedup over off-the-shelf TensorFlow (Raspberry Pi) and a 5.6x memory saving during backpropagation (Jetson Orin). Remarkably, PockEngine enables fine-tuning LLaMA2-7B on NVIDIA Jetson Orin at 550 tokens/s, 7.9x faster than PyTorch.

Presentation

Why do we need on-device training?

  1. Privacy: Data never leaves the device. This is especially important for sensitive enterprise data, such as the code a coding copilot works on.
  2. Customization: Models continuously adapt to new data.
  3. Low Cost: There is no need to rent a cloud server; you can fine-tune an LLM directly on your edge device.

However, most current DL frameworks are primarily designed for the cloud and lack optimization for edge devices. Therefore, we propose PockEngine, which offers the following features:

Sparse backpropagation

  • Sparse layer BP: skips updating unimportant layers
  • Sparse tensor BP: skips updating unimportant channels (see the sketch after this list)
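
Below is a minimal eager-PyTorch sketch of both ideas; the layer and channel choices are arbitrary stand-ins (PockEngine selects them by importance and, unlike this sketch, prunes the skipped computation out of the backward graph at compile time):

import torch
from torch import nn

# A small stand-in for a pretrained backbone.
model = nn.Sequential(
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

# Sparse layer BP: freeze an unimportant layer so autograd skips its
# weight gradients entirely.
for p in model[0].parameters():
    p.requires_grad = False

# Sparse tensor BP: keep only the first 16 output channels of another
# layer trainable by masking its weight gradient.
mask = torch.zeros(64, 1)
mask[:16] = 1.0
model[2].weight.register_hook(lambda g: g * mask)

loss = model(torch.randn(8, 64)).sum()
loss.backward()
print(model[0].weight.grad)                  # None: layer update skipped
print(model[2].weight.grad[20].abs().sum())  # 0.0: masked channel skipped

Note that eager PyTorch still materializes the full backward pass here and merely zeroes the unwanted updates; the measured memory saving and latency reduction come from removing these operators from the compiled backward graph, which is what PockEngine does.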

Compiler support:

  • Removes pruned operators through dead code elimination (see the sketch after this list)
  • Shifts auto-diff, pruning, and graph optimizations from runtime to compile time
  • Enables inference-only frameworks to perform training

Results:

  • Enables on-device training on resource-constrained devices.
  • 4.4x speedup compared to PyTorch on Jetson Orin and 15x on Raspberry Pi.

On-Device LLM Fine-Tuning:

  • PockEngine accelerates training by 4.4x compared to PyTorch.
  • Sparse BP speeds up training while achieving comparable accuracy (a rough sketch follows below).
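
As a rough illustration of the recipe on an LLM, here is a hypothetical eager-PyTorch sketch using Hugging Face transformers rather than PockEngine (the checkpoint name, the choice of the last four blocks, and the assumption of enough memory are all illustrative):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint; requires access
model = AutoModelForCausalLM.from_pretrained(name)
tok = AutoTokenizer.from_pretrained(name)

# Freeze everything, then re-enable a small trainable tail.
model.requires_grad_(False)
for block in model.model.layers[-4:]:  # last 4 transformer blocks (illustrative)
    block.requires_grad_(True)
model.lm_head.requires_grad_(True)

batch = tok("On-device fine-tuning keeps data local.", return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()  # backpropagation stops at the earliest trainable block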

Citation

@inproceedings{zhu2023pockengine,
  title={PockEngine: Sparse and Efficient Fine-tuning in a Pocket},
  author={Zhu, Ligeng and Hu, Lanxiang and Lin, Ji and Wang, Wei-Chen and Chen, Wei-Ming and Gan, Chuang and Han, Song},
  booktitle={IEEE/ACM International Symposium on Microarchitecture (MICRO)},
  year={2023}
}

Acknowledgment

This work was supported by MIT-IBM Watson AI Lab, MIT AI Hardware Program, MIT-Amazon Science Hub, and NSF. Ligeng Zhu and Ji Lin were partially supported by the Qualcomm Innovation Fellowship.
