On-device learning and efficient fine-tuning enable continuous and privacy-preserving customization (e.g., locally fine-tuning large language models on personalized data). However, existing training frameworks are designed for cloud servers with powerful accelerators (e.g., GPUs, TPUs) and lack the optimizations needed for learning on the edge, which faces the challenges of limited resources and diverse edge hardware. In this paper, we introduce PockEngine: a tiny, sparse, and efficient engine that enables fine-tuning on a variety of edge devices. PockEngine supports sparse backpropagation: it prunes the backward graph and sparsely updates the model, yielding measured memory savings and latency reduction while maintaining model quality. Second, PockEngine is compilation-first: the entire training graph (including forward, backward, and optimization steps) is derived at compile time, which reduces runtime overhead and opens opportunities for graph transformations. PockEngine also integrates a rich set of training graph optimizations, such as operator reordering and backend switching, which further reduce the training cost. PockEngine supports diverse applications, frontends, and hardware backends: it flexibly compiles and tunes models defined in PyTorch/TensorFlow/Jax and deploys binaries to mobile CPUs/GPUs/DSPs. We evaluated PockEngine on both vision models and large language models. PockEngine achieves up to a 15x speedup over off-the-shelf TensorFlow (Raspberry Pi) and a 5.6x memory saving for back-propagation (Jetson Orin). Remarkably, PockEngine enables fine-tuning LLaMA2-7B on NVIDIA Jetson Orin at 550 tokens/s, 7.9x faster than PyTorch.
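To make the sparse-update idea concrete, below is a minimal PyTorch sketch that freezes most weights and trains only the biases plus the last few blocks of a vision model. The layer selection and hyperparameters here are illustrative assumptions, not PockEngine's API or its measured configuration.

```python
import torch
from torchvision.models import mobilenet_v2

# Hypothetical sparse-update configuration: train all biases, but weights only
# in the last two feature blocks and the classifier (illustrative choice).
model = mobilenet_v2(weights=None)
trainable_blocks = ("features.17", "features.18", "classifier")

for name, param in model.named_parameters():
    is_bias = name.endswith(".bias")
    in_block = name.startswith(trainable_blocks)
    param.requires_grad = is_bias or in_block

# Only parameters with requires_grad=True produce gradients, so the backward
# graph and the optimizer state shrink accordingly.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.05, momentum=0.9)
```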
However, most current DL frameworks are primarily designed for the cloud and lack optimization for edge devices. Therefore, we propose PockEngine, which offers the following features:

- Sparse backpropagation: the backward graph is pruned and the model is updated sparsely, saving memory and latency while maintaining model quality.
- Compilation-first design: the entire training graph (forward, backward, and optimizer steps) is derived at compile time, reducing runtime overhead and enabling graph transformations (see the sketch below).
- Training graph optimizations, such as operator reordering and backend switching, which further reduce the training cost.
- Diverse frontends and backends: models defined in PyTorch/TensorFlow/Jax can be compiled and deployed as binaries to mobile CPUs/GPUs/DSPs.
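PockEngine's compiler itself is not shown here. As a loose analogy for the compilation-first idea (capturing forward, backward, and the optimizer update as one graph ahead of eager execution), PyTorch's torch.compile can wrap a whole training step; this only illustrates the concept and is not PockEngine's implementation.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Sequential(torch.nn.Linear(784, 256), torch.nn.ReLU(),
                            torch.nn.Linear(256, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

@torch.compile  # traces the full step (forward, backward, update) instead of dispatching op by op
def train_step(x, y):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    return loss

loss = train_step(torch.randn(32, 784), torch.randint(0, 10, (32,)))
```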
@inproceedings{zhu2023pockengine,
title={PockEngine: Sparse and Efficient Fine-tuning in a Pocket},
author={Zhu, Ligeng and Hu, Lanxiang and Lin, Ji and Wang, Wei-Chen and Chen, Wei-Ming and Han, Song},
booktitle={IEEE/ACM International Symposium on Microarchitecture (MICRO)},
year={2023}
}
This work was supported by MIT-IBM Watson AI Lab, MIT AI Hardware Program, MIT-Amazon Science Hub, and NSF. Ligeng Zhu and Ji Lin were partially supported by the Qualcomm Innovation Fellowship.