MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning

Ji Lin, Wei-Ming Chen, Han Cai, Chuang Gan, Song Han
Massachusetts Institute of Technology, MIT-IBM Watson AI Lab
(* indicates equal contribution)




Tiny deep learning on microcontroller units (MCUs) is challenging due to the limited memory size. We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs: the first several blocks have an order of magnitude larger memory usage than the rest of the network. To alleviate this issue, we propose a generic patch-by-patch inference scheduling, which operates only on a small spatial region of the feature map and significantly cuts down the peak memory. However, naive implementation brings overlapping patches and computation overhead. We further propose network redistribution to shift the receptive field and FLOPs to the later stage and reduce the computation overhead. Manually redistributing the receptive field is difficult. We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2. Patch-based inference effectively reduces the peak memory usage of existing networks by 4-8x. Co-designed with neural networks, MCUNetV2 sets a record ImageNet accuracy on MCU (71.8%) and achieves >90% accuracy on the visual wake words dataset under only 32kB SRAM. MCUNetV2 also unblocks object detection on tiny devices, achieving 16.9% higher mAP on Pascal VOC compared to the state-of-the-art result. Our study largely addressed the memory bottleneck in tinyML and paved the way for various vision applications beyond image classification.

Problem: Imbalanced Memory Distribution of CNNs

The memory distribution of CNNs is usually highly imbalanced, with the first several layers dominating the memory usage.
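To make the imbalance concrete, here is a minimal sketch that estimates per-layer activation memory for a toy CNN. The layer list, the int8 activation assumption, and the "input buffer plus output buffer" memory model are illustrative assumptions, not the configuration measured in the paper.

```python
def activation_memory(input_hw, in_ch, layers, bytes_per_elem=1):
    """Estimate activation memory per layer as input buffer + output buffer.

    layers: list of (stride, out_channels) tuples (hypothetical toy network).
    bytes_per_elem=1 assumes int8 activations.
    """
    usage = []
    hw, ch = input_hw, in_ch
    for stride, out_ch in layers:
        out_hw = hw // stride
        in_bytes = hw * hw * ch * bytes_per_elem
        out_bytes = out_hw * out_hw * out_ch * bytes_per_elem
        usage.append(in_bytes + out_bytes)
        hw, ch = out_hw, out_ch
    return usage


# Toy network: resolution halves repeatedly while channels grow.
toy_layers = [(2, 16), (1, 16), (2, 24), (2, 32), (2, 64), (2, 128)]
usage = activation_memory(224, 3, toy_layers)

for idx, mem in enumerate(usage):
    print(f"layer {idx}: {mem / 1024:.1f} kB")
print(f"peak / last layer: {max(usage) / usage[-1]:.1f}x")
```

In this toy setting the early, high-resolution layers need more than ten times the memory of the last layer, which mirrors the imbalance described above.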

1. Save Memory with Patch-based Inference

We can dramatically reduce the peak inference memory by using patch-based inference for the memory-intensive stage of CNNs.

For MobileNetV2, using patch-based inference allows us to reduce the peak memory by 8x.
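The idea can be sketched on a toy example: two stacked 3x3 filters are run either on the whole feature map or tile by tile, where each tile reads its input region plus a halo of neighboring pixels. This is a simplified illustration in pure Python (unpadded "valid" filters, a single channel, hypothetical function names), not the MCUNetV2 implementation.

```python
def conv3x3_valid(x):
    """3x3 sum filter without padding; output shrinks by 2 in each dimension."""
    h, w = len(x), len(x[0])
    return [[sum(x[i + di][j + dj] for di in range(3) for dj in range(3))
             for j in range(w - 2)] for i in range(h - 2)]


def per_layer_inference(x):
    """Run both layers on the full map; return output and intermediate size."""
    mid = conv3x3_valid(x)
    return conv3x3_valid(mid), len(mid) * len(mid[0])


def patch_based_inference(x, tile):
    """Run both layers one output tile at a time.

    Two 3x3 valid filters have a 5x5 receptive field, so a tile of size
    t x t needs a (t + 4) x (t + 4) input crop. Neighboring crops overlap.
    """
    out_h, out_w = len(x) - 4, len(x[0]) - 4
    out = [[0] * out_w for _ in range(out_h)]
    peak_mid = 0
    for r in range(0, out_h, tile):
        for c in range(0, out_w, tile):
            th, tw = min(tile, out_h - r), min(tile, out_w - c)
            crop = [row[c:c + tw + 4] for row in x[r:r + th + 4]]
            mid = conv3x3_valid(crop)
            peak_mid = max(peak_mid, len(mid) * len(mid[0]))
            res = conv3x3_valid(mid)
            for i in range(th):
                out[r + i][c:c + tw] = res[i]
    return out, peak_mid


image = [[(7 * i + 3 * j) % 11 for j in range(20)] for i in range(20)]
full_out, full_mid = per_layer_inference(image)
patch_out, patch_mid = patch_based_inference(image, tile=4)

print("outputs identical:", full_out == patch_out)
print("largest intermediate map (elements):", full_mid, "vs", patch_mid)
```

Both schedules produce the same output, but the patch-based one only ever holds a small intermediate map in memory. The price is that the overlapping halo regions are computed more than once.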

2. Receptive Field Redistribution to Reduce Computation Overhead

Patch-based inference incurs computation overhead because neighboring patches overlap. To reduce the overlap, we propose to redistribute the receptive field (RF): we reduce the RF of the per-patch stage and increase the RF of the later per-layer stage.

After redistribution, the computation overhead of MobileNetV2 with patch-based inference reduces from 10% to 3%, while the performance remains the same.
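The link between receptive field and overlap can be sketched with the standard receptive-field recurrence. The layer configurations below are hypothetical, and the ratio computed here is a proxy based on input pixels read per patch, not the FLOPs overhead reported above.

```python
def receptive_field(layers):
    """Return (receptive field, cumulative stride) for a stack of layers.

    layers: list of (kernel_size, stride) tuples.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf, jump


def input_overlap_ratio(layers, out_size, n_patches):
    """Input pixels read by all patches divided by pixels read in one pass.

    Assumes an n_patches x n_patches grid over a square output of side
    out_size, with out_size divisible by n_patches. 1.0 means no overlap.
    """
    rf, jump = receptive_field(layers)
    tile = out_size // n_patches
    patch_in = (tile - 1) * jump + rf
    full_in = (out_size - 1) * jump + rf
    return (n_patches * patch_in) ** 2 / full_in ** 2


# Hypothetical per-patch stages: same strides, different kernel sizes.
before = [(3, 2), (3, 1), (3, 2), (3, 1)]   # large RF in the per-patch stage
after = [(3, 2), (1, 1), (3, 2), (1, 1)]    # some 3x3 kernels replaced by 1x1

rf_before, _ = receptive_field(before)
rf_after, _ = receptive_field(after)
ratio_before = input_overlap_ratio(before, out_size=28, n_patches=4)
ratio_after = input_overlap_ratio(after, out_size=28, n_patches=4)

print(f"RF {rf_before} -> overlap ratio {ratio_before:.2f}")
print(f"RF {rf_after} -> overlap ratio {ratio_after:.2f}")
```

A smaller receptive field in the per-patch stage means a smaller halo around each patch, so less work is repeated. The receptive field removed here would have to be added back in the later per-layer stage to keep the network's overall capacity.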

3. Joint Automated Optimization of Neural Architecture and Inference Scheduling

Redistributing RF requires manual tuning. We employ neural architecture search techniques to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2.
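As a rough illustration of why searching the architecture and the schedule together helps, the toy grid search below picks a width multiplier, input resolution, and patch count under a 32 kB memory budget. The search space, the memory proxy (which ignores halo overhead), and the capacity score are all assumptions made for this sketch; the actual method uses neural architecture search rather than exhaustive enumeration.

```python
from itertools import product

BUDGET_BYTES = 32 * 1024  # 32 kB SRAM budget


def peak_memory(width, resolution, n_patches, base_channels=16):
    """Proxy: bytes of the first-stage int8 feature map held per patch."""
    side = resolution // 2                 # after an assumed stride-2 stem
    tile = -(-side // n_patches)           # ceiling division
    return tile * tile * int(base_channels * width)


def search(widths, resolutions, patch_counts, budget=BUDGET_BYTES):
    """Return the feasible (width, resolution, n_patches) with best score.

    Score proxy: width * resolution (larger models score higher).
    Ties are broken in favor of fewer patches, i.e. less recomputation.
    """
    feasible = [cfg for cfg in product(widths, resolutions, patch_counts)
                if peak_memory(*cfg) <= budget]
    return max(feasible, key=lambda cfg: (cfg[0] * cfg[1], -cfg[2]))


widths = [0.5, 0.75, 1.0]
resolutions = [96, 128, 160]

best_fixed = search(widths, resolutions, patch_counts=[1])        # per-layer only
best_joint = search(widths, resolutions, patch_counts=[1, 2, 3, 4])

print("architecture search only:", best_fixed)
print("joint with scheduling:   ", best_joint)
```

With per-layer inference only, the budget forces a narrower, lower-resolution model. Once the patch count is part of the search, the largest candidate fits in the same budget.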

Experimental Results

  • ImageNet (higher accuracy at the same memory budgets)
  • VWW (smaller memory usage, higher accuracy)
  • WIDER Face (better performance at the same memory budgets)




title={MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning},
author={Lin, Ji and Chen, Wei-Ming and Cai, Han and Gan, Chuang and Han, Song},
booktitle={Annual Conference on Neural Information Processing Systems (NeurIPS)},





We thank MIT-IBM Watson AI Lab, Samsung, Woodside Energy, and NSF CAREER Award #1943349 for supporting this research.
