Data-driven, automatic design space exploration of neural accelerator architectures is desirable for both specialization and productivity. Previous frameworks focus on sizing the numerical architectural hyper-parameters while neglecting the search for PE connectivities and compiler mappings. We push beyond searching only hardware hyper-parameters and propose the Neural Accelerator Architecture Search (NAAS), which fully exploits the hardware design space and the compiler mapping strategies at the same time. Unlike prior work that formulates the hardware parameter search as a pure sizing optimization, NAAS models the co-search as a two-level optimization problem, where each level combines indexing, ordering, and sizing optimization. To tackle these challenges, we propose an encoding method that casts non-numerical parameters, such as the loop order and the choice of parallel dimensions, as numerical parameters for optimization. Thanks to its low search cost, NAAS can be easily integrated with hardware-aware NAS algorithms by adding another optimization level, enabling the joint search of neural network architecture, accelerator architecture, and compiler mapping. NAAS thus composes highly matched architectures together with efficient mappings. As a data-driven approach, NAAS outperforms the human design Eyeriss by 4.4x EDP reduction with 2.7% accuracy improvement on ImageNet under the same computation resources, and offers 1.4x to 3.5x EDP reduction compared to sizing only the architectural hyper-parameters.
The overall design space can be categorized into three classes: the accelerator space, the compiler (mapping) space, and the neural network design space. Some of these parameters are numerical, such as the array size; others are non-numerical, such as the loop order and the dataflow.
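As a concrete illustration, the three classes might be organized as follows. The parameter names and value ranges below are illustrative assumptions, not the exact NAAS search space.

# Illustrative sketch of the three design spaces (names and ranges are
# assumptions, not the exact NAAS search space).
accelerator_space = {
    "pe_array_height": [8, 12, 16, 24, 32],   # numerical: sizing
    "pe_array_width":  [8, 12, 16, 24, 32],   # numerical: sizing
    "buffer_size_kb":  [64, 128, 256, 512],   # numerical: sizing
    "parallel_dims":   ["C", "K", "H", "W"],  # non-numerical: connectivity
}
compiler_space = {
    "loop_order": ["N", "C", "K", "H", "W", "R", "S"],  # non-numerical: permutation
    "tiling_factors": [1, 2, 4, 8, 16],                 # numerical, per loop dimension
}
neural_net_space = {
    "depth_per_stage":  [2, 3, 4],           # numerical
    "width_multiplier": [0.75, 1.0, 1.25],   # numerical
    "kernel_size":      [3, 5, 7],           # numerical
}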
This table shows the correlation between neural architecture and accelerator architecture, which is complicated and varies from accelerator to accelerator. A well-matched pair of architectures improves the utilization of the compute array and on-chip memories, maximizing efficiency and performance.
(N is NVDLA and E is Eyeriss)
To achieve holistic optimization, we propose the Neural Accelerator Architecture Search (NAAS). For a specific workload, accelerator architecture search and neural architecture search are conducted in one optimization loop, yielding a tailored accelerator together with a tailored neural network.
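The sketch below shows one way such a co-search loop could be organized. It is a minimal illustration under stated assumptions: the samplers and the cost model are random toy stand-ins, whereas NAAS drives the sampling with a black-box optimizer and evaluates candidates with an accelerator cost model (and a NAS level can be added as a third nested loop).

import random

def sample_accelerator(rng):
    # Toy hardware sampler; NAAS samples from the encoded accelerator space.
    return {"rows": rng.choice([8, 16, 32]), "cols": rng.choice([8, 16, 32])}

def sample_mapping(rng):
    # Toy mapping sampler: a random loop-order permutation.
    return {"loop_order": rng.sample(["N", "C", "K", "H", "W", "R", "S"], 7)}

def evaluate_edp(accel, mapping, rng):
    # Dummy energy-delay product; NAAS would query an accelerator cost model here.
    return rng.random() / (accel["rows"] * accel["cols"])

def co_search(iterations=1000, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(iterations):
        accel = sample_accelerator(rng)    # outer level: hardware parameters
        mapping = sample_mapping(rng)      # inner level: compiler mapping
        edp = evaluate_edp(accel, mapping, rng)
        if best is None or edp < best[0]:
            best = (edp, accel, mapping)
    return best

print(co_search())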
The convolution computation loop nest can be divided into two parts: temporal mapping and spatial parallelism. Loop tiling and loop ordering are reflected in the temporal mapping, while the hardware design can be inferred from the parallelism. Therefore, the PE connectivity can be modeled as the choice of parallel dimensions. For example, two parallel dimensions indicate a 2D array; parallelism in input channels (C) implies a reduction connection among the partial-sum accumulation registers inside the PEs, while parallelism in output channels means a broadcast connection to the input-feature registers inside the PEs. The hardware encoding vector contains two parts, architectural sizing and connectivity parameters, while the mapping encoding vector contains multiple parts, including the loop order at the PE level and the loop tiling at each array dimension level.
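As a sketch, the two encoding vectors might look like the following; the field names and values are assumptions for illustration, not the paper's exact layout.

# Illustrative encoding vectors (field names and values are assumptions).
hardware_encoding = {
    # architectural sizing (numerical)
    "num_pes":      256,
    "l1_buffer_kb": 64,
    "l2_buffer_kb": 512,
    # connectivity: the loop dimensions parallelized across the PE array.
    # ("C", "K") denotes a 2D array with partial-sum reduction along C
    # and input-feature broadcast along K.
    "parallel_dims": ("C", "K"),
}
mapping_encoding = {
    # temporal mapping: loop order at the PE level (outermost first)
    "loop_order": ["K", "C", "H", "W", "R", "S", "N"],
    # loop tiling factor per dimension at each array dimension level
    "tiling": {"C": 16, "K": 16, "H": 4, "W": 4, "R": 1, "S": 1, "N": 1},
}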
During our experiments, we found that the straightforward method of using indices to encode the selection of parallel dimensions and the loop order does not work well, because incrementing or decrementing an index conveys no physical meaning. To solve this problem, we propose an importance-based encoding method, sketched below. This strategy is interpretable, since the importance value represents the data locality of the dimension: the dimension labeled as most important has the best data locality since it becomes the outermost loop, while the dimension labeled as least important has the poorest data locality and therefore becomes the innermost loop.
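A minimal sketch of this decoding step, assuming the optimizer maintains one real-valued importance score per loop dimension (the helper names are hypothetical):

import numpy as np

DIMS = ["N", "C", "K", "H", "W", "R", "S"]

def decode_loop_order(importance):
    # Sort dimensions by descending importance: the most important
    # dimension becomes the outermost loop, the least important the
    # innermost, matching the locality interpretation above.
    order = np.argsort(-np.asarray(importance))
    return [DIMS[i] for i in order]

def decode_parallel_dims(importance, k=2):
    # The same trick selects parallel dimensions: keep the top-k scores.
    top = np.argsort(-np.asarray(importance))[:k]
    return [DIMS[i] for i in top]

print(decode_loop_order([0.1, 0.9, 0.8, 0.3, 0.2, 0.05, 0.0]))
# -> ['C', 'K', 'H', 'W', 'N', 'R', 'S'] (outermost to innermost)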
@inproceedings{lin2020naas,
title={{NAAS: Neural Accelerator Architecture Search}},
author={Lin, Yujun and Yang, Mengtian and Han, Song},
booktitle={2021 58th ACM/IEEE Design Automation Conference (DAC)},
year={2021}
}
This work was supported by the SRC GRC program under task 2944.001. We also thank the AWS Machine Learning Research Awards for the computational resources.