Efficiently Understanding Videos, Point Clouds, and Natural Language on NVIDIA Jetson Xavier NX

We were lucky to receive the Jetson Xavier NX developer kit from NVIDIA upon its early release, and the user experience of the dev kit was fantastic. The box shipped with two OS-installed SD cards; after one was inserted and the board powered on, Jetson Xavier NX booted up smoothly. The interface is the same as the previous generation's, which enabled us to deploy our existing models and design new models very quickly.

Thanks to NVIDIA’s amazing deep learning ecosystem, we were able to deploy three applications on Jetson Xavier NX soon after receiving the kit: efficient video understanding with the Temporal Shift Module (TSM, ICCV’19) [1], efficient 3D deep learning with Point-Voxel CNN (PVCNN, NeurIPS’19) [2], and efficient machine translation with the Hardware-Aware Transformer (HAT, ACL’20) [3]. Remarkably, we also designed an efficient Transformer specialized for the hardware resources on NX, which achieved a 2.6x speedup over the vanilla Transformer on NX with no loss of accuracy on the WMT’14 English-to-German translation task. Thanks to the seamless integration of training and inference GPUs, we were able to complete the hardware-aware Transformer design within 4 days.

Jetson Xavier NX is very powerful compared with previous generations of Jetson devices. Due to the short amount of time, we only measured fp32 performance, without quantization or DLA, which could bring further speedups.

Efficient Video Understanding

Efficient video understanding on edge devices is challenging but very important (e.g., in healthcare applications where privacy matters, patients’ data cannot be transmitted to the cloud). TSM is a lightweight model for efficient video understanding on edge devices. TSM introduces a cheap temporal shift operation, with zero FLOPs and zero parameters, to perform temporal modeling. “2D Conv + temporal shift” requires far less computation than 3D Conv, yet achieves even better accuracy.
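The shift operation itself is simple: along the channel dimension, a small fraction of channels is shifted one frame forward in time, another fraction one frame backward, and the rest is left untouched, so neighboring frames exchange information essentially for free. Below is a minimal NumPy sketch of the idea (the released TSM code is in PyTorch; `shift_div`, the fraction of shifted channels, follows the common 1/8 default):

```python
import numpy as np

def temporal_shift(x, shift_div=8):
    """Shift part of the channels along the temporal dimension.

    x: activations of shape (N, T, C, H, W).
    The shift is pure data movement: zero FLOPs, zero parameters.
    """
    n, t, c, h, w = x.shape
    fold = c // shift_div
    out = np.zeros_like(x)
    # First fold of channels: shift backward in time (frame t sees frame t+1).
    out[:, :-1, :fold] = x[:, 1:, :fold]
    # Second fold: shift forward in time (frame t sees frame t-1).
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]
    # Remaining channels are left in place.
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]
    return out
```

Inserted before each 2D convolution, this lets an ordinary 2D backbone model temporal relationships without any extra multiply-adds.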

Thanks to the powerful computation resource on Jetson Xavier NX, we can improve the processing speed of TSM (with ResNet-50 backbone, 8 frames/video) from 1.8 videos/s (on Jetson Nano) to 5.5 videos/s, making edge inference possible.

The Temporal Shift Module can recognize gestures and control Google Maps interactively. The inference runs in real time on a Jetson Nano with a TSM + MobileNet-V2 backbone. The project is open source: https://tsm-hanlab.mit.edu

Efficient 3D Deep Learning

3D neural networks are widely used in real-world applications (e.g., AR/VR headsets, self-driving cars). They are required to be both fast and accurate; however, the limited hardware resources on edge devices make these requirements challenging to meet. We observe that the main bottlenecks of 3D deep learning are the cubically growing memory footprint of voxel representations and the irregular memory access caused by point-cloud sparsity. We introduce an efficient primitive, Point-Voxel CNN (PVCNN), which contains a point branch to reduce the memory footprint and a voxel branch to reduce irregular memory access. It is efficient and lightweight, making it well suited for LiDAR perception on edge devices.
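To make the two-branch idea concrete, here is a toy NumPy sketch of a point-voxel layer. It is purely illustrative: the real PVCNN uses trainable 3D convolutions, shared MLPs, and optimized CUDA voxelization kernels, for which the simple averaging, gathering, and `tanh` below are stand-ins.

```python
import numpy as np

def pvconv(points, feats, resolution=8):
    """Toy point-voxel layer: coarse voxel branch + fine point branch.

    points: (N, 3) coordinates normalized to [0, 1); feats: (N, C) features.
    """
    r, c = resolution, feats.shape[1]
    # Assign each point to a voxel in a low-resolution r^3 grid.
    idx = np.minimum((points * r).astype(int), r - 1)

    # Voxel branch: average point features into the regular grid, where a
    # real 3D convolution would cheaply aggregate local neighborhoods.
    grid = np.zeros((r, r, r, c))
    count = np.zeros((r, r, r, 1))
    for i, f in zip(idx, feats):
        grid[tuple(i)] += f
        count[tuple(i)] += 1
    grid /= np.maximum(count, 1)

    # Devoxelize: gather each point's coarse voxel feature back.
    voxel_feats = np.stack([grid[tuple(i)] for i in idx])

    # Point branch: a cheap per-point transform (stand-in for a shared MLP)
    # keeps the high-resolution detail that voxelization loses.
    point_feats = np.tanh(feats)

    # Fuse the two branches.
    return voxel_feats + point_feats
```

The key point is that the expensive neighborhood aggregation happens on a small regular grid (regular memory access), while per-point processing stays at full resolution with a small memory footprint.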

PVCNN has been deployed on the MIT Driverless race car for efficient LiDAR perception, achieving better accuracy and detection range. The project is open source: https://pvcnn.mit.edu

Thanks to the powerful computation resources on Jetson Xavier NX and our efficient 3D deep learning model (PVCNN), we can achieve a higher frame rate than on Jetson Nano and Jetson TX2.

The fastest version of our PVCNN runs at 140 FPS on Jetson Xavier NX, more than 2x faster than on TX2 and 4x faster than on Nano.

Efficient Natural Language Processing

Natural Language Processing (NLP) is a key technique in plenty of real-world applications such as machine translation, document summarization, and chatbots. However, the high accuracy of Transformer-based NLP models comes at the cost of large model size and high computation complexity, which makes them challenging to deploy on mobile devices. Therefore, we need hardware-efficient Transformers. We find two common pitfalls when evaluating the efficiency of a Transformer: (i) FLOPs do not reflect the real latency; (ii) different hardware prefers different Transformer architectures. It is therefore critical to involve hardware feedback when designing efficient models. We propose the Hardware-Aware Transformer (HAT) [3] to efficiently search for fast models in a Transformer supernet with hardware latency feedback in the loop. HAT achieves a 2.7x speedup over the non-specialized Transformer baseline on Jetson NX.
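The search loop can be sketched as sampling sub-Transformers from a supernet and using a hardware latency model to prune candidates that miss the latency budget. Everything in this sketch is an illustrative stand-in, not HAT's actual components: the search space, the toy latency predictor (HAT trains a predictor on real on-device measurements), and the quality proxy (HAT evaluates sub-networks inherited from the trained supernet).

```python
import random

# Hypothetical search space for sub-Transformers sampled from a supernet.
SPACE = {
    "layers": [2, 4, 6],
    "dim": [256, 384, 512],
    "heads": [2, 4, 8],
}

def predicted_latency(cfg):
    # Stand-in for a hardware latency predictor: a crude cost model
    # instead of a model fitted to real on-device measurements.
    return cfg["layers"] * cfg["dim"] * cfg["heads"] * 1e-5

def proxy_quality(cfg):
    # Stand-in for the validation quality (e.g., BLEU) of the
    # sub-network; here, bigger models simply score higher.
    return cfg["layers"] * cfg["dim"]

def evolutionary_search(latency_budget, iters=200, seed=0):
    """Keep the best-quality candidate that fits the latency budget."""
    rng = random.Random(seed)
    best = None
    for _ in range(iters):
        cfg = {k: rng.choice(v) for k, v in SPACE.items()}
        if predicted_latency(cfg) > latency_budget:
            continue  # hardware feedback prunes too-slow candidates
        if best is None or proxy_quality(cfg) > proxy_quality(best):
            best = cfg
    return best
```

Because the latency check is a cheap prediction rather than a full benchmark, thousands of candidates can be screened per target device, which is what makes per-hardware specialization practical.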

To translate a sentence with 30 tokens, the Transformer baseline on NX achieves 1.4 sentences/s, which is 2x faster than on Nano. We then designed a hardware-aware Transformer (ACL’20) specialized for the hardware resources of Jetson NX, making it 2.7x faster than the vanilla Transformer on the same hardware platform. The project is open source: https://hat.mit.edu

We searched for an efficient HAT model specialized for the hardware resources on NX, which achieves a 2.7x speedup over the vanilla Transformer on NX with no loss of accuracy on the WMT’14 English-to-German translation task. Thanks to the seamless integration of NX with the GPU ecosystem, we were able to complete the hardware-aware Transformer design within 4 days.


[1] Ji Lin, Chuang Gan, Song Han, “TSM: Temporal Shift Module for Efficient Video Understanding”, ICCV 2019.
[2] Zhijian Liu, Haotian Tang, Yujun Lin, Song Han, “Point-Voxel CNN for Efficient 3D Deep Learning”, NeurIPS 2019.
[3] Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, Song Han, “HAT: Hardware-Aware Transformers for Efficient Natural Language Processing”, ACL 2020.