Running large language models (LLMs) on the edge is of great importance. In this blog, we introduce TinyChat, an efficient and lightweight system for LLM deployment on the edge. It runs Meta's latest LLaMA-2 model at 30 tokens / second on NVIDIA Jetson Orin and can easily support different models and hardware.
By embedding LLMs directly into real-world systems such as copilot services (coding, smart reply and office) on laptops, in-car entertainment systems, vision-language assistants in robots or spaceship control interfaces, users can access instant responses and services without relying on a stable internet connection. This approach also avoids the queuing delays often associated with cloud services. Running LLMs on the edge therefore not only improves the user experience but also addresses privacy concerns: sensitive data stays on the device, which reduces the risk of breaches.
Attractive as it sounds, deploying LLMs on the edge presents significant challenges. Edge devices operate under tight power constraints, which sets them apart from workstations and cloud servers. This translates to restricted memory bandwidth and limited peak compute throughput. Compounding the challenge, edge devices also have limited memory capacity. For instance, the NVIDIA Jetson Orin Nano, with its 8GB of DRAM, cannot accommodate even the smallest LLaMA-2 model in half precision (7B parameters, roughly 14GB of weights). TinyChat offers a push-button solution for weight quantization, enabling LLM inference on edge devices with constrained memory.
To understand the opportunity in edge LLM deployment and acceleration, we start with an analysis of workloads and bottlenecks. Edge LLM applications demand minimal latency, so a batch size of 1 is preferable to batched inference. This also matches real-world usage: a laptop is typically used by one person at a time, and only one person interacts with a car's infotainment system at any given moment. We therefore assume batch size = 1 for the rest of the discussion.
The inference process of the LLM can be divided into two stages: summarization and generation. During the summarization stage, the model applies self-attention across all prompt tokens. Thus, the computation pattern is primarily characterized by compute-bound matrix-matrix multiplications. In the generation stage, due to the existence of key-value (KV) caches, the query of each layer consists of a single token. This results in a computation pattern primarily dominated by memory-bound matrix-vector multiplications. In typical LLM workloads with 20-200 prompt tokens, the runtime for the entire summarization stage is even shorter than the time needed to generate a single token. As a consequence, the inference time of edge LLMs is bottlenecked by the generation stage, which is bounded by memory bandwidth.
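To make the compute-bound versus memory-bound distinction concrete, here is a rough back-of-the-envelope sketch of the arithmetic intensity (FLOPs per byte moved) of a single C x C linear layer in FP16. The function name, the layer shape and the token counts are illustrative assumptions, not measurements from TinyChat.

```python
def linear_layer_intensity(num_tokens: int, channels: int, bytes_per_value: int = 2) -> float:
    """FLOPs per byte of memory traffic for one C x C linear layer (hypothetical helper)."""
    flops = 2 * num_tokens * channels * channels              # one multiply + one add per weight per token
    weight_bytes = channels * channels * bytes_per_value      # weights are read once
    act_bytes = 2 * num_tokens * channels * bytes_per_value   # input read + output written
    return flops / (weight_bytes + act_bytes)

# Assumed sizes: 200 prompt tokens for the first stage, 1 token per step during generation.
prefill_intensity = linear_layer_intensity(num_tokens=200, channels=4096)
decode_intensity = linear_layer_intensity(num_tokens=1, channels=4096)
print(f"summarization: {prefill_intensity:.1f} FLOPs/byte, generation: {decode_intensity:.2f} FLOPs/byte")
```

With these assumed sizes, the prompt-processing stage performs well over a hundred FLOPs per byte while the generation stage performs about one, which is why the latter is limited by memory bandwidth rather than compute.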
Moreover, the cost of loading LLM weights on the edge surpasses that of loading activations. This is because for activations, the memory loading footprint scales with O(SC), where S represents the number of historical tokens and C is the number of channels. By comparison, the memory footprint for model weights follows O(C^2). In edge LLMs, the value of S tends to be in the hundreds, whereas C is an order of magnitude greater. Consequently, the memory bandwidth requirement is dominated by weight loading.
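The scaling argument above can be checked with a small calculation. This is only a sketch: the factor of 12 weight matrices per layer is an assumed rough count for a standard transformer block (attention plus a 4x-wide MLP), and S = 512, C = 4096 are illustrative values.

```python
def per_layer_bytes(history_tokens: int, channels: int,
                    weight_mats: int = 12, bytes_per_value: int = 2):
    """Bytes loaded per generated token for one layer (hypothetical accounting)."""
    kv_cache = 2 * history_tokens * channels * bytes_per_value      # keys + values: O(SC)
    weights = weight_mats * channels * channels * bytes_per_value   # assumed 12 C x C matrices: O(C^2)
    return kv_cache, weights

kv_bytes, weight_bytes = per_layer_bytes(history_tokens=512, channels=4096)
ratio = weight_bytes / kv_bytes
print(f"weights / activations = {ratio:.0f}x")
```

Under these assumptions, weight traffic exceeds activation traffic by a factor of 6C/S, so the gap only narrows once the history grows to many thousands of tokens.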
Accelerating LLM Inference on the Edge via SmoothQuant and AWQ
According to the roofline model, accelerating memory-bound workloads necessitates minimizing the data transfer between off-chip and on-chip memory. Since weight loading dominates the memory bandwidth, weight-only quantization can significantly reduce the amount of data transfer required. For example, quantizing model weights from FP16 to INT4 will bring about a nearly 4x reduction in DRAM traffic for weight loading. This quantization yields a theoretical potential for up to 4x acceleration.
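As a rough illustration of the roofline argument, the sketch below bounds generation speed by assuming every weight is read from DRAM once per generated token. The 7B parameter count and the 200 GB/s bandwidth are illustrative inputs, not measured values, and the helper name is hypothetical.

```python
def max_tokens_per_second(num_params: float, bits_per_weight: int,
                          bandwidth_bytes_per_s: float) -> float:
    """Upper bound on generation speed when weight loading is the only cost."""
    bytes_per_token = num_params * bits_per_weight / 8   # all weights read once per token
    return bandwidth_bytes_per_s / bytes_per_token

BANDWIDTH = 200e9   # assumed DRAM bandwidth in bytes/s (illustrative)
PARAMS = 7e9        # assumed model size

fp16_bound = max_tokens_per_second(PARAMS, 16, BANDWIDTH)
int4_bound = max_tokens_per_second(PARAMS, 4, BANDWIDTH)
print(f"FP16 bound: {fp16_bound:.1f} tok/s, INT4 bound: {int4_bound:.1f} tok/s")
```

The bound ignores activations, KV caches and quantization metadata, so real speedups land somewhat below the ideal 4x.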
Quantizing model weights to a 4-bit precision level without compromising performance is non-trivial. Thankfully, we have introduced a solution known as activation-aware weight-only quantization (AWQ) to effectively address this challenge. AWQ distinguishes crucial model weights by analyzing activation magnitude as opposed to the weights themselves. It subsequently implements scaling and clipping mechanisms to safeguard these pivotal weights, effectively mitigating quantization errors. Furthermore, AWQ does not require explicit optimization on calibration data, and thus generalizes better to vision-language models (VLMs) and instruction-tuned LLMs such as chatbots and personal assistants. It also helps to maintain accuracy on datasets with different distributions.
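The toy example below illustrates only the scaling idea: scaling up a weight whose input channel sees large activations before rounding, then undoing the scale, shrinks the error on that weight. It is a minimal sketch under strong assumptions, not the AWQ algorithm itself. The exponent `alpha` is fixed by hand rather than searched, clipping is omitted, and all function names and numbers are hypothetical.

```python
def quantize_group(weights, n_bits=4):
    """Asymmetric round-to-nearest quantization of one group; returns dequantized values."""
    w_min, w_max = min(weights), max(weights)
    step = (w_max - w_min) / (2 ** n_bits - 1) or 1.0   # guard against a constant group
    codes = [round((w - w_min) / step) for w in weights]
    return [w_min + q * step for q in codes]

def quantize_with_scaling(weights, act_magnitude, alpha):
    """Scale each input channel by act_magnitude**alpha, quantize, then undo the scale."""
    scales = [a ** alpha for a in act_magnitude]
    dequant = quantize_group([w * s for w, s in zip(weights, scales)])
    return [d / s for d, s in zip(dequant, scales)]

def weighted_error(weights, approx, act_magnitude):
    """Weight error weighted by how large each channel's activations are."""
    return sum(a * abs(w - v) for w, v, a in zip(weights, approx, act_magnitude))

weights = [1.0, -1.0, 0.13, 0.5]      # one output channel, four input channels
act_magnitude = [1.0, 1.0, 4.0, 1.0]  # channel 2 carries the largest activations

plain_err = weighted_error(weights, quantize_with_scaling(weights, act_magnitude, 0.0), act_magnitude)
scaled_err = weighted_error(weights, quantize_with_scaling(weights, act_magnitude, 0.5), act_magnitude)
print(f"plain: {plain_err:.4f}, activation-aware scaling: {scaled_err:.4f}")
```

In this hand-picked example the scaled weight stays inside the group's original range, so the quantization step is unchanged and the error on the salient channel is roughly halved. In general a scale that is too large widens the range and hurts the other weights, which is why the scale has to be chosen carefully.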
Furthermore, we introduce SmoothQuant, an efficient approach for quantizing both weights and activations in LLMs. Quantizing activations in LLMs poses a challenge due to the presence of outliers. To address this, we strategically shift the quantization complexity from activations to weights through a smoothing process. This innovative approach renders both weight and activation quantization more feasible and efficient. Through the synergy of AWQ and SmoothQuant, our solution empowers efficient LLM inference on laptop CPUs with a combination of 4-bit weights and 8-bit activations.
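The smoothing step can be sketched in a few lines. The per-channel factor used here, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), follows the SmoothQuant formulation; the tiny matrices, the choice alpha = 0.5 and the helper names are illustrative assumptions.

```python
def smooth(x, w, alpha=0.5):
    """Divide activations and multiply weights by a per-input-channel factor s_j."""
    n_ch = len(w)                                                   # w has shape [C_in][C_out]
    x_max = [max(abs(row[j]) for row in x) for j in range(n_ch)]    # per-channel activation range
    w_max = [max(abs(v) for v in w[j]) for j in range(n_ch)]        # per-channel weight range
    s = [x_max[j] ** alpha / w_max[j] ** (1 - alpha) for j in range(n_ch)]
    x_smooth = [[row[j] / s[j] for j in range(n_ch)] for row in x]
    w_smooth = [[v * s[j] for v in w[j]] for j in range(n_ch)]
    return x_smooth, w_smooth

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

x = [[1.0, 16.0], [-1.0, -8.0]]    # channel 1 is an activation outlier
w = [[1.0, 0.5], [0.25, -1.0]]

x_smooth, w_smooth = smooth(x, w)
original_out = matmul(x, w)
smoothed_out = matmul(x_smooth, w_smooth)
x_peak = max(abs(v) for row in x_smooth for v in row)
print(original_out, smoothed_out, x_peak)
```

The layer output is unchanged, while the largest activation magnitude drops from 16 to 4 and the weights absorb the difference. That flatter activation range is what makes 8-bit activation quantization practical.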
System Support for AWQ
Nevertheless, weight compression does not directly translate into measured speedup. In weight-quantized linear layers, integer weights must be multiplied with floating-point activations. This mixed multiplication is absent from the GPU instruction set architecture (ISA), and it is not supported by established vendor libraries such as cuDNN and cuBLAS. A straightforward alternative is to dequantize the weights into half precision at runtime and store them in a temporary DRAM buffer, which makes it possible to reuse the half-precision primitives in vendor libraries. Unfortunately, this approach creates >2x DRAM traffic for weight loading and storing, which in theory translates to a >2x slowdown for memory-bound workloads.
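One way to account for the extra traffic, per weight, is sketched below. The breakdown (read INT4, write FP16, read FP16 again) is an assumption about how the naive path behaves, and quantization metadata is ignored.

```python
def naive_dequant_bytes_per_weight(weight_bits: int = 4, fp_bits: int = 16) -> float:
    """DRAM bytes moved per weight when dequantized weights round-trip through DRAM."""
    read_quantized = weight_bits / 8   # load the INT4 weight
    write_fp16 = fp_bits / 8           # store the dequantized copy to a temporary buffer
    read_fp16 = fp_bits / 8            # the FP16 GEMM kernel loads it again
    return read_quantized + write_fp16 + read_fp16

fp16_baseline = 16 / 8                 # plain FP16 inference reads each weight once
naive_traffic = naive_dequant_bytes_per_weight()
traffic_ratio = naive_traffic / fp16_baseline
print(f"naive dequantization moves {traffic_ratio:.2f}x the bytes of the FP16 baseline")
```

Under this accounting the naive path moves 4.5 bytes per weight against 2 bytes for plain FP16, which is consistent with the >2x figure above.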
Therefore, in conjunction with the powerful AWQ algorithm, strong system support becomes indispensable. To this end, we present TinyChat, an efficient and lightweight chatbot for large language models on the edge. TinyChat harnesses the capabilities of TinyChatEngine, an efficient library offering optimized CPU and GPU implementations for quantized linear layers, attention layers and non-linear layers (such as normalization and positional encoding).
Rather than storing dequantized weights in DRAM during runtime, our TinyChatEngine adopts a fundamentally different approach. More precisely, we load integer weights into the on-chip SRAM and perform dequantization directly within the SRAM. Consequently, the DRAM traffic related to weight access sees a reduction of approximately 4x (considering zero points and scaling factors). In practice, this reduction translates to a measured 4x single-layer speedup over the FP16 implementation from cuBLAS, as observed in Meta's LLaMA-30B models, running on a single RTX 3090 GPU.
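The idea can be illustrated on the CPU in plain Python: 4-bit codes stay packed two per byte in memory and are expanded to real values only at the point of use. This is a sketch, not TinyChatEngine's actual kernel or memory layout. The packing order, the group size of 128 and the FP16 scale plus 4-bit zero point per group are all assumptions.

```python
def pack_int4(codes):
    """Pack pairs of 4-bit codes (0..15) into single bytes, low nibble first."""
    assert len(codes) % 2 == 0
    return bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))

def unpack_and_dequantize(packed, scale, zero_point):
    """Expand packed codes and apply w = (q - zero_point) * scale on the fly."""
    out = []
    for byte in packed:
        for q in (byte & 0x0F, byte >> 4):
            out.append((q - zero_point) * scale)
    return out

def traffic_reduction(group_size=128, scale_bits=16, zero_bits=4, weight_bits=4):
    """FP16 bits per weight divided by effective INT4 bits, including per-group metadata."""
    effective_bits = weight_bits + (scale_bits + zero_bits) / group_size
    return 16 / effective_bits

packed = pack_int4([0, 15, 8, 3])                      # 4 weights occupy 2 bytes
dequantized = unpack_and_dequantize(packed, scale=0.5, zero_point=8)
reduction = traffic_reduction()
print(len(packed), dequantized, f"{reduction:.2f}x")
```

With the assumed group size, the per-group scale and zero point add about 0.16 bits per weight, which is why the reduction is approximately rather than exactly 4x.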
TinyChat was benchmarked across multiple GPU platforms, including A6000 (server-class GPU), 4090 (desktop GPU), 4070 (laptop GPU) and Orin (edge GPU). For our FP16 baseline, we adopted the default implementation provided by Hugging Face. The INT4 models, in contrast, are quantized using AWQ and implemented using TinyChatEngine GPU kernels. We also apply additional optimization techniques in the latest release. For example, we fuse all the operations in MHA/GQA/MQA into a single kernel, and fuse positional embedding kernels into the attention kernel. We also pre-allocate key-value caches to avoid the online memory allocation overhead in the Hugging Face implementation.
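The pre-allocation idea can be sketched as follows. This is a plain-Python illustration of the pattern (reserve the buffer once, then write each new token's keys and values in place), not TinyChat's implementation. The class name and sizes are hypothetical.

```python
class PreallocatedKVCache:
    """Fixed-capacity key/value buffer that is allocated once and filled in place."""

    def __init__(self, max_tokens: int, head_dim: int):
        self.keys = [[0.0] * head_dim for _ in range(max_tokens)]     # allocated up front
        self.values = [[0.0] * head_dim for _ in range(max_tokens)]
        self.length = 0                                               # tokens written so far

    def append(self, key, value):
        if self.length >= len(self.keys):
            raise IndexError("KV cache is full")
        self.keys[self.length][:] = key        # overwrite the reserved slot, no new buffer
        self.values[self.length][:] = value
        self.length += 1

    def view(self):
        """Return only the filled part of the cache."""
        return self.keys[:self.length], self.values[:self.length]

cache = PreallocatedKVCache(max_tokens=4, head_dim=2)
buffer_id = id(cache.keys)
cache.append([1.0, 2.0], [3.0, 4.0])
cache.append([5.0, 6.0], [7.0, 8.0])
keys, values = cache.view()
print(cache.length, keys, values)
```

The alternative, growing the cache by concatenation at every step, allocates and copies a new buffer per generated token, which is the overhead pre-allocation avoids.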
Importantly, TinyChat demonstrates its capability on the NVIDIA Jetson Orin—an energy-efficient edge GPU platform with remarkably compact size. It runs Meta’s latest LLaMA-2-7B model at 30 tokens / second, and seamlessly supports the larger LLaMA-2-13B model at an interactive pace of 17 tokens / second. TinyChat can also be deployed on laptop / desktop GPUs with zero code change. On A6000, TinyChat achieves up to 3.2x speedup for LLaMA-2-chat models, up to 3.3x speedup for Vicuna models and 2.9-3.4x speedup for MPT and Falcon models. All speedup numbers are benchmarked against the FP16 baseline. On the consumer-level 4090, we achieve similar 3.3x, 3.6x, 2.8x and 3.7x speedup for LLaMA-2, Vicuna, MPT and Falcon, respectively.
This blog post delves into the underlying logic driving the creation of TinyChat, aimed at optimizing LLM inference efficiency for edge computing. Applications of LLMs at the edge necessitate low latency, giving rise to a memory-bound workload pattern. Four-bit weight quantization effectively reduces the DRAM traffic by approximately 4x, and our activation-aware weight-only quantization (AWQ) algorithm contributes to bridging the accuracy gap. TinyChatEngine further provides efficient implementations of weight-quantized kernels, which are not supported by vendor libraries. All these innovations open the door for efficient and personalized LLMs on the edge.