Explore the latest advancement in TinyChat: version 2.0 brings significant improvements to the prefilling speed of edge LLMs and VLMs. On top of the 3-4x decoding speedups achieved with AWQ quantization, TinyChat 2.0 now delivers state-of-the-art Time-To-First-Token (TTFT), 1.5-1.7x faster than the legacy version of TinyChat.
In AWQ and TinyChat, we apply W4A16 quantization to Large Language Models (LLMs) to improve the arithmetic intensity and theoretical throughput of GEMMs. During the context (prefilling) stage, however, the arithmetic intensity is already sufficiently high, so weight-only quantization brings no further performance gains. This becomes particularly critical as long-context inputs have emerged as a fundamental requirement in modern AI applications, especially in Visual Language Models (VLMs). Contemporary VLMs tackle increasingly complex tasks such as detailed image captioning, visual question answering, and multi-frame video understanding, where preserving long-range contextual information is paramount for high-quality, contextually accurate outputs. For instance, the following image shows the breakdown of the context stage and decoding stage in a real-world robotics application, where the next action is predicted from an input of 8 images, demonstrating the importance of optimizing the prefilling speed of edge LLMs. In TinyChat 2.0, we aim to optimize the prefilling speed of LLMs/VLMs to serve these latency-sensitive long-context applications.
To optimize the prefilling speed, the simplest approach is to keep the model in FP16, which eliminates dequantization overhead during the context stage. Alternatively, one could quantize both weights and activations uniformly (e.g., W8A8) to exploit low-bit tensor cores for GEMM. However, on edge devices both approaches face two critical challenges: 1) substantial memory consumption, and 2) increased decoding latency. To deploy LLMs and VLMs effectively on memory- and compute-constrained edge devices, weight-only quantization (e.g., W4A16) therefore remains the optimal strategy, striking a superior balance between model accuracy, efficiency, and memory footprint.
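As a rough, back-of-the-envelope illustration of this trade-off (the model size, token counts, and hardware numbers below are illustrative assumptions, not measurements), the following sketch estimates the weight footprint of an 8B-parameter model at each precision and the arithmetic intensity of its linear layers during decoding versus prefilling:

```python
# Back-of-the-envelope numbers behind the precision trade-off. Everything here is
# a rough estimate (weights only, no KV cache or activations), not a measurement.

def weight_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB at a given precision."""
    return num_params * bits_per_weight / 8 / 2**30

def weight_intensity(num_tokens: int, bits_per_weight: int) -> float:
    """FLOPs per byte of weight traffic for a K x N linear layer.
    FLOPs ~= 2 * num_tokens * K * N and bytes ~= (bits/8) * K * N, so K, N cancel."""
    return 2 * num_tokens / (bits_per_weight / 8)

for name, bits in [("FP16 ", 16), ("W8A8 ", 8), ("W4A16", 4)]:
    print(f"{name}: ~{weight_gib(8e9, bits):4.1f} GiB of weights | "
          f"decode intensity ~{weight_intensity(1, bits):.0f} FLOP/byte | "
          f"2K-token prefill ~{weight_intensity(2048, bits):.0f} FLOP/byte")

# Approximate output for an 8B-parameter model:
#   FP16 : ~14.9 GiB | decode ~1 FLOP/byte | 2K-token prefill ~2048 FLOP/byte
#   W8A8 : ~ 7.5 GiB | decode ~2 FLOP/byte | 2K-token prefill ~4096 FLOP/byte
#   W4A16: ~ 3.7 GiB | decode ~4 FLOP/byte | 2K-token prefill ~8192 FLOP/byte
# Decoding sits far below the ~165 FLOP/byte ridge point of an RTX 4090-class GPU
# (165.2 TFLOPS / ~1 TB/s), so shrinking weights to 4 bits directly cuts the
# dominant memory traffic; a long prefill is compute-bound at any precision,
# which is why weight-only quantization no longer helps the context stage.
```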
Therefore, we continue to optimize the W4A16 GEMM for the context stage in TinyChat 2.0. Our analysis reveals that for the GEMMs in LLMs, the risk of overflow with FP16 accumulation is low. Consider the NVIDIA RTX 4090 as an example: the peak tensor-core throughput is 165.2 TFLOPS with FP32 accumulation but 330.3 TFLOPS with FP16 accumulation. Switching the context-stage matrix multiplications to FP16 accumulation therefore better exploits the capability of modern architectures. We also observe that in long-context scenarios, attention contributes significantly to end-to-end latency, especially as sequence lengths grow, so we have integrated the latest FlashAttention implementation into TinyChat for more efficient attention computation.
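TinyChat's prefilling path is implemented in custom CUDA kernels; the sketch below only illustrates the two ideas with public PyTorch APIs, where `allow_fp16_reduced_precision_reduction` is an analogous cuBLAS-level knob and `flash_attn_func` comes from the open-source flash-attn package (the shapes are arbitrary assumptions):

```python
# A PyTorch-level sketch of the two prefilling optimizations above. TinyChat's
# real kernels are custom CUDA; this only illustrates the ideas with public APIs.
import torch
from flash_attn import flash_attn_func  # pip install flash-attn

# 1) FP16 accumulation: let FP16 GEMMs use reduced-precision (FP16) reductions,
#    the cuBLAS-level analogue of accumulating in FP16 inside the tensor cores.
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = True

# Arbitrary long-context shapes: batch, sequence length, heads, head dimension.
B, S, H, D = 1, 4096, 32, 128
q = torch.randn(B, S, H, D, dtype=torch.float16, device="cuda")
k = torch.randn(B, S, H, D, dtype=torch.float16, device="cuda")
v = torch.randn(B, S, H, D, dtype=torch.float16, device="cuda")

# 2) FlashAttention: fused, memory-efficient causal attention for long prefills.
out = flash_attn_func(q, k, v, causal=True)  # (B, S, H, D)
print(out.shape)
```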
With the above optimizations, TinyChat 2.0 now achieves state-of-the-art prefilling speed on edge devices without compromising decoding performance.
For example, on RTX 4090, we benchmark the Time-To-First-Token (TTFT) of Llama-3-8B and Llama-2-7B models:
We also compare TinyChat 2.0 against other LLM inference systems. The results show that TinyChat now achieves lower TTFT than the existing baseline systems across different input sequence lengths.
In TinyChat 2.0, we introduce a chunk-prefilling optimization for multi-round dialogues. For multi-turn inputs, TinyChat reuses the KV cache from previous conversation rounds instead of recomputing it. This eliminates redundant computation and significantly reduces the Time-To-First-Token (TTFT) for subsequent interaction rounds. We hope this feature will benefit emerging edge LLM/VLM applications such as robotics, where the input visual contents are not fed into the model all at once.
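TinyChat implements this inside its own runtime; as a minimal sketch of the same idea with the Hugging Face transformers API (the checkpoint name and prompts are placeholders, and this is not TinyChat's actual code path), the history can be prefilled once and its KV cache passed to later rounds:

```python
# A minimal sketch of KV-cache reuse across dialogue rounds using Hugging Face
# transformers (placeholder checkpoint and prompts), not TinyChat's own runtime.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed available locally
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda()

# Round 1: prefill the conversation history once and keep its KV cache.
history_ids = tok("System: You are a helpful robot.\nUser: Describe frame 1.",
                  return_tensors="pt").input_ids.cuda()
with torch.no_grad():
    out = model(input_ids=history_ids, use_cache=True)
past = out.past_key_values  # cached keys/values for every history token

# Round 2: only the new query is prefilled; the history KV cache is reused.
query_ids = tok("\nUser: Now describe frame 2.", return_tensors="pt").input_ids.cuda()
with torch.no_grad():
    out = model(input_ids=query_ids, past_key_values=past, use_cache=True)
first_token = out.logits[:, -1].argmax(dim=-1)  # first generated token of this round
```

With this pattern, the second round only pays for the new query tokens, which is precisely the saving that Context Streaming targets.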
To evaluate Context Streaming, we measure the TTFT in multi-round conversations with a fixed question length of 32 tokens and history lengths varying from 16 to 1024 tokens. Specifically, in TinyChat 2.0 all history tokens are already prefilled into the existing KV cache when the current query is processed, while baseline systems recompute the history tokens for every query.
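For reference, TTFT in such a comparison can be measured with a small timing harness like the hypothetical `measure_ttft` below, where `prefill_fn` stands for either path (reusing the cached history as in the sketch above, or recomputing history plus query from scratch); this harness is our illustration, not TinyChat's benchmarking code:

```python
# A hypothetical TTFT timing harness (not TinyChat's benchmarking code).
# `prefill_fn` runs one context stage and returns logits, for either path.
import time
import torch

def measure_ttft(prefill_fn, warmup: int = 3, iters: int = 10) -> float:
    """Average wall-clock time in milliseconds until the first token is available."""
    for _ in range(warmup):          # warm up kernels and the memory allocator
        prefill_fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        logits = prefill_fn()
        _ = logits[:, -1].argmax(dim=-1)  # greedy-pick the first output token
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3
```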
On RTX 4090, we evaluate the effectiveness of Context Streaming with Llama-3-8B. The results are as follows:
In TinyChat 2.0, we also support NVILA, a recent family of open VLMs with state-of-the-art capabilities. TinyChat 2.0 serves as a specialized inference engine for NVILA with extensive quantization support. Specifically, in the compute-bound prefilling stage, NVILA applies token compression to reduce the inference workload of the LLM backbone, after which the vision tower becomes the primary bottleneck, accounting for over 90% of the prefilling latency. To tackle this, we apply W8A8 quantization to the vision tower to reduce NVILA's Time-To-First-Token (TTFT) in this compute-bound stage. For the memory-bound decoding stage, we apply W4A16 quantization to the LLM backbone for acceleration, together with the latest advancements in TinyChat 2.0 such as FP16 accumulation. As a result, TinyChat 2.0 achieves a 1.7x overall kernel speedup without compromising accuracy on NVILA models. The detailed results are as follows:
TinyChat 2.0 marks a significant milestone in our full-stack, hardware-friendly deployment solution for LLMs/VLMs on edge devices. Building upon the previous version's 3-4x decoding speedups and 4x memory savings, TinyChat 2.0 further accelerates the prefilling speed of LLMs and VLMs to the state-of-the-art level. With the emergence of advanced applications such as high-resolution image captioning, long-video understanding, and robotic trajectory decision-making, processing long sequences on edge devices becomes increasingly critical. We envision TinyChat 2.0 as a powerful tool to democratize long-context edge AI, and we hope it lays the groundwork for future technological innovations and transformative applications.