
TinyChat: Visual Language Models & Edge AI 2.0

tl;dr

Explore the latest advancement in TinyChat and AWQ – the integration of Visual Language Models (VLMs) on the edge! Recent advances in VLMs allow LLMs to comprehend visual inputs, enabling image understanding tasks like caption generation, question answering, and more. With the latest release, TinyChat supports leading VLMs such as VILA, which can be easily quantized with AWQ, giving users a seamless experience for image understanding tasks.


TinyChat and AWQ now support running VLMs (e.g. VILA) efficiently on the edge

Edge AI 1.0: The Limitations

Early iterations of edge AI revolved around deploying compressed AI models onto edge devices. This phase, known as Edge AI 1.0, focused on task-specific models. The challenge with this approach lay in the need to train different models with different datasets, where 'negative samples' are hard to collect and outlier situations are difficult to handle. This process was time-consuming and highlighted the need for more adaptable AI solutions with better generalization.

Edge AI 2.0: The Rise of Generalization

Edge AI 2.0 marks a shift towards increased adaptability, powered by foundational Visual Language Models (VLMs). VLMs demonstrate incredible versatility, understanding complex instructions and swiftly adapting to new scenarios. This flexibility positions them as vital tools in a wide array of applications. They can optimize decision-making in self-driving vehicles, create personalized interactions within IoT/AIoT environments, and enhance smart home experiences. The core strength of VLMs lies in their 'world knowledge' acquired during language pre-training.

The Solution for VLM on the Edge: AWQ + TinyChat

The need for real-time processing and privacy protection makes deploying VLMs on edge devices essential. Unlike text, images can contain highly sensitive personal data, making cloud-based solutions a privacy risk. Edge-based VLMs improve responsiveness and efficiency, vital for quick decision-making in smart environments and autonomous systems. However, their large size makes them ill-suited for the resource constraints of edge devices. Model compression, specifically the quantization of model weights, is crucial for successful deployment. Yet, standard 4-bit quantization can drastically degrade the accuracy of large models. To address this, we developed AWQ (Activation-aware Weight Quantization, MLSys'24). This technique innovatively uses activation-aware scaling factors to safeguard important weights during quantization. AWQ empowers us to quantize VLMs to 4-bit precision with negligible accuracy loss, paving the way for VLMs to transform edge computing while upholding performance standards.
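To make the idea concrete, below is a simplified, PyTorch-style sketch of activation-aware scaling followed by simulated 4-bit quantization. It is an illustration only: the function name, the fixed alpha exponent, and the normalization details are assumptions, and the released AWQ implementation additionally searches for the best per-layer scaling factors and runs real INT4 kernels at deployment time.

```python
import torch

def awq_style_scale_and_quantize(weight, act_sample, n_bits=4, alpha=0.5):
    """
    Minimal sketch of activation-aware scaling before weight quantization.
    weight:     [out_features, in_features] FP16/FP32 weight matrix
    act_sample: [num_tokens, in_features] calibration activations
    Returns a fake-quantized weight (dequantized back to float) for illustration.
    """
    # 1. Estimate per-input-channel importance from activation magnitude.
    act_scale = act_sample.abs().mean(dim=0)                  # [in_features]
    # 2. Derive per-channel scaling factors (alpha controls the strength).
    s = act_scale.clamp(min=1e-5).pow(alpha)
    s = s / (s.max() * s.min()).sqrt()                        # normalize around 1
    # 3. Scale up the important weight channels, then quantize per output channel.
    w_scaled = weight * s                                     # broadcast over input channels
    q_max = 2 ** (n_bits - 1) - 1
    w_absmax = w_scaled.abs().amax(dim=1, keepdim=True).clamp(min=1e-5)
    q_scale = w_absmax / q_max
    w_q = (w_scaled / q_scale).round().clamp(-q_max - 1, q_max)
    # 4. Dequantize and undo the channel scaling; in practice the inverse scale
    #    is folded into the preceding operator so inference cost is unchanged.
    return (w_q * q_scale) / s
```

The key point the sketch captures is that channels with large activations get larger scaling factors, so their weights are represented more precisely after quantization.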

Despite advancements like AWQ, deploying large language and visual models on edge devices remains a complex task. 4-bit weights lack byte alignment and demand specialized computation for optimal efficiency. TinyChat stands out as a remarkably efficient inference framework designed specifically for LLMs and VLMs on edge devices. TinyChat's adaptable nature allows it to run on various hardware platforms, from NVIDIA RTX 4070 laptop GPUs to NVIDIA Jetson Orin, attracting significant interest from the open-source community. Now, TinyChat expands its reach to support our Visual Language Model, VILA, enabling the vital understanding and reasoning of visual data. TinyChat delivers exceptional efficiency and flexibility in combining textual and visual processing, empowering edge devices to execute cutting-edge, multi-modal tasks.
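To see why 4-bit weights need special handling, the NumPy sketch below packs two signed 4-bit values into each byte and unpacks them again; an INT4 matrix multiply must effectively perform this unpacking on the fly. The layout here is purely illustrative and is not the packing scheme used by TinyChat's CUDA kernels.

```python
import numpy as np

def pack_int4(q):
    """Pack signed 4-bit integers (range [-8, 7]) two per byte.
    q: 1-D int8 array with an even number of elements."""
    u = (q.astype(np.int8) & 0x0F).astype(np.uint8)    # two's-complement nibbles
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed):
    """Recover signed 4-bit values from the packed byte array."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    lo = np.where(lo > 7, lo - 16, lo)                 # sign-extend the nibbles
    hi = np.where(hi > 7, hi - 16, hi)
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return out

# Quick round-trip check
q = np.array([-8, -3, 0, 7], dtype=np.int8)
assert np.array_equal(unpack_int4(pack_int4(q)), q)
```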

Flexible framework supporting multiple platforms

TinyChat offers seamless support for a wide range of edge devices, including desktop GPUs (like the RTX 4090), laptop GPUs (like the RTX 4070), and mobile GPUs (like Jetson Orin). TinyChat is open source and features a fully Python-based runtime, giving users exceptional flexibility for deployment and customization.

VILA-13B (4-bit) running on RTX 4090 with TinyChat
VILA-13B (4-bit) running on Jetson Orin with TinyChat

Multi-image reasoning & In-context learning

TinyChat's newest release leverages VILA's impressive multi-image reasoning capabilities, enabling users to upload multiple images simultaneously for enhanced interactions. This unlocks exciting possibilities – as demonstrated below, VILA can understand the content and order of image sequences, opening new avenues for creative applications.

VILA-13B (4-bit) on multi-image understanding

VILA also demonstrates remarkable in-context learning abilities. Without the need for explicit system prompts, VILA can seamlessly infer patterns from previous image-text pairs to generate relevant text for new image inputs. In the demo video below, VILA successfully recognizes the NVIDIA logo and, mirroring the style of previous examples, outputs NVIDIA's most famous products.

VILA-13B (4-bit) on in-context learning task
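For readers curious how such a few-shot prompt can be assembled, here is a hypothetical sketch. The `<image>` placeholder convention, the file names, and the `build_prompt` helper are all assumptions for illustration; VILA's actual chat template and TinyChat's prompt handling may differ.

```python
# Hypothetical few-shot, image-text prompt for in-context learning.
few_shot_examples = [
    ("apple_logo.png", "Apple, famous for the iPhone and the Mac."),
    ("starbucks_logo.png", "Starbucks, famous for its coffee."),
]
query_image = "nvidia_logo.png"

def build_prompt(examples):
    """Interleave an <image> placeholder with each caption, then leave a final
    <image> slot for the query image that the model should describe."""
    parts = ["<image>\n" + caption for _, caption in examples]
    parts.append("<image>\n")   # the model completes the text for this image
    return "\n".join(parts)

prompt = build_prompt(few_shot_examples)
images = [path for path, _ in few_shot_examples] + [query_image]
# The (prompt, images) pair is then handed to the VLM runtime for generation.
```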

TinyChat with Gradio UI

Alongside the release of TinyChat and VILA, we've developed a user-friendly Gradio UI, making it incredibly easy to interact with the VILA model. Simply upload your images and get instant feedback from VILA – regardless of your own device's computational power. The Gradio UI offers multiple interaction modes, empowering you to explore the full range of VILA's features, including multi-image reasoning, in-context learning, chain-of-thought, and more!

The Gradio UI can also be deployed on your own device, such as a laptop with an NVIDIA RTX 4070 GPU, making visual language models more accessible!
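A minimal sketch of such a Gradio front end is shown below. The `answer_fn` placeholder stands in for the actual AWQ-quantized VILA backend, and the widget layout is an assumption; the UI shipped with TinyChat is richer (multi-image upload, chat history, and more).

```python
import gradio as gr

# Minimal sketch of a Gradio front end for image question answering.
# answer_fn is a stand-in: in the real demo this call would dispatch to the
# AWQ-quantized VILA backend instead of returning a placeholder string.
def answer_fn(image, question):
    if image is None:
        return "Please upload an image first."
    # model inference would happen here
    return f"(model response to: {question!r})"

demo = gr.Interface(
    fn=answer_fn,
    inputs=[gr.Image(type="pil", label="Image"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
    title="VILA (4-bit) demo",
)

if __name__ == "__main__":
    # Pass share=True to launch() to expose a temporary public link.
    demo.launch()
```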

Quantization Evaluation

Thanks to AWQ, we are able to quantize VILA/LLaVA models to INT4 and deploy them onto edge devices efficiently with TinyChat. We also evaluated AWQ's performance on quantizing visual language models, and the results confirm that AWQ handles the VILA models well, preserving accuracy while boosting efficiency.

Inference Speed

We also evaluated TinyChat's inference speed for Visual Language Models (VILA). Compared to the FP16 baseline, TinyChat achieves a 3x speedup on edge devices (measured in tokens/sec).

TinyChat is efficient for VLM inference. Speed numbers are measured in tokens/sec
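For reference, decoding throughput in tokens/sec is typically computed as generated tokens divided by wall-clock time, as in the sketch below. This is not the benchmark script behind the numbers above; `generate_fn` is a placeholder for whichever backend (FP16 or 4-bit TinyChat) is being timed.

```python
import time

def tokens_per_second(generate_fn, prompt, max_new_tokens=128, warmup=1, iters=3):
    """
    Rough decoding throughput in tokens/sec.
    generate_fn(prompt, max_new_tokens) is a placeholder that runs one
    generation on the backend being timed and returns the number of tokens
    it actually produced.
    """
    for _ in range(warmup):                      # warm up kernels and caches
        generate_fn(prompt, max_new_tokens)
    start = time.perf_counter()
    total_tokens = 0
    for _ in range(iters):
        total_tokens += generate_fn(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed
```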

Conclusion

We're thrilled to announce a groundbreaking new feature for TinyChat: comprehensive support for Visual Language Models (VLMs). TinyChat, your one-stop shop for efficient on-device VLM quantization and deployment, remains remarkably flexible and open-source under the permissive MIT license. This grants users complete control to customize deployments for their specific requirements. Further democratizing VLM use, we've seamlessly integrated the user-friendly Gradio UI within TinyChat. The powerful combination of AWQ and TinyChat unlocks a new era of Edge AI 2.0 applications, empowering users to leverage the power of VLMs for innovative creations.