VILA is a visual language model (VLM) pre-trained on interleaved image-text data at scale, enabling multi-image reasoning. VILA is deployable on the edge.
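For intuition, here is a minimal sketch of how interleaved image-text data can be packed into a single training sequence. This is an illustration, not VILA's code; the projector, vocabulary size, and dimensions are assumptions:

```python
# Illustrative sketch (assumed names/dims, not VILA's implementation):
# image patch features are projected into the LLM's embedding space and
# spliced between text embeddings, forming one interleaved sequence.
import torch
import torch.nn as nn

embed_dim = 4096
text_embed = nn.Embedding(32000, embed_dim)   # LLM token embeddings (assumed vocab)
project = nn.Linear(1024, embed_dim)          # vision features -> LLM space (assumed)

def pack_interleaved(segments):
    """segments: list of ('text', LongTensor[ids]) or ('image', Tensor[patches, 1024])."""
    parts = []
    for kind, x in segments:
        parts.append(text_embed(x) if kind == "text" else project(x))
    return torch.cat(parts, dim=0)            # one sequence the LLM attends over

seq = pack_interleaved([
    ("text",  torch.randint(0, 32000, (12,))),
    ("image", torch.randn(256, 1024)),        # first image's patch features
    ("text",  torch.randint(0, 32000, (20,))),
    ("image", torch.randn(256, 1024)),        # a second image in the same context
])
print(seq.shape)  # [12 + 256 + 20 + 256, 4096]
```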
Low-bit weight-only quantization for LLMs: compressing model weights to a few bits reduces memory footprint and accelerates inference.
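As a rough illustration of weight-only quantization, here is a generic group-wise 4-bit scheme (a minimal sketch, not necessarily this project's exact algorithm): each group of weights along the input dimension gets its own scale and zero point.

```python
# Minimal sketch of group-wise 4-bit weight-only quantization (illustrative).
import torch

def quantize_weight_int4(w: torch.Tensor, group_size: int = 128):
    """Quantize a [out, in] weight matrix to 4-bit integers, group-wise."""
    out_dim, in_dim = w.shape
    assert in_dim % group_size == 0
    g = w.reshape(out_dim, in_dim // group_size, group_size)
    w_min = g.min(dim=-1, keepdim=True).values
    w_max = g.max(dim=-1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0   # 4 bits -> 16 levels
    zero = (-w_min / scale).round()
    q = (g / scale + zero).round().clamp(0, 15).to(torch.uint8)
    return q, scale, zero

def dequantize_int4(q, scale, zero, shape):
    return ((q.float() - zero) * scale).reshape(shape)

w = torch.randn(4096, 4096)
q, s, z = quantize_weight_int4(w)
w_hat = dequantize_int4(q, s, z, w.shape)
print((w - w_hat).abs().mean())  # small reconstruction error
```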
We enable LLMs to work on infinite-length text streams without compromising efficiency or performance.
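One way such streaming inference can be realized is with a bounded KV cache; the sketch below assumes the cache keeps a few initial "sink" tokens plus a sliding window of recent tokens, so memory stays constant however long the stream grows (an illustration of the idea, not the project's code):

```python
# Minimal sketch of a bounded KV cache for streaming generation (assumed
# mechanism: retain a few initial "sink" tokens + a recent-token window).
import torch

class StreamingKVCache:
    def __init__(self, num_sink: int = 4, window: int = 1024):
        self.num_sink = num_sink
        self.window = window
        self.k = None  # [batch, heads, seq, dim]
        self.v = None

    def append(self, k_new, v_new):
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=2)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=2)
        # Evict the middle tokens once the cache exceeds its budget.
        if self.k.size(2) > self.num_sink + self.window:
            self.k = torch.cat([self.k[:, :, :self.num_sink],
                                self.k[:, :, -self.window:]], dim=2)
            self.v = torch.cat([self.v[:, :, :self.num_sink],
                                self.v[:, :, -self.window:]], dim=2)
        return self.k, self.v
```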
LongLoRA takes advantage of shifted sparse attention to greatly reduce the fine-tuning cost of long-context LLMs.
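A minimal sketch of the shifted-sparse-attention idea: tokens attend only within fixed-size groups, and half the heads operate on a sequence shifted by half a group so information crosses group boundaries. Shapes, names, and the omitted causal mask are simplifications for illustration, not LongLoRA's actual implementation:

```python
# Illustrative sketch of shifted sparse attention (assumes an even head count
# and seq divisible by the group size; causal masking omitted for brevity).
import torch
import torch.nn.functional as F

def shifted_sparse_attention(q, k, v, group: int):
    """q, k, v: [batch, heads, seq, dim]."""
    b, h, s, d = q.shape
    half = h // 2

    def roll(x, shift):
        # Shift only the second half of the heads along the sequence axis.
        return torch.cat([x[:, :half],
                          x[:, half:].roll(shifts=shift, dims=2)], dim=1)

    q, k, v = roll(q, -group // 2), roll(k, -group // 2), roll(v, -group // 2)

    def fold(x):
        # Fold groups into the batch dim: attention stays local to each group.
        return x.reshape(b, h, s // group, group, d).transpose(1, 2) \
                .reshape(b * (s // group), h, group, d)

    out = F.scaled_dot_product_attention(fold(q), fold(k), fold(v))
    out = out.reshape(b, s // group, h, group, d).transpose(1, 2) \
             .reshape(b, h, s, d)
    return roll(out, group // 2)  # undo the shift for the shifted heads
```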
This project introduces PockEngine: a tiny, sparse, and efficient engine for fine-tuning on various edge devices. PockEngine supports sparse backpropagation: it prunes the backward graph and sparsely updates the model, yielding measured memory savings and latency reductions while maintaining model quality.
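A minimal PyTorch sketch of the sparse-update idea: gradients are computed and applied only for a chosen subset of parameters, so weight-gradient computation is skipped for frozen layers and updates are sparse. The layer-selection policy below is a placeholder, not PockEngine's actual one:

```python
# Illustrative sparse backpropagation: freeze most parameters, train a subset.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(),
                      nn.Linear(512, 512), nn.ReLU(),
                      nn.Linear(512, 10))

# Freeze everything, then re-enable only the last layer and all biases
# (a placeholder policy for illustration).
for p in model.parameters():
    p.requires_grad = False
for p in model[-1].parameters():
    p.requires_grad = True
for name, p in model.named_parameters():
    if name.endswith("bias"):
        p.requires_grad = True

# The optimizer only sees trainable parameters -> sparse updates.
opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=1e-2)

x, y = torch.randn(32, 512), torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()  # no weight gradients are computed for the frozen layers
opt.step()
```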
Running large language models (LLMs) on the edge is of great importance. In this blog, we introduce TinyChat, an efficient and lightweight system for LLM deployment on the edge. It runs Meta's latest LLaMA-2 model at 30 tokens/second on the NVIDIA Jetson Orin and easily supports different models and hardware.