Efficient AI Computing,
Transforming the Future.

Projects


QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

arXiv 2024

We introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weights, 8-bit activations, and a 4-bit KV cache, and implement the QServe inference library, which improves the maximum achievable serving throughput of Llama-3-8B by 1.2× on A100 and 1.4× on L40S, and of Qwen1.5-72B by 2.4× on A100 and 3.5× on L40S, surpassing the leading industry solution TensorRT-LLM.
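To illustrate the "W4" part of the scheme, here is a minimal sketch of symmetric per-channel 4-bit weight quantization in NumPy. This is an assumption-laden toy, not the actual QoQ algorithm (which uses progressive group quantization and system co-design); the function names and shapes are hypothetical.

```python
import numpy as np

def quantize_int4(w, axis=1):
    # Symmetric per-output-channel quantization to the signed 4-bit
    # range [-8, 7]; one scale per row of the weight matrix.
    scale = np.max(np.abs(w), axis=axis, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale):
    # Recover an approximation of the original FP32 weights.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s)
# Per-element reconstruction error is bounded by half a quantization step.
err = np.abs(w - w_hat).max()
```

Storing `q` (INT4 packed) instead of FP16 weights is what cuts memory traffic; the real system fuses dequantization into the GPU GEMM kernels.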

DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models

CVPR 2024 (Highlight)

A training-free algorithm to harness multiple GPUs to accelerate diffusion model inference without sacrificing image quality.

VILA: On Pre-training for Visual Language Models

CVPR 2024

VILA is a visual language model (VLM) pre-trained with interleaved image-text data at scale, enabling multi-image reasoning. VILA is deployable on the edge.

Condition-Aware Neural Network for Controlled Image Generation

CVPR 2024

A new conditional control method for diffusion models that dynamically adapts their weights based on the condition.
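The idea of condition-dependent weights can be sketched as a tiny hypernetwork: a generator maps a condition embedding (e.g. a timestep or class embedding) to the weights of a target layer, so the layer's effective weights change with the condition. All names and shapes below are hypothetical; this is only an illustration of the mechanism, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: condition embedding -> weights of one linear layer.
cond_dim, in_dim, out_dim = 16, 32, 32

# Weight generator: a linear map from the condition embedding
# to the flattened weight matrix of the target layer.
W_gen = rng.standard_normal((out_dim * in_dim, cond_dim)) * 0.02

def condition_aware_linear(x, cond):
    # Generate this layer's weights from the condition, then apply them.
    w = (W_gen @ cond).reshape(out_dim, in_dim)
    return w @ x

x = rng.standard_normal(in_dim)
cond_a = rng.standard_normal(cond_dim)  # e.g. one timestep embedding
cond_b = rng.standard_normal(cond_dim)  # a different condition
y_a = condition_aware_linear(x, cond_a)
y_b = condition_aware_linear(x, cond_b)
# Different conditions produce different effective weights,
# hence different outputs for the same input x.
```

Contrast this with classifier-free guidance or cross-attention conditioning, where the weights stay fixed and only activations are modulated.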