Efficient AI Computing,
Transforming the Future.

Projects


QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

ArXiv 2024

We introduce QoQ, a W4A8KV4 quantization algorithm that uses 4-bit weights, 8-bit activations, and a 4-bit KV cache, and implement the QServe inference library. Compared with TensorRT-LLM, the leading industry solution, QServe improves the maximum achievable serving throughput of Llama-3-8B by 1.2× on A100 and 1.4× on L40S, and of Qwen1.5-72B by 2.4× on A100 and 3.5× on L40S.
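To make the W4A8 bit widths concrete, here is a minimal sketch of plain symmetric per-tensor quantization in pure Python. It is not the QoQ algorithm or QServe code. The helper names (`quantize_symmetric`, `dequantize`) and the sample values are hypothetical and chosen only for illustration.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integers of the given bit width.

    Returns (quantized_ints, scale). Generic symmetric per-tensor
    quantization, assumed here for illustration; not the QoQ method.
    """
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the signed range.
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from quantized integers."""
    return [x * scale for x in q]


# Hypothetical sample data.
weights = [0.42, -1.30, 0.07, 0.95]
activations = [2.5, -0.3, 1.1, -4.0]

w_q, w_scale = quantize_symmetric(weights, bits=4)      # "W4"
a_q, a_scale = quantize_symmetric(activations, bits=8)  # "A8"

# Accumulate the dot product in integers, then rescale once at the end.
acc = sum(w * a for w, a in zip(w_q, a_q))
approx = acc * w_scale * a_scale
exact = sum(w * a for w, a in zip(weights, activations))

print("4-bit weights:", w_q)
print("8-bit activations:", a_q)
print("approximate vs exact dot product:", approx, exact)
```

The same helper with `bits=4` would give a toy stand-in for the "KV4" part. Production systems typically use finer-grained scales (per channel or per group) and packed integer storage, which this sketch omits.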

Atomique: A Quantum Compiler for Reconfigurable Neutral Atom Arrays

ISCA 2024 (oral)

We develop a new compiler for the field programmable qubit array (FPQA), an emerging reconfigurable neutral atom array device.

Q-Pilot: Field Programmable Qubit Array Compilation with Flying Ancillas

DAC 2024 (oral)

We develop a compiler for emerging reconfigurable neutral atom array quantum hardware that uses flying ancilla qubits.

DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models

CVPR 2024 (Highlight)

DistriFusion is a training-free algorithm that harnesses multiple GPUs to accelerate diffusion model inference without sacrificing image quality.