MIT HAN Lab

SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

ICLR 2025

Enze Xie¹*, Junsong Chen¹*, Junyu Chen²³, Han Cai¹, Haotian Tang², Yujun Lin², Zhekai Zhang², Muyang Li², Ligeng Zhu¹, Yao Lu¹, Song Han¹²

Efficient AI Algorithm

Generative AI

New Architecture

Quantization

We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replaced T5 with modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment. (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. As a result, Sana-0.6B is very competitive with modern giant diffusion model (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 resolution image. Sana enables content creation at low cost. Code and model will be publicly released.

More Close

SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

ICLR 2025

(

oral

)

SANA is an efficient linear DiT that can generate images up to 4096 × 4096. SANA delivers: 20x smaller & 100x faster than FLUX; Deployable on laptop GPU; Top-notch GenEval & DPGBench results.

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

ICLR 2025

Guangxuan Xiao¹, Jiaming Tang¹, Jingwei Zuo², Junxian Guo¹ ³, Shang Yang¹, Haotian Tang¹, Yao Fu⁴, Song Han¹ ⁵

Efficient AI Algorithm

Large Language Models (LLMs)

New Architecture

Pruning & Sparsity

Deploying long-context large language models (LLMs) is essential but poses significant computational and memory challenges. Caching all Key and Value (KV) states across all attention heads consumes substantial memory. Existing KV cache pruning methods either damage the long-context capabilities of LLMs or offer only limited efficiency improvements. In this paper, we identify that only a fraction of attention heads, a.k.a, Retrieval Heads, are critical for processing long contexts and require full attention across all tokens. In contrast, all other heads, which primarily focus on recent tokens and attention sinks, referred to as Streaming Heads, do not require full attention. Based on this insight, we introduce DuoAttention, a framework that only applies a full KV cache to retrieval heads while using a light-weight, constant-length KV cache for streaming heads, which reduces both LLM's decoding and pre-filling memory and latency without compromising its long-context abilities. DuoAttention uses a lightweight, optimization-based algorithm with synthetic data to identify retrieval heads accurately. Our method significantly reduces long-context inference memory by up to 2.55x for MHA and 1.67x for GQA models while speeding up decoding by up to 2.18x and 1.50x and accelerating pre-filling by up to 1.73x and 1.63x for MHA and GQA models, respectively, with minimal accuracy loss compared to full attention. Notably, combined with quantization, DuoAttention enables Llama-3-8B decoding with 3.3 million context length on a single A100 GPU.

More Close

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

ICLR 2025

(

)

By selectively applying full attention to critical attention heads and using "Streaming Attention" on others, DuoAttention significantly reduces both pre-filling and decoding memory usage and latency for long-context LLMs, while maintaining their long-context capabilities.

Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models

ICLR 2025

Junyu Chen*, Han Cai*, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, Song Han

Efficient AI Algorithm

Generative AI

New Architecture

We present Deep Compression Autoencoder (DC-AE), a new family of autoencoder models for accelerating high-resolution diffusion models. Existing autoencoder models have demonstrated impressive results at a moderate spatial compression ratio (e.g., 8x), but fail to maintain satisfactory reconstruction accuracy for high spatial compression ratios (e.g., 64x). We address this challenge by introducing two key techniques: (1) Residual Autoencoding, where we design our models to learn residuals based on the space-to-channel transformed features to alleviate the optimization difficulty of high spatial-compression autoencoders; (2) Decoupled High-Resolution Adaptation, an efficient decoupled three-phases training strategy for mitigating the generalization penalty of high spatial-compression autoencoders. With these designs, we improve the autoencoder's spatial compression ratio up to 128 while maintaining the reconstruction quality. Applying our DC-AE to latent diffusion models, we achieve significant speedup without accuracy drop. For example, on ImageNet 512x512, our DC-AE provides 19.1x inference speedup and 17.9x training speedup on H100 GPU for UViT-H while achieving a better FID, compared with the widely used SD-VAE-f8 autoencoder.

More Close

Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models

ICLR 2025

(

)

A new family of high-spatial compression autoencoders for accelerating high-resolution diffusion models.

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

ICLR 2025

Yukang Chen¹*, Fuzhao Xue¹*, Dacheng Li¹³*, Qinghao Hu²*, Ligeng Zhu¹, Xiuyu Li¹³, Yunhao Fang¹, Haotian Tang¹², Shang Yang¹², Zhijian Liu¹, Ethan He¹, Hongxu Yin¹, Pavlo Molchanov¹, Jan Kautz¹, Linxi Fan¹, Yuke Zhu¹⁴, Yao Lu¹, Song Han¹²

Efficient AI Hardware & System

Training

Generative AI

Large Language Models (LLMs)

New Architecture

Long-context capability is critical for multi-modal foundation models, especially for long video understanding. We introduce LongVILA, a full-stack solution for long-context visual-language models by co-designing the algorithm and system. For model training, we upgrade existing VLMs to support long video understanding by incorporating two additional stages, i.e., long context extension and long video supervised fine-tuning. However, training on long video is computationally and memory intensive. We introduce the long-context Multi-Modal Sequence Parallelism (MM-SP) system that efficiently parallelizes long video training and inference, enabling 2M context length training on 256 GPUs without any gradient checkpointing. LongVILA efficiently extends the number of video frames of VILA from 8 to 2048, achieving 99.8% accuracy in 6,000-frame (more than 1 million tokens) video needle-in-a-haystack. LongVILA-7B demonstrates strong accuracy on 9 popular video benchmarks, e.g. 65.1% VideoMME with subtitle. Besides, MM-SP is 2.1x - 5.7x faster than ring style sequence parallelism and 1.1x - 1.4x faster than Megatron with a hybrid context and tensor parallelism. Moreover, it seamlessly integrates with Hugging Face Transformers.

More Close

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

ICLR 2025

(

)

LongVILA is a full-stack solution for long VLM, which incorporates novel training strategies, dataset, and the Multi-Modal Sequence Parallelism (MM-SP) system to efficiently handle long video understanding, achieving significant scalability, accuracy, and speed improvements on multi-modal benchmarks.

Efficient AI Computing,
Transforming the Future.

Who We Are

Highlights

We Work On

News

Our Full-Stack Projects

SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Our Impacts

Featured Videos

Latest Blog Posts

SVDQuant Meets NVFP4: 4× Smaller and 3× Faster FLUX with 16-bit Quality on NVIDIA Blackwell GPUs

RTX 5090 Workstation Configuration Journey

TinyChat 2.0: Accelerating Edge AI with Efficient LLM and VLM Deployment

SVDQuant: Accurate 4-Bit Quantization Powers 12B FLUX on a 16GB 4090 Laptop with 3x Speedup

Block Sparse Attention

Patch Conv: Patch Convolution to Avoid Large GPU Memory Usage of Conv2D

Efficient AI Computing,Transforming the Future.

Who We Are

Highlights

We Work On

News

Our Full-Stack Projects

SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models

Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Our Impacts

Featured Videos

Latest Blog Posts

SVDQuant Meets NVFP4: 4× Smaller and 3× Faster FLUX with 16-bit Quality on NVIDIA Blackwell GPUs

RTX 5090 Workstation Configuration Journey

TinyChat 2.0: Accelerating Edge AI with Efficient LLM and VLM Deployment

SVDQuant: Accurate 4-Bit Quantization Powers 12B FLUX on a 16GB 4090 Laptop with 3x Speedup

Block Sparse Attention

Patch Conv: Patch Convolution to Avoid Large GPU Memory Usage of Conv2D

Efficient AI Computing,
Transforming the Future.