We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana synthesizes high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed and can be deployed on a laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we train an AE that compresses images 32×, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replace T5 with a modern decoder-only small LLM as the text encoder and design complex human instructions with in-context learning to enhance image-text alignment. (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, together with efficient caption labeling and selection to accelerate convergence. As a result, Sana-0.6B is very competitive with modern giant diffusion models (e.g., Flux-12B), while being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 resolution image. Sana enables content creation at low cost. Code and model will be publicly released.
• Deep Compression Autoencoder: We introduce a new Deep Compression Autoencoder (DC-AE) that aggressively increases the scaling factor to 32. Compared with AE-F8, our AE-F32 outputs 16× fewer latent tokens, which is crucial for efficient training and for generating ultra-high-resolution images, such as 4K resolution.
• Efficient Linear DiT: We introduce a new linear DiT that replaces vanilla quadratic attention, reducing complexity from O(N²) to O(N). Mix-FFN, with a 3×3 depth-wise convolution in the MLP, enhances the local information of tokens. Linear attention achieves results comparable to vanilla attention while improving 4K generation latency by 1.7×. Mix-FFN also removes the need for positional encoding (NoPE) without quality loss, marking the first DiT without positional embedding.
• Decoder-only Small LLM as Text Encoder: We use Gemma, a decoder-only LLM, as the text encoder to enhance understanding of and reasoning about prompts. Compared with CLIP or T5, Gemma offers superior text comprehension and instruction-following. We address training instability and design complex human instructions (CHI) to leverage Gemma’s in-context learning, improving image-text alignment.
• Efficient Training and Inference Strategy: We propose automatic labeling and training strategies to improve text-image consistency. Multiple VLMs generate diverse re-captions, and a CLIPScore-based strategy selects high-CLIPScore captions to improve convergence and alignment. Additionally, compared with the Flow-Euler-Solver, our Flow-DPM-Solver reduces inference steps from 28-50 to 14-20 while achieving better performance.
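To make the first two bullets concrete, the following is a minimal Python sketch, not the authors' implementation. It counts latent positions for scaling factors 8 and 32, and it contrasts a quadratic attention with a linear-time reordering of the same computation. The ReLU feature map, the `eps` stabilizer, the absence of any patching step, and all function names (`latent_tokens`, `quadratic_attention`, `linear_attention`) are assumptions chosen for illustration; the text above only states that vanilla attention is replaced by linear attention.

```python
def latent_tokens(height, width, factor):
    """Latent positions for an autoencoder that down-samples each side by `factor`.

    Assumes one token per latent position (no extra patching), which is an
    illustrative simplification.
    """
    return (height // factor) * (width // factor)


def relu(rows):
    """Element-wise ReLU on a list-of-lists matrix."""
    return [[max(x, 0.0) for x in row] for row in rows]


def quadratic_attention(q, k, v, eps=1e-6):
    """Reference form: build all N x N pairwise weights, so cost grows as O(N^2)."""
    q, k = relu(q), relu(k)
    out = []
    for qi in q:
        # One weight per (query, key) pair: N weights for each of the N queries.
        weights = [sum(a * b for a, b in zip(qi, kj)) for kj in k]
        norm = sum(weights) + eps
        out.append([
            sum(w * vj[c] for w, vj in zip(weights, v)) / norm
            for c in range(len(v[0]))
        ])
    return out


def linear_attention(q, k, v, eps=1e-6):
    """Same result, reordered: summarize keys and values once, so cost grows as O(N)."""
    q, k = relu(q), relu(k)
    d, dv = len(k[0]), len(v[0])
    # kv[a][c] = sum_j k[j][a] * v[j][c]; a (d x dv) summary shared by all queries.
    kv = [[sum(kj[a] * vj[c] for kj, vj in zip(k, v)) for c in range(dv)]
          for a in range(d)]
    # k_sum[a] = sum_j k[j][a]; used for the normalizer.
    k_sum = [sum(kj[a] for kj in k) for a in range(d)]
    out = []
    for qi in q:
        norm = sum(qa * ka for qa, ka in zip(qi, k_sum)) + eps
        out.append([sum(qi[a] * kv[a][c] for a in range(d)) / norm
                    for c in range(dv)])
    return out


if __name__ == "__main__":
    for side in (1024, 4096):
        print(side, latent_tokens(side, side, 8), latent_tokens(side, side, 32))
```

Under these assumptions, a 4096×4096 image maps to 512×512 = 262,144 latent positions at factor 8 and 128×128 = 16,384 at factor 32, which is the 16× reduction quoted in the first bullet. The two attention functions return the same values up to floating-point error; only the order of the multiplications differs, and that reordering is what removes the N×N weight matrix.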
We compare Sana with the most advanced text-to-image diffusion models in Table 1. For 512 × 512 resolution, Sana-0.6B demonstrates a throughput that is 5× faster than PixArt-Σ, which has a similar model size, and significantly outperforms it in FID, CLIP Score, GenEval, and DPG-Bench. For 1024 × 1024 resolution, Sana is considerably stronger than most models with <3B parameters and excels in inference latency. Our models achieve competitive performance even when compared to the most advanced large model, FLUX-dev. For instance, while accuracy is equivalent on DPG-Bench and slightly lower on GenEval, Sana-0.6B’s throughput is 39× faster, and Sana-1.6B is 23× faster.
@misc{xie2024sana,
title={Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer},
author={Enze Xie and Junsong Chen and Junyu Chen and Han Cai and Haotian Tang and Yujun Lin and Zhekai Zhang and Muyang Li and Ligeng Zhu and Yao Lu and Song Han},
year={2024},
eprint={2410.10629},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.10629},
}
We would like to thank Shuchen Xue from UCAS, Cheng Lu from OpenAI, Jincheng Yu from HKUST, and Chongjian Ge from HKU for discussions and data collection.