Projects

LEGO: Spatial Accelerator Generation and Optimization for Tensor Applications

HPCA 2025

Yujun Lin*, Zhekai Zhang*, Song Han

Modern tensor applications, especially foundation models and generative AI applications, require multiple input modalities (both vision and language), which increases the demand for flexible accelerator architectures. Existing frameworks suffer from a trade-off between design flexibility and the productivity of RTL generation: they are either limited to very few hand-written templates or cannot automatically generate the RTL. To address this challenge, we propose the LEGO framework, which targets tensor applications, automatically generates spatial architecture designs, and outputs synthesizable RTL code without handwritten RTL design templates. Leveraging an affine-transformation-based architecture representation, the LEGO front end finds interconnections between function units, synthesizes the memory system, and fuses different spatial dataflow designs based on data reuse analysis. The LEGO back end then translates the hardware into a primitive-level graph to perform lower-level optimizations, and applies a set of linear-programming algorithms to optimally insert pipeline registers and reduce the overhead of unused logic when switching spatial dataflows. Our evaluation demonstrates that LEGO can achieve 3.2× speedup and 2.4× energy efficiency compared to the previous work Gemmini, and can generate one architecture for diverse modern foundation models in generative AI applications.
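To make the idea of affine data reuse analysis concrete, here is a minimal sketch. It is a generic textbook-style illustration, not LEGO's internal representation: the matrix multiplication loop nest, the `ACCESS` table, and the function names are all assumptions chosen for the example. Each tensor access is written as an integer matrix applied to the loop indices (i, j, k); a tensor is reused along a loop direction when that direction maps to the zero vector, meaning neighboring iterations touch the same element.

```python
# Illustrative only: a generic affine data-reuse check, not LEGO's actual code.

def matvec(matrix, vec):
    """Multiply an integer matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

# Assumed example loop nest: C[i, j] += A[i, k] * B[k, j], loop order (i, j, k).
ACCESS = {
    "A": [[1, 0, 0], [0, 0, 1]],  # A[i, k]
    "B": [[0, 0, 1], [0, 1, 0]],  # B[k, j]
    "C": [[1, 0, 0], [0, 1, 0]],  # C[i, j]
}

def reused_tensors(direction):
    """Tensors whose accessed element is unchanged when moving by `direction`."""
    return sorted(
        name for name, matrix in ACCESS.items()
        if not any(matvec(matrix, direction))
    )

for step in ([1, 0, 0], [0, 1, 0], [0, 0, 1]):
    print(step, reused_tensors(step))
```

In this toy setting, stepping along i reuses B, stepping along j reuses A, and stepping along k reuses C, which is the kind of information a generator could use to decide where function units can share or forward data.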


LEGO is an automatic RTL generator for AI accelerators.

Sparse Refinement for Efficient High-Resolution Semantic Segmentation

ECCV 2024

Zhijian Liu*, Zhuoyang Zhang*, Samir Khaki, Shang Yang, Haotian Tang, Chenfeng Xu, Kurt Keutzer, Song Han

Semantic segmentation empowers numerous real-world applications, such as autonomous driving and augmented/mixed reality. These applications often operate on high-resolution images (e.g., 8 megapixels) to capture fine details. However, this comes at the cost of considerable computational complexity, hindering deployment in latency-sensitive scenarios. In this paper, we introduce SparseRefine, a novel approach that enhances dense low-resolution predictions with sparse high-resolution refinements. Based on coarse low-resolution outputs, SparseRefine first uses an entropy selector to identify a sparse set of pixels with high entropy. It then employs a sparse feature extractor to efficiently generate the refinements for those pixels of interest. Finally, it leverages a gated ensembler to apply these sparse refinements to the initial coarse predictions. SparseRefine can be seamlessly integrated into any existing semantic segmentation model, whether CNN- or ViT-based. SparseRefine achieves significant speedup: 1.5 to 3.7 times when applied to HRNet-W48, SegFormer-B5, Mask2Former-T/L and SegNeXt-L on Cityscapes, with negligible to no loss of accuracy. Our "dense+sparse" paradigm paves the way for efficient high-resolution visual computing.
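The entropy-selection step can be sketched as follows. This is a minimal illustration under stated assumptions, not the SparseRefine implementation: the threshold value, the toy logit map, and the scalar `gated_blend` helper (a stand-in for the gated ensembler, whose actual form is not described here) are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_uncertain_pixels(logit_map, threshold):
    """Return (row, col) of pixels whose prediction entropy exceeds threshold."""
    selected = []
    for r, row in enumerate(logit_map):
        for c, logits in enumerate(row):
            if entropy(softmax(logits)) > threshold:
                selected.append((r, c))
    return selected

def gated_blend(coarse, refined, gate):
    """Hypothetical scalar gate in [0, 1]: 0 keeps coarse, 1 takes refined."""
    return [(1.0 - gate) * c + gate * r for c, r in zip(coarse, refined)]

# Toy 2x2 coarse prediction with 3 classes per pixel (assumed values).
coarse_map = [
    [[10.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.0, 10.0, 0.0]],
]
uncertain = select_uncertain_pixels(coarse_map, threshold=0.5)
print(uncertain)  # [(0, 1), (1, 0)]
```

Only the two pixels with flat (maximally uncertain) logits are selected, so a high-resolution refinement pass would need to process just those positions before blending its output back into the coarse prediction.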


SparseRefine is a novel approach that enhances dense low-resolution predictions with sparse high-resolution refinements. It achieves significant speedup: 1.5 to 3.7 times when applied to HRNet-W48, SegFormer-B5, Mask2Former-T/L and SegNeXt-L on Cityscapes, with negligible to no loss of accuracy.

FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention

International Journal of Computer Vision 2024

Guangxuan Xiao*, Tianwei Yin*, William T. Freeman, Frédo Durand, Song Han

Diffusion models excel at text-to-image generation, especially in subject-driven generation for personalized images. However, existing methods are inefficient because they require subject-specific fine-tuning, which is computationally intensive and hampers efficient deployment. Moreover, existing methods struggle with multi-subject generation as they often blend features among subjects. We present FastComposer, which enables efficient, personalized, multi-subject text-to-image generation without fine-tuning. FastComposer uses subject embeddings extracted by an image encoder to augment the generic text conditioning in diffusion models, enabling personalized image generation based on subject images and textual instructions with only forward passes. To address the identity blending problem in multi-subject generation, FastComposer proposes cross-attention localization supervision during training, which enforces that the attention of reference subjects is localized to the correct regions in the target images. Naively conditioning on subject embeddings results in subject overfitting, so FastComposer proposes delayed subject conditioning in the denoising step to maintain both identity and editability in subject-driven image generation. FastComposer generates images of multiple unseen individuals with different styles, actions, and contexts. It achieves 300x-2500x speedup compared to fine-tuning-based methods and requires zero extra storage for new subjects. FastComposer paves the way for efficient, personalized, and high-quality multi-subject image creation.
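One way to picture delayed subject conditioning is as a per-step schedule over the denoising loop. The sketch below is an assumption-laden illustration, not FastComposer's code: it assumes the early steps use text-only conditioning and the later steps switch to subject-augmented conditioning, and the function name, the `delay_ratio` parameter, and the string labels are hypothetical.

```python
def conditioning_schedule(num_steps, delay_ratio):
    """Choose the conditioning type for each denoising step.

    Assumption: the first `delay_ratio` fraction of steps uses text-only
    conditioning; the remaining steps add the subject embeddings.
    """
    if not 0.0 <= delay_ratio <= 1.0:
        raise ValueError("delay_ratio must be in [0, 1]")
    switch_step = int(num_steps * delay_ratio)
    return [
        "text_only" if step < switch_step else "text_plus_subject"
        for step in range(num_steps)
    ]

schedule = conditioning_schedule(num_steps=6, delay_ratio=0.5)
print(schedule)
```

With these example values the first three steps are text-only and the last three are subject-augmented. A larger `delay_ratio` would give the text prompt more influence, while a smaller one would give the subject embeddings more influence.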


We present FastComposer, which enables efficient, personalized, multi-subject text-to-image generation without fine-tuning.

Atomique: A Quantum Compiler for Reconfigurable Neutral Atom Arrays

ISCA 2024

Hanrui Wang, Pengyu Liu, Daniel Bochen Tan, Yilian Liu, Jiaqi Gu, David Z. Pan, Jason Cong, Umut A. Acar, Song Han

The neutral atom array has gained prominence in quantum computing for its scalability and operation fidelity. Previous works focus on fixed atom arrays (FAAs) that require extensive SWAP operations for long-range interactions. This work explores a novel architecture, reconfigurable atom arrays (RAAs), also known as field programmable qubit arrays (FPQAs), which allows for coherent atom movements during circuit execution under some constraints. Such atom movements, which are unique to this architecture, could reduce the cost of long-range interactions significantly if they are scheduled strategically. In this work, we introduce Atomique, a compilation framework designed for qubit mapping, atom movement, and gate scheduling for RAAs. Atomique contains a qubit-array mapper to decide the coarse-grained mapping of the qubits to arrays, leveraging MAX k-Cut on a constructed gate frequency graph to minimize SWAP overhead. Subsequently, a qubit-atom mapper determines the fine-grained mapping of qubits to specific atoms in the array and considers load balance to prevent hardware constraint violations. We further propose a router that identifies parallel gates, schedules them simultaneously, and reduces depth. We evaluate Atomique across 20+ diverse benchmarks, including generic circuits (arbitrary, QASMBench, SupermarQ), quantum simulation, and QAOA circuits. Atomique consistently outperforms IBM Superconducting, FAA with long-range gates, and FAA with rectangular and triangular topologies, achieving significant reductions in depth and the number of two-qubit gates.
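The coarse-grained mapping step can be illustrated with a simple greedy heuristic for MAX k-Cut on a gate frequency graph. This is a sketch under stated assumptions, not Atomique's actual mapper: the greedy strategy, the function names, and the toy gate counts are all hypothetical. Nodes are qubits, edge weights count two-qubit gates between a pair, and an edge is "cut" when its two qubits are placed in different arrays.

```python
def greedy_max_k_cut(num_qubits, gate_counts, k):
    """Greedily assign each qubit to one of k arrays to maximize cut weight.

    gate_counts maps a qubit pair (a, b) to the number of two-qubit gates
    between them. This is a simple heuristic, not an optimal solver.
    """
    weight = {}
    for (a, b), count in gate_counts.items():
        weight[(a, b)] = weight.get((a, b), 0) + count
        weight[(b, a)] = weight.get((b, a), 0) + count

    assignment = {}
    for q in range(num_qubits):
        # Weight that would stay uncut if q joined each array.
        inside = [0] * k
        for other, part in assignment.items():
            inside[part] += weight.get((q, other), 0)
        assignment[q] = inside.index(min(inside))
    return assignment

def cut_weight(assignment, gate_counts):
    """Total gate count on pairs whose qubits sit in different arrays."""
    return sum(
        count for (a, b), count in gate_counts.items()
        if assignment[a] != assignment[b]
    )

# Toy circuit: 4 qubits, gate frequencies per pair (assumed values).
gates = {(0, 1): 5, (2, 3): 4, (0, 2): 1}
mapping = greedy_max_k_cut(num_qubits=4, gate_counts=gates, k=2)
print(mapping, cut_weight(mapping, gates))  # {0: 0, 1: 1, 2: 1, 3: 0} 10
```

In this toy case every gate ends up between qubits in different arrays, so the cut weight equals the total gate count of 10. The sketch ignores load balance, which the abstract says is handled by the separate qubit-atom mapper.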

ISCA 2024 (oral)

We develop a new compiler for the emerging reconfigurable neutral atom array (FPQA) device.

Efficient AI Computing, Transforming the Future.
