Pruning & Sparsity

Projects

XAttention: Block Sparse Attention with Antidiagonal Scoring

ICML 2025

A plug-and-play method that uses antidiagonal sums to efficiently identify important parts of the attention matrix, achieving up to 13.5x speedup on long-context tasks with comparable accuracy to full attention.

Radial Attention: O(nlogn) Sparse Attention with Energy Decay for Long Video Generation

ArXiv 2025

(

)

A O(nlogn) Sparse Attention Mask for Long Video Generation

LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

MLSys 2025

(

)

LServe accelerates long-sequence LLM serving with unified sparse attention for both prefilling and decoding, achieving up to 3.3× speedup over state-of-the-art solution without sacrificing accuracy.

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

ICLR 2025

(

)

By selectively applying full attention to critical attention heads and using "Streaming Attention" on others, DuoAttention significantly reduces both pre-filling and decoding memory usage and latency for long-context LLMs, while maintaining their long-context capabilities.

Sparse Refinement for Efficient High-Resolution Semantic Segmentation

ECCV 2024

(

)

SparseRefine is a novel approach that enhances dense low-resolution predictions with sparse high-resolution refinements. It achieves significant speedup: 1.5 to 3.7 times when applied to HRNet-W48, SegFormer-B5, Mask2Former-T/L and SegNeXt-L on Cityscapes, with negligible to no loss of accuracy.

DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models

A training-free algorithm to harness multiple GPUs to accelerate diffusion model inference without sacrificing image quality.

Blog Posts

Radial Attention: O(nlogn) Sparse Attention for Long Video Generation with 2–4× Speedups in Training and Inference

July 3, 2025

A sparse attention mechanism with O(nlogn) computational complexity for long video generation. It can speed up both training and inference by 2–4×. The code is available at https://github.com/mit-han-lab/radial-attention

Block Sparse Attention

October 10, 2024

We introduce Block Sparse Attention, a library of sparse attention kernels that supports various sparse patterns, including streaming attention with token granularity, streaming attention with block granularity, and block-sparse attention. By incorporating these patterns, Block Sparse Attention can significantly reduce the computational costs of LLMs, thereby enhancing their efficiency and scalability. We release the implementation of Block Sparse Attention, which is modified based on FlashAttention 2.4.2.

Patch Conv: Patch Convolution to Avoid Large GPU Memory Usage of Conv2D

March 10, 2024

In this blog, we introduce Patch Conv to reduce memory footprint when generating high-resolution images. PatchConv significantly cuts down the memory usage by over 2.4× compared to existing PyTorch implementation. Code: https://github.com/mit-han-lab/patch_conv

DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models

February 29, 2024

In this blog, we introduce DistriFusion, a training-free algorithm to harness multiple GPUs to accelerate diffusion model inference without sacrificing image quality. It can reduce SDXL latency by up to 6.1× on 8 A100s. Our work has been accepted by CVPR 2024 as a highlight. Code: https://github.com/mit-han-lab/distrifusion

On-Device Training Under 256KB Memory

November 28, 2022

In MCUNetV3, we enable on-device training under 256KB SRAM and 1MB Flash, using less than 1/1000 memory of PyTorch while matching the accuracy on the visual wake words application. It enables the model to adapt to newly collected sensor data and users can enjoy customized services without uploading the data to the cloud thus protecting privacy.

Topics

Large Language Models (LLMs)

Techniques

Pruning & Sparsity

Data Augmentation

Quantization

Neural Architecture Search (NAS)

Distillation

New Architecture

ML for Hardware & System

Efficient AI Computing,
Transforming the Future.

Pruning & Sparsity

Projects

XAttention: Block Sparse Attention with Antidiagonal Scoring

Radial Attention: O(nlogn) Sparse Attention with Energy Decay for Long Video Generation

LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

Sparse Refinement for Efficient High-Resolution Semantic Segmentation

DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models

Blog Posts

Radial Attention: O(nlogn) Sparse Attention for Long Video Generation with 2–4× Speedups in Training and Inference

Block Sparse Attention

Patch Conv: Patch Convolution to Avoid Large GPU Memory Usage of Conv2D

DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models

On-Device Training Under 256KB Memory

Categories

Topics

Techniques

Efficient AI Computing,Transforming the Future.

Pruning & Sparsity

Projects

XAttention: Block Sparse Attention with Antidiagonal Scoring

Radial Attention: O(nlogn) Sparse Attention with Energy Decay for Long Video Generation

LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

Sparse Refinement for Efficient High-Resolution Semantic Segmentation

DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models

Blog Posts

Radial Attention: O(nlogn) Sparse Attention for Long Video Generation with 2–4× Speedups in Training and Inference

Block Sparse Attention

Patch Conv: Patch Convolution to Avoid Large GPU Memory Usage of Conv2D

DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models

On-Device Training Under 256KB Memory

Categories

Topics

Techniques

Efficient AI Computing,
Transforming the Future.