
SVDQuant Meets NVFP4: 4× Smaller and 3× Faster FLUX with 16-bit Quality on NVIDIA Blackwell GPUs

TL;DR

SVDQuant now supports NVFP4 on NVIDIA Blackwell GPUs, delivering a 3× speedup over BF16 and better image quality than INT4. Try our interactive demo below or at https://svdquant.mit.edu/! All our code is available at https://github.com/mit-han-lab/nunchaku.

With Moore's law slowing down, hardware vendors are shifting toward low-precision inference. NVIDIA's latest Blackwell architecture introduces a new 4-bit floating-point format (NVFP4), improving upon the previous MXFP4 format. NVFP4 features more precise scaling factors and a smaller microscaling group size (16 vs. 32), enabling it to maintain 16-bit model accuracy even at 4-bit precision while delivering 4× higher peak performance.
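To make the microscaling idea concrete, here is a minimal NumPy sketch of group-wise 4-bit fake quantization with the NVFP4 group size of 16. The grid of representable E2M1 magnitudes is real, but keeping the per-group scale in full precision is a simplification: actual NVFP4 stores each scale in FP8 (E4M3), and this snippet is not the nunchaku kernel.

```python
import numpy as np

# Representable magnitudes of an E2M1 (4-bit float) value.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0])

def quantize_fp4_groups(x, group_size=16):
    """Fake-quantize a 1-D tensor with one scale per `group_size` elements.

    Each group's max magnitude is mapped to 6.0 (the E2M1 maximum), then
    every element is snapped to the nearest representable 4-bit value and
    dequantized. Illustrative only: real NVFP4 stores each group scale in
    FP8 (E4M3) rather than full precision.
    """
    g = x.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / 6.0
    scale[scale == 0] = 1.0  # all-zero groups stay zero
    q = g / scale
    # Snap magnitudes to the nearest grid point, then restore signs.
    idx = np.abs(np.abs(q)[..., None] - FP4_GRID).argmin(axis=-1)
    deq = np.sign(q) * FP4_GRID[idx] * scale
    return deq.reshape(x.shape)

x = np.random.RandomState(0).randn(64).astype(np.float32)
xq = quantize_fp4_groups(x)
```

The smaller the group, the tighter each scale fits its 16 elements, which is why NVFP4's group size of 16 loses less accuracy than MXFP4's 32.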

In our previous blog, we shared a tutorial on setting up a 5090 workspace with the Blackwell architecture. In this blog, we’re excited to announce that SVDQuant now supports NVFP4 on the 5090 GPU, delivering better image quality and performance! Our code and demo are all publicly available!

SVDQuant: Absorbing Outliers via Low-Rank Branch

SVDQuant is a new 4-bit quantization paradigm. Unlike traditional methods that redistribute outliers between weights and activations, it employs a lightweight, high-precision low-rank branch to absorb them.

[Figure: Overview of SVDQuant.]

As illustrated in the above figure, we first aggregate the outliers by migrating them from activation \( \hat{\boldsymbol{X}} \) to weight \( \hat{\boldsymbol{W}} \) via smoothing. Then we apply Singular Value Decomposition (SVD) to the updated weight, \( \hat{\boldsymbol{W}} \), decomposing it into a low-rank branch \( \boldsymbol{L}_1 \boldsymbol{L}_2 \) and a residual \( \hat{\boldsymbol{W}} - \boldsymbol{L}_1 \boldsymbol{L}_2 \). The low-rank branch remains in 16-bit precision, while only the residual—now with reduced outliers and lower magnitude—is quantized to 4 bits.
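The decomposition above can be sketched in a few lines of NumPy. This is an illustrative toy, not the nunchaku implementation: `fake_quant` stands in for the real INT4/NVFP4 kernel, and a rank of 32 is just an example setting.

```python
import numpy as np

def svdquant_decompose(W, rank=32):
    """Split weight W into a 16-bit low-rank branch L1 @ L2 plus a 4-bit residual.

    A minimal sketch of the SVDQuant idea: the top-`rank` singular
    components, which carry most of the outlier energy, stay in high
    precision, and only the lower-magnitude residual R = W - L1 @ L2 is
    quantized. `fake_quant` is a stand-in symmetric 4-bit quantizer.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    L1 = U[:, :rank] * S[:rank]    # (out, rank); absorbs the singular values
    L2 = Vt[:rank]                 # (rank, in)
    R = W - L1 @ L2                # residual with reduced magnitude

    def fake_quant(x, bits=4):
        s = np.abs(x).max() / (2 ** (bits - 1) - 1)
        return np.round(x / s) * s

    return L1 @ L2 + fake_quant(R)  # what the forward pass effectively computes

W = np.random.RandomState(0).randn(256, 256).astype(np.float32)
W_hat = svdquant_decompose(W)
err = float(np.abs(W_hat - W).max())
```

Because the residual's magnitude is smaller than the original weight's, the quantization step size shrinks and the reconstruction error drops accordingly.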

Image Quality

| Model | Precision | Image Reward (↑) | LPIPS (↓) | PSNR (↑) |
|---|---|---|---|---|
| FLUX.1-dev | BF16 | 0.953 | | |
| | INT4 | 0.908 | 0.322 | 18.5 |
| | INT4 + SVDQuant | 0.935 | 0.223 | 21.0 |
| | NVFP4 | 0.928 | 0.244 | 20.3 |
| | NVFP4 + SVDQuant | 0.937 | 0.208 | 21.4 |
| FLUX.1-schnell | BF16 | 0.938 | | |
| | INT4 | 0.962 | 0.345 | 16.3 |
| | INT4 + SVDQuant | 0.951 | 0.258 | 18.3 |
| | NVFP4 | 0.957 | 0.280 | 17.5 |
| | NVFP4 + SVDQuant | 0.968 | 0.227 | 19.0 |
| SANA-1.6B | BF16 | 0.952 | | |
| | INT4 | 0.894 | 0.339 | 15.3 |
| | INT4 + SVDQuant | 0.935 | 0.220 | 17.8 |
| | NVFP4 | 0.932 | 0.237 | 17.3 |
| | NVFP4 + SVDQuant | 0.955 | 0.177 | 19.0 |
| PixArt-Sigma | Original | 0.944 | | |
| | INT4 | -1.226 | 0.762 | 9.08 |
| | INT4 + SVDQuant | 0.878 | 0.323 | 17.6 |
| | NVFP4 | 0.660 | 0.517 | 14.8 |
| | NVFP4 + SVDQuant | 0.940 | 0.271 | 18.5 |

The table above compares image quality across various datatypes on four popular text-to-image diffusion models using the MJHQ prompt set. Image Reward assesses overall image quality, while LPIPS and PSNR measure perceptual and numerical similarity between images generated by quantized and original models.
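For reference, PSNR is derived directly from the mean squared error between two images. Below is a minimal sketch assuming an 8-bit pixel range; it illustrates the metric's definition, not the evaluation pipeline behind the table above.

```python
import numpy as np

def psnr(img_a, img_b, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means the images are closer."""
    diff = img_a.astype(np.float64) - img_b.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Two flat 8x8 images differing by a constant 16, so MSE = 256.
a = np.zeros((8, 8))
b = np.full((8, 8), 16.0)
score = psnr(a, b)
```

Note that PSNR grows by 10 dB for every 10× reduction in MSE, so even a one-point gain in the table reflects a substantial drop in numerical error.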

Across all models, NVFP4 outperforms INT4, particularly on the similarity metrics, thanks to Blackwell's native hardware support for the smaller microscaling group size. Additionally, SVDQuant consistently improves upon naive quantization, as its low-rank branch effectively absorbs outliers. Notably, combining SVDQuant with NVFP4 delivers the best results, achieving a PSNR of 21.4 on FLUX.1-dev, closely matching the image quality of the original 16-bit model.

Below, we present qualitative comparisons between different precision settings:

Performance

Both INT4 and NVFP4 SVDQuant compress the 12B FLUX.1 model by 3.6×, reducing memory usage by 3.5×. On both RTX 4090 and 5090, SVDQuant achieves 3× speedups over the original BF16 model, even while retaining 16-bit attention.

Notably, the NVFP4 model on the 5090 runs 1.3× faster than the INT4 model on the 4090, aligning with the theoretical 1.3× performance boost from the RTX 4090 to 5090, as documented in NVIDIA’s specifications. These results are just initial benchmarks, and we are actively working on further kernel optimizations.

Furthermore, both our INT4 and NVFP4 kernels are fully open-source. We welcome you to explore and contribute!

Conclusions

In this blog, we introduced SVDQuant’s support for NVFP4 on NVIDIA 50-series GPUs. In the future, we will continue optimizing our kernels and extending support to more models (e.g., video models) to further benefit the community.