MIT HAN Lab

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

MLSys 2024

Ji Lin*, Jiaming Tang*, Haotian Tang*, Shang Yang*, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han

Efficient AI Hardware & System

Efficient AI Algorithm

TinyML

Large Language Models (LLMs)

Quantization

Large language models (LLMs) have shown excellent performance on various tasks, but the astronomical model size raises the hardware barrier for serving (memory size) and slows down token generation (memory bandwidth). In this paper, we propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. Our method is based on the observation that weights are not equally important: protecting only 1% of salient weights can greatly reduce quantization error. We then propose to search for the optimal per-channel scaling that protects the salient weights by observing the activation, not weights. AWQ does not rely on any backpropagation or reconstruction, so it can well preserve LLMs' generalization ability on different domains and modalities, without overfitting to the calibration set. AWQ outperforms existing work on various language modeling and domain-specific benchmarks. Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. Alongside AWQ, we implement an efficient and flexible inference framework tailored for LLMs on the edge, offering more than 3x speedup over the Huggingface FP16 implementation on both desktop and mobile GPUs. It also democratizes the deployment of the 70B Llama-2 model on mobile GPU (NVIDIA Jetson Orin 64GB).

More Close

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

MLSys 2024

(

)

Low-bit weight-only quantization for LLMs.

LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models

ICLR 2024

Yukang Chen¹, Shengju Qian¹, Haotian Tang², Xin Lai¹, Zhijian Liu², Song Han², Jiaya Jia¹

Efficient AI Algorithm

Training

Generative AI

Large Language Models (LLMs)

New Architecture

Pruning & Sparsity

We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs), with limited computation cost. Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. For example, training on the context length of 8192 needs 16x computational costs in self-attention layers as that of 2048. In this paper, we speed up the context extension of LLMs in two aspects. On the one hand, although dense global attention is needed during inference, fine-tuning the model can be effectively and efficiently done by sparse local attention. The proposed shift short attention effectively enables context extension, leading to non-trivial computation saving with similar performance to fine-tuning with vanilla attention. Particularly, it can be implemented with only two lines of code in training, while being optional in inference. On the other hand, we revisit the parameter-efficient fine-tuning regime for context expansion. Notably, we find that LoRA for context extension works well under the premise of trainable embedding and normalization. LongLoRA demonstrates strong empirical results on various tasks on LLaMA2 models from 7B/13B to 70B. LongLoRA adopts LLaMA2 7B from 4k context to 100k, or LLaMA2 70B to 32k on a single 8x A100 machine. LongLoRA extends models' context while retaining their original architectures, and is compatible with most existing techniques, like FlashAttention-2. In addition, to make LongLoRA practical, we collect a dataset, LongQA, for supervised fine-tuning. It contains more than 3k long context question-answer pairs.

More Close

LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models

ICLR 2024

(

)

LongLoRA takes advantage of shifted sparse attention to greatly reduce the finetuning cost of long context LLMs.

Tiny Machine Learning Projects

NeurIPS 2020/2021/2022, MICRO 2023, ICML 2023, MLSys 2024, IEEE CAS Magazine 2023

Ji Lin, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, Han Cai, Guangxuan Xiao, Haotian Tang, Shang Yang, Yujun Lin, and Song Han

Efficient AI Hardware & System

Efficient AI Algorithm

Training

TinyML

Generative AI

Large Language Models (LLMs)

Quantization

Pruning & Sparsity

Neural Architecture Search (NAS)

Deep neural networks have evolved to be the state-of-the-art technique for many machine learning applications. For example, large language models (LLMs) like GPT and LLaMA are transformative in natural language processing and convolutional neural networks (CNNs) are effective in image recognition, offering excellent performance across a range of tasks. However, these algorithms require massive computational resources and carbon footprint, making them unsuitable for direct deployment on edge devices, such as laptops, mobile phones and microcontrollers (MCUs). There is a critical need to create tiny machine learning (TinyML) that can run efficiently in on-device environments, enabling advanced AI capabilities on personal devices to protect user/data privacy without relying on cloud computing. This project aims to address these issues by innovating model compression techniques as well as high-performance system design for efficient AI computing. In this project, we propose multiple techniques, including AWQ, SmoothQuant, PockEngine, MCUNetV1, MCUNetV2, MCUNetV3, and TinyTL to largely shrink AI model size and to deploy and run efficiently on edge devices.

More Close

Tiny Machine Learning Projects

NeurIPS 2020/2021/2022, MICRO 2023, ICML 2023, MLSys 2024, IEEE CAS Magazine 2023

(

Feature

)

This TinyML project aims to enable efficient AI computing on the edge by innovating model compression techniques as well as high-performance system design.

Tiny Machine Learning: Progress and Futures [Feature]

IEEE CAS magazine

Ji Lin, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, and Song Han

Efficient AI Hardware & System

Efficient AI Algorithm

Training

New Architecture

Quantization

Neural Architecture Search (NAS)

Tiny machine learning (TinyML) is a new frontier of machine learning. By squeezing deep learning models into billions of IoT devices and microcontrollers (MCUs), we expand the scope of AI applications and enable ubiquitous intelligence. However, TinyML is challenging due to the hardware constraints: the tiny memory resource is difficult hold deep learning models designed for cloud and mobile platforms. There is also limited compiler and inference engine support for bare-metal devices. Therefore, we need to co- design the algorithm and system stack to enable TinyML. In this review, we will first discuss the definition, challenges, and applications of TinyML. We then survey the recent progress in TinyML and deep learning on MCUs. Next, we will introduce MCUNet, showing how we can achieve ImageNet-scale AI applications on IoT devices with system-algorithm co-design. We will further ex- tend the solution from inference to training and introduce tiny on-device training techniques. Finally, we present future directions in this area. Today’s “large” model might be tomorrow’s “tiny” model. The scope of TinyML should evolve and adapt over time.

More Close

Tiny Machine Learning: Progress and Futures [Feature]

IEEE CAS magazine

(

feature

)

We discuss the definition, challenges, and applications of TinyML.

Efficient AI Computing,
Transforming the Future.

Who We Are

Highlights

We Work On

News

Our Full-Stack Projects

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models

Tiny Machine Learning Projects

Tiny Machine Learning: Progress and Futures [Feature]

Our Impacts

Featured Videos

Latest Blog Posts

SVDQuant Meets NVFP4: 4× Smaller and 3× Faster FLUX with 16-bit Quality on NVIDIA Blackwell GPUs

RTX 5090 Workstation Configuration Journey

TinyChat 2.0: Accelerating Edge AI with Efficient LLM and VLM Deployment

SVDQuant: Accurate 4-Bit Quantization Powers 12B FLUX on a 16GB 4090 Laptop with 3x Speedup

Block Sparse Attention

Patch Conv: Patch Convolution to Avoid Large GPU Memory Usage of Conv2D

Efficient AI Computing,Transforming the Future.

Who We Are

Highlights

We Work On

News

Our Full-Stack Projects

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models

LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models

Tiny Machine Learning Projects

Tiny Machine Learning Projects

Tiny Machine Learning: Progress and Futures [Feature]

Tiny Machine Learning: Progress and Futures [Feature]

Our Impacts

Featured Videos

Latest Blog Posts

SVDQuant Meets NVFP4: 4× Smaller and 3× Faster FLUX with 16-bit Quality on NVIDIA Blackwell GPUs

RTX 5090 Workstation Configuration Journey

TinyChat 2.0: Accelerating Edge AI with Efficient LLM and VLM Deployment

SVDQuant: Accurate 4-Bit Quantization Powers 12B FLUX on a 16GB 4090 Laptop with 3x Speedup

Block Sparse Attention

Patch Conv: Patch Convolution to Avoid Large GPU Memory Usage of Conv2D

Efficient AI Computing,
Transforming the Future.