LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Yukang Chen¹*, Fuzhao Xue¹*, Dacheng Li¹³*, Qinghao Hu²*, Ligeng Zhu¹, Xiuyu Li¹³, Yunhao Fang¹, Haotian Tang¹², Shang Yang¹², Zhijian Liu¹, Ethan He¹, Hongxu Yin¹, Pavlo Molchanov¹, Jan Kautz¹, Linxi Fan¹, Yuke Zhu¹⁴, Yao Lu¹, Song Han¹²
¹NVIDIA, ²MIT, ³UC Berkeley, ⁴UT Austin
(* indicates equal contribution)


Abstract

Long-context capability is critical for multi-modal foundation models, especially for long video understanding. We introduce LongVILA, a full-stack solution for long-context visual-language models that co-designs the algorithm and system. For model training, we upgrade existing VLMs to support long video understanding by incorporating two additional stages, i.e., long context extension and long video supervised fine-tuning. However, training on long videos is computationally and memory intensive. We introduce the long-context Multi-Modal Sequence Parallelism (MM-SP) system, which efficiently parallelizes long video training and inference and enables training with a 2M-token context length on 256 GPUs without any gradient checkpointing. LongVILA efficiently extends the number of video frames of VILA from 8 to 2048, achieving 99.8% accuracy in the 6,000-frame (more than 1 million tokens) video needle-in-a-haystack test. LongVILA-7B demonstrates strong accuracy on 9 popular video benchmarks, e.g., 65.1% on VideoMME with subtitles. In addition, MM-SP is 2.1x-5.7x faster than ring-style sequence parallelism and 1.1x-1.4x faster than Megatron with hybrid context and tensor parallelism, and it integrates seamlessly with Hugging Face Transformers.

💡 LongVILA Training Strategy

The LongVILA training pipeline consists of five stages. The first three, (1) multi-modal alignment using open-sourced datasets, (2) large-scale pre-training to develop general multi-modal capabilities, and (3) supervised fine-tuning for short video understanding, are the default VILA training stages, which allows pre-trained Vision-Language Models (VLMs) to serve as the foundation for LongVILA. The remaining stages are (4) context extension, continued pre-training on text-only datasets that incrementally extends the model's context length, and (5) long supervised fine-tuning on a dedicated long video dataset using the MM-SP system to strengthen long-context understanding and instruction-following abilities, with all parameters trainable in this final stage.

The LongVILA training pipeline
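
The five-stage recipe above can be viewed as a simple staged schedule. The sketch below illustrates that schedule in plain Python; the Stage record and launch dispatcher are hypothetical placeholders, not actual LongVILA training code, and the data descriptions merely paraphrase the text above.

# Illustrative sketch of the five-stage LongVILA recipe described above.
# `Stage` and `launch` are hypothetical placeholders, not real project APIs.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    data: str   # data source, paraphrased from the description above
    note: str   # key point for the stage

PIPELINE = [
    Stage("1. multi-modal alignment", "open-sourced alignment datasets", "default VILA stage"),
    Stage("2. large-scale pre-training", "general multi-modal corpus", "default VILA stage"),
    Stage("3. short-video SFT", "short video instruction data", "default VILA stage"),
    Stage("4. context extension", "text-only continued pre-training", "extends context length incrementally"),
    Stage("5. long-video SFT", "LongVILA long-video SFT dataset", "all parameters trainable, run on MM-SP"),
]

def launch(stage: Stage) -> None:
    """Placeholder: a real run would start distributed training here."""
    print(f"[{stage.name}] data={stage.data} | {stage.note}")

if __name__ == "__main__":
    for stage in PIPELINE:
        launch(stage)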

💡 LongVILA SFT Dataset

The LongVILA SFT dataset is designed for training models on long video content, consisting of 15,292 videos from the Shot2Story dataset. Each video is segmented into shorter clips, approximately 10 seconds each, which are individually annotated with captions using the VILA-1.5 model. Subsequently, a large language model generates question-and-answer pairs based on the captions, covering a diverse range of queries such as summarization and content-related questions. The dataset includes a broad variety of video categories, such as Travel, Sports, Education, and Entertainment, as well as question types like spatial, temporal, and synopsis-based queries, ensuring comprehensive coverage for long video understanding and supervised fine-tuning.

The pipeline for generating instruction-following data from long videos.
The proportion of question and video categories in our LongVILA SFT dataset.
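
The data-generation procedure above (segment each video into roughly 10-second clips, caption each clip, then ask an LLM for question-answer pairs) can be outlined as follows. This is a minimal sketch under stated assumptions: caption_clip and generate_qa are hypothetical stand-ins for the VILA-1.5 captioner and the question-generating LLM, and clip boundaries here are purely time-based.

# Sketch of the long-video instruction-data pipeline described above:
# segment into ~10-second clips, caption every clip, then turn the
# concatenated captions into question-answer pairs with an LLM.
from typing import List, Tuple

CLIP_SECONDS = 10  # approximate clip length used in the description

def segment(video_duration_s: float, clip_s: int = CLIP_SECONDS) -> List[Tuple[float, float]]:
    """Split a video into consecutive (start, end) windows of ~clip_s seconds."""
    clips, t = [], 0.0
    while t < video_duration_s:
        clips.append((t, min(t + clip_s, video_duration_s)))
        t += clip_s
    return clips

def caption_clip(video_path: str, start: float, end: float) -> str:
    """Placeholder for a VLM captioner (VILA-1.5 in the description above)."""
    return f"<caption of {video_path} from {start:.0f}s to {end:.0f}s>"

def generate_qa(captions: List[str]) -> List[dict]:
    """Placeholder for the LLM that turns clip captions into Q&A pairs
    (summarization, spatial, temporal, synopsis-style questions, ...)."""
    context = " ".join(captions)
    return [{"question": "Summarize the video.", "answer": context[:200]}]

def build_sft_example(video_path: str, duration_s: float) -> dict:
    captions = [caption_clip(video_path, s, e) for s, e in segment(duration_s)]
    return {"video": video_path, "qa": generate_qa(captions)}

if __name__ == "__main__":
    print(build_sft_example("shot2story/example.mp4", duration_s=95.0))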

💡 Multi-Modal Sequence Parallelism System

The Multi-Modal Sequence Parallelism (MM-SP) system is an efficient and scalable solution designed to handle the high memory and computation demands of training long-context Vision-Language Models (VLMs). It employs a novel two-stage sharding strategy: first balancing the workload between image encoding and language modeling stages, and then distributing tokens evenly across GPUs for efficient sequence parallelism. Unlike traditional ring-style or hybrid parallelism approaches, MM-SP optimizes intra-node and inter-node communication by using All-to-All and Point-to-Point strategies, enhancing bandwidth utilization and reducing overhead. This system achieves significant speedups (2.1× to 5.7× compared to ring-style systems) and supports context lengths up to 2 million tokens without gradient checkpointing, seamlessly integrating with frameworks like Hugging Face Transformers for both training and inference of long-context multi-modal tasks.

Workflow of Multi-Modal Sequence Parallelism
Sharding strategy and communication pattern of MM-SP
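
To make the two-stage sharding idea concrete, the sketch below shows, in a single process, (i) balancing the image-encoding workload by spreading frames evenly across ranks and (ii) re-sharding the full multi-modal token sequence into near-equal contiguous chunks for sequence-parallel language modeling. The list-based "ranks" and the specific chunking scheme are illustrative assumptions; the real MM-SP system performs these steps with distributed All-to-All and Point-to-Point communication.

# Single-process illustration of MM-SP's two-stage sharding strategy.
from typing import List

def shard_frames(num_frames: int, world_size: int) -> List[List[int]]:
    """Stage 1: round-robin frame indices so every rank encodes ~equal frames."""
    return [list(range(r, num_frames, world_size)) for r in range(world_size)]

def shard_tokens(tokens: List[int], world_size: int) -> List[List[int]]:
    """Stage 2: split the full token sequence into near-equal contiguous chunks."""
    base, rem = divmod(len(tokens), world_size)
    chunks, start = [], 0
    for r in range(world_size):
        size = base + (1 if r < rem else 0)
        chunks.append(tokens[start:start + size])
        start += size
    return chunks

if __name__ == "__main__":
    world_size = 8
    frame_shards = shard_frames(num_frames=2048, world_size=world_size)
    # Assume each frame yields some number of vision tokens plus text tokens.
    tokens = list(range(2048 * 196 + 1000))
    token_shards = shard_tokens(tokens, world_size)
    print([len(s) for s in frame_shards])   # balanced image-encoding workload
    print([len(s) for s in token_shards])   # balanced sequence-parallel chunks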

💡 Experimental Results

The experimental results demonstrate the effectiveness of LongVILA on long-context vision-language tasks, achieving state-of-the-art performance across nine popular video benchmarks, including VideoMME, ActivityNet-QA, and EgoSchema. LongVILA supports significantly extended context lengths, with training scalability up to 2 million tokens on 256 GPUs, outperforming existing systems such as ring-style sequence parallelism and Megatron-LM in both speed (2.1× to 5.7× faster) and supported sequence length. In the needle-in-a-haystack experiment, LongVILA achieves 99.8% accuracy on 6,000-frame videos, surpassing the baseline models. Additionally, the MM-SP system improves training efficiency and reduces inference latency, delivering a 1.3× speedup and processing sequences 2.5× longer than Megatron. The model performs robustly on both short and long video tasks, solidifying its capability for comprehensive video understanding.

99.8% accuracy in 6,000-frame (more than 1 million tokens) video needle-in-a-haystack
Performance comparison of training systems on 32 H100 GPUs. MM-SP is as scalable as ZigZag-RingAttn, and as efficient as Ulysses
Performance comparisons on 9 video benchmarks


Citation

@inproceedings{longvila,
  title     = {LongVILA: Scaling Long-Context Visual Language Models for Long Videos},
  author    = {Yukang Chen and Fuzhao Xue and Dacheng Li and Qinghao Hu and Ligeng Zhu and Xiuyu Li and Yunhao Fang and Haotian Tang and Shang Yang and Zhijian Liu and Ethan He and Hongxu Yin and Pavlo Molchanov and Jan Kautz and Linxi Fan and Yuke Zhu and Yao Lu and Song Han},
  booktitle = {International Conference on Learning Representations},
  series    = {ICLR '25},
  year      = {2025}
}

