Latest Research Papers
2025-01-22
arXiv
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
VideoLLaMA3, a vision-centric multimodal foundation model, enhances image and video understanding through a four-stage training process that leverages high-quality image-text data. The model's design allows it to encode variable-resolution images and represent videos compactly, leading to superior performance on benchmarks.
In this paper, we propose VideoLLaMA3, a more advanced multimodal foundation
model for image and video understanding. The core design philosophy of
VideoLLaMA3 is vision-centric. The meaning of "vision-centric" is two-fold: the
vision-centric training paradigm and vision-centric framework design. The key
insight of our vision-centric training paradigm is that high-quality image-text
data is crucial for both image and video understanding. Instead of preparing
massive video-text datasets, we focus on constructing large-scale and
high-quality image-text datasets. VideoLLaMA3 has four training stages: 1)
Vision Encoder Adaptation, which enables the vision encoder to accept images of
variable resolution as input; 2) Vision-Language Alignment, which jointly
tunes the vision encoder, projector, and LLM with large-scale image-text data
covering multiple types (including scene images, documents, and charts) as well
as text-only data; 3) Multi-task Fine-tuning, which incorporates image-text SFT
data for downstream tasks and video-text data to establish a foundation for
video understanding; and 4) Video-centric Fine-tuning, which further improves
the model's capability in video understanding. As for the framework design, to
better capture fine-grained details in images, the pretrained vision encoder is
adapted to encode images of varying sizes into a correspondingly variable
number of vision tokens, rather than a fixed number. For video inputs, we
reduce the number of vision tokens according to their similarity, making the
video representation more precise and compact. Benefiting from these
vision-centric designs, VideoLLaMA3 achieves compelling performance on both
image and video understanding benchmarks.
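The two framework ideas above can be sketched in a few lines (a minimal illustration, not the paper's actual implementation: the patch size, the cosine-similarity measure, and the pruning threshold are all assumptions). A variable-resolution encoder yields a token count proportional to the image area, and video tokens can be dropped when they are nearly identical to the same-position token in the previous frame:

```python
import math
import numpy as np

def num_vision_tokens(height, width, patch=14):
    """Token count for a variable-resolution image: one token per patch,
    so larger images get proportionally more tokens (patch=14 is an assumption)."""
    return math.ceil(height / patch) * math.ceil(width / patch)

def prune_video_tokens(frames, threshold=0.9):
    """Keep all tokens of the first frame; for later frames, drop a token
    whose cosine similarity to the same-position token in the previous
    frame exceeds `threshold` (i.e. it adds little new information).

    frames: array of shape (T, N, D) -- T frames, N tokens, D dims.
    Returns a list of (frame_idx, token_idx) pairs for kept tokens.
    """
    T, N, _ = frames.shape
    # L2-normalize so a dot product equals cosine similarity
    normed = frames / np.linalg.norm(frames, axis=-1, keepdims=True)
    kept = [(0, i) for i in range(N)]  # first frame kept in full
    for t in range(1, T):
        sim = np.sum(normed[t] * normed[t - 1], axis=-1)  # (N,) per-token similarity
        for i in np.nonzero(sim < threshold)[0]:
            kept.append((t, int(i)))
    return kept

# A static video (identical frames) collapses to the first frame's tokens:
static = np.tile(np.random.randn(1, 4, 8), (3, 1, 1))
print(len(prune_video_tokens(static)))  # prints 4 (instead of 3 * 4 = 12)
```

Under this sketch, mostly static videos shrink dramatically while fast-changing clips retain most of their tokens, which matches the abstract's goal of a more compact yet precise video representation.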